What Is a Speech-to-Text Service?
A speech-to-text (STT) service, also called automatic speech recognition (ASR), converts audio input into a text transcript. Applications include voice assistants, meeting transcription, call center analytics, accessibility tooling, and real-time captioning. Designing a production STT service involves audio ingestion and chunking, streaming vs batch transcription architectures, speaker diarization, language detection, confidence scoring, and post-processing pipelines.
Requirements
Functional Requirements
- Accept audio input as file upload (batch) or audio stream (real-time)
- Return full transcript with word-level timestamps
- Support speaker diarization: label which speaker said what
- Detect language automatically or accept explicit language hint
- Return confidence scores at word or segment level
- Post-process output: punctuation, capitalization, number normalization, profanity filtering
Non-Functional Requirements
- Streaming latency: first partial transcript within 300ms of speech onset
- Batch throughput: process hours of audio faster than real-time (e.g., 1 hour of audio in under 5 minutes)
- Word error rate (WER): competitive with human transcription for clean audio (<5% WER)
- Multi-language support
- Scalable: handle thousands of concurrent streams
High-Level Architecture
- Audio Ingestion Layer: accepts file uploads or WebSocket/gRPC streams; validates format and sample rate
- Audio Preprocessor: resample to 16kHz mono, apply VAD (voice activity detection) to strip silence, chunk long audio
- ASR Engine: runs acoustic model and language model (e.g., Whisper, Wav2Vec2, or a custom CTC/RNN-T model)
- Diarization Service: identifies and labels speaker turns
- Language Detector: identifies spoken language if not specified
- Post-Processor: adds punctuation, normalizes numbers and dates, applies domain vocabulary corrections
- Results Store: stores transcripts for batch jobs; streams partial results for real-time jobs
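The Audio Preprocessor stage above can be sketched in a few lines. This is a minimal illustration of the data flow only: it downmixes stereo to mono and strips silence with a naive energy threshold. A production service would resample with a proper DSP library and use a trained VAD (e.g., WebRTC VAD or Silero) instead of the `energy_vad` heuristic shown here.

```python
def downmix_to_mono(frames):
    """Average interleaved stereo samples [(left, right), ...] into mono."""
    return [(l + r) / 2.0 for l, r in frames]

def energy_vad(samples, frame_size=160, threshold=0.01):
    """Keep only frames whose mean absolute amplitude exceeds the
    threshold; returns (start_index, frame) pairs for voiced frames.
    A stand-in for a real VAD model."""
    voiced = []
    for i in range(0, len(samples), frame_size):
        frame = samples[i:i + frame_size]
        energy = sum(abs(s) for s in frame) / len(frame)
        if energy >= threshold:
            voiced.append((i, frame))
    return voiced

# 1 second of silence followed by speech-level samples at 16 kHz
# would keep only the second half; here we use tiny frames for brevity.
mono = downmix_to_mono([(0.5, 0.3), (0.0, 0.0)])
```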
Audio Chunking
ASR models have a maximum context window. Long audio must be split into chunks before inference:
- Use VAD (voice activity detection) to find natural silence boundaries for splitting; avoid cutting in the middle of words
- Overlap adjacent chunks by 0.5-1 second to handle words that span a boundary; deduplicate overlapping output in post-processing
- Typical chunk size: 30 seconds (Whisper's native context window); shorter chunks reduce latency but increase per-chunk overhead
- For very long audio (hours), parallelize chunk inference across workers
- Maintain chunk ordering metadata to reassemble the final transcript correctly
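The overlap-and-reassemble logic above can be sketched as pure window arithmetic. This sketch ignores VAD boundary snapping and just produces ordered, overlapping `(start, end)` windows in seconds; a real implementation would nudge each boundary to the nearest silence found by VAD.

```python
def chunk_with_overlap(duration_s, chunk_s=30.0, overlap_s=0.5):
    """Cover [0, duration_s] with ordered windows of chunk_s seconds,
    stepping back overlap_s so words spanning a boundary appear in
    both chunks (deduplicated later in post-processing)."""
    chunks = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        chunks.append((start, end))
        if end >= duration_s:
            break
        start = end - overlap_s  # step back to create the overlap
    return chunks
```

The list index doubles as the ordering metadata needed to reassemble the transcript, since chunks may be transcribed out of order by parallel workers.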
Streaming Transcription
Real-time use cases (voice assistants, live captioning) require partial results as audio is produced:
- Accept audio over WebSocket or gRPC bidirectional stream
- Run inference on rolling windows of audio (e.g., every 200-500ms)
- Return partial (interim) transcripts quickly; mark them as unstable
- Return final (stable) transcripts once enough context has accumulated to confirm a word
- RNN-T (Recurrent Neural Network Transducer) models are preferred for streaming because they produce output tokens incrementally without needing a fixed-length context window
- CTC models with bidirectional encoders need the full utterance before decoding, so they suit batch better than streaming; causal-encoder CTC variants can stream but trade away accuracy. Either way, CTC is simpler to implement than RNN-T
- Buffer audio on the server side; do not trust the client to send perfectly-timed chunks
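The partial/final distinction above can be sketched with a "local agreement" heuristic: a word is finalized once two consecutive hypotheses agree on it. The `decode` callable below is a hypothetical stand-in for a streaming ASR model that returns a word list for the current audio window.

```python
def common_prefix(a, b):
    """Longest common word prefix of two hypotheses."""
    out = []
    for x, y in zip(a, b):
        if x != y:
            break
        out.append(x)
    return out

class StreamingSession:
    """Server-side state for one stream: words that two successive
    hypotheses agree on are emitted as final; the rest stay partial."""

    def __init__(self, decode):
        self.decode = decode      # hypothetical streaming ASR model
        self.prev = []
        self.finalized = []

    def on_window(self, audio_window):
        words = self.decode(audio_window)
        stable = common_prefix(self.prev, words)
        newly_final = stable[len(self.finalized):]
        self.finalized.extend(newly_final)
        self.prev = words
        return {"final": newly_final,
                "partial": words[len(self.finalized):]}
```

Partials may be revised on the next window; finals never change, which is the contract live-captioning clients rely on.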
Speaker Diarization
Diarization answers the question: who spoke when? It is a separate pipeline from transcription:
- Speaker embedding extraction: extract d-vector or x-vector embeddings from short audio segments to represent each speaker
- Clustering: cluster segments by speaker embedding similarity (e.g., spectral clustering or agglomerative hierarchical clustering)
- Segment alignment: align diarization output (speaker labels + timestamps) with ASR output (words + timestamps) to produce speaker-tagged words
- Diarization is typically run as a post-processing step on batch audio; real-time diarization is much harder and often uses an online clustering approach
- Number of speakers can be estimated automatically (using BIC or silhouette score) or specified by the caller
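The clustering step above can be sketched with a greedy online variant (the approach the real-time case typically uses): assign each segment embedding to the closest existing speaker centroid, or open a new speaker if nothing is similar enough. The embeddings would come from a d-vector/x-vector model; here they are plain lists, and the 0.7 threshold is an illustrative assumption.

```python
def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb)

def online_diarize(embeddings, threshold=0.7):
    """Greedy online clustering: returns a speaker label per segment.
    Centroids are running means of assigned embeddings."""
    centroids, counts, labels = [], [], []
    for emb in embeddings:
        best, best_sim = None, threshold
        for i, c in enumerate(centroids):
            sim = cosine(emb, c)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            centroids.append(list(emb))
            counts.append(1)
            labels.append(len(centroids) - 1)
        else:
            n = counts[best]
            centroids[best] = [(c * n + e) / (n + 1)
                               for c, e in zip(centroids[best], emb)]
            counts[best] += 1
            labels.append(best)
    return labels
```

Batch diarization would instead cluster all embeddings at once (spectral or agglomerative), which is why it is more accurate than the online pass shown here.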
Language Detection
When the caller does not specify a language, the service must detect it:
- Run a lightweight language identification model on the first few seconds of audio (e.g., 3-5 seconds is usually sufficient)
- Whisper includes built-in language detection as part of its encoder; this is essentially free if using Whisper
- For mixed-language audio (code-switching), language detection per segment is more appropriate than per-file
- Return detected language and confidence in the API response; allow the caller to override if detection is wrong
- Load the appropriate acoustic model or language model weights based on detected language
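The per-segment detection and aggregation above can be sketched as follows. `lid_model` is a hypothetical language-identification model returning a `{language: probability}` dict per segment; the aggregation is a simple probability sum across segments (a real system might weight by segment duration).

```python
def detect_language(segments, lid_model):
    """Run LID per segment (handles code-switching) and aggregate.
    Returns (per_segment, overall): per-segment (lang, confidence)
    pairs, plus the overall winner with its averaged confidence."""
    per_segment = []
    totals = {}
    for seg in segments:
        probs = lid_model(seg)
        lang = max(probs, key=probs.get)
        per_segment.append((lang, probs[lang]))
        for l, p in probs.items():
            totals[l] = totals.get(l, 0.0) + p
    overall = max(totals, key=totals.get)
    return per_segment, (overall, totals[overall] / len(segments))
```

Returning both granularities lets the API expose a file-level language while still flagging segments where the speaker switched languages.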
Confidence Scoring
Confidence scores tell downstream consumers how much to trust each word or segment:
- CTC and RNN-T models produce per-token probabilities naturally; these can be aggregated to word-level confidence
- Confidence is correlated with audio quality, speaker accent, and vocabulary familiarity
- Low-confidence words can be flagged for human review (useful in call center analytics or medical transcription)
- Calibrate confidence scores against actual accuracy on a validation set; raw model probabilities are often overconfident
- Expose confidence at multiple granularities: per-word, per-segment, and overall transcript
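The aggregation above can be sketched as a geometric mean of per-token probabilities (the length-normalized product, so long words are not penalized), with a conservative minimum for segment level. These scores would still need calibration against a validation set before being exposed.

```python
import math

def word_confidence(token_probs):
    """Geometric mean of per-token probabilities for one word."""
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(log_sum / len(token_probs))

def segment_confidence(word_confs):
    """Minimum word confidence: a deliberately conservative
    segment-level score (one bad word taints the segment)."""
    return min(word_confs)
```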
Post-Processing Pipeline
Raw ASR output is often not directly usable. Post-processing improves readability and accuracy:
- Punctuation and capitalization: ASR models trained on speech do not naturally produce punctuation; use a separate punctuation model or fine-tune on punctuated text
- Number normalization: convert spoken forms to written forms (e.g., "forty two dollars" -> "$42")
- Domain vocabulary: add custom vocabulary (product names, medical terms, proper nouns) that the base model is unlikely to know; inject via biasing/hotword boosting
- Profanity filtering: replace or redact profanity based on caller preference
- Inverse text normalization (ITN): the general framework behind number normalization; also converts spoken-form dates, currency, and abbreviations to their written canonical forms
- PII redaction: detect and mask personal information (phone numbers, SSNs, credit card numbers) in the transcript
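The PII-redaction step above can be sketched with pattern substitution. The regexes below are illustrative assumptions, not a complete taxonomy: production systems layer NER models and checksum validation (e.g., Luhn for card numbers) on top, since regexes alone both over- and under-match.

```python
import re

# Order matters: more specific patterns (SSN) run before broader ones.
PII_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b\d(?:[ -]?\d){12,15}\b"), "[CARD]"),
    (re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"), "[PHONE]"),
]

def redact_pii(text):
    """Replace matched PII spans with type labels in the transcript."""
    for pattern, label in PII_PATTERNS:
        text = pattern.sub(label, text)
    return text
```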
Scaling Streaming Workloads
Streaming ASR is stateful and connection-oriented, which makes scaling harder than stateless batch inference:
- Each active stream ties up a model server connection and its associated state (audio buffer, model hidden state)
- Use connection-aware routing so all chunks of a stream reach the same model server instance: with an L4 load balancer, a long-lived TCP connection naturally pins to one backend; an L7 balancer needs sticky sessions
- Scale model server instances based on the number of active connections, not CPU/GPU utilization alone
- Implement graceful connection migration for rolling deploys: drain existing connections before terminating an instance
- For very large scale (millions of concurrent streams), shard streams across clusters by region or user cohort
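The connection-based scaling rule above can be sketched as a simple autoscaler target. The per-replica capacity and headroom values are illustrative assumptions; headroom leaves room to absorb streams migrating off a draining instance during rolling deploys.

```python
import math

def desired_replicas(active_streams, streams_per_replica=200,
                     headroom=0.2, min_replicas=2):
    """Scale on active connections rather than CPU/GPU alone: each
    replica pins per-stream state (audio buffer, model hidden state),
    so capacity is connection-bound."""
    needed = active_streams * (1 + headroom) / streams_per_replica
    return max(min_replicas, math.ceil(needed))
```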
Failure Modes and Mitigations
- Stream disconnection: buffer the last N seconds on the client; reconnect and resume from last acknowledged timestamp
- Model server overload: shed load with backpressure; return 429 with retry-after header; degrade gracefully by reducing partial result frequency
- Poor audio quality: return low confidence scores; surface warnings about background noise or clipping to the caller
- Unsupported language: detect early and return a clear error rather than producing garbage output
- Very long silence: detect via VAD and close the stream or send a keep-alive; do not run inference on silent audio
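The client-side reconnect-and-resume behavior from the first bullet can be sketched as an acknowledged-send buffer. The class and method names are illustrative; the point is that chunks stay buffered until the server acknowledges their timestamp, so a reconnect can replay only the unacknowledged tail.

```python
from collections import deque

class ResumableUploader:
    """Client-side buffer for stream resume: keep recent chunks until
    acknowledged; on reconnect, resend everything after the last
    acknowledged timestamp. maxlen bounds memory on long outages."""

    def __init__(self, max_buffered=50):
        self.pending = deque(maxlen=max_buffered)
        self.last_acked = -1

    def send(self, timestamp_ms, chunk):
        self.pending.append((timestamp_ms, chunk))
        return (timestamp_ms, chunk)  # would go over the socket here

    def on_ack(self, timestamp_ms):
        """Server confirmed transcription up to this timestamp."""
        self.last_acked = timestamp_ms
        while self.pending and self.pending[0][0] <= timestamp_ms:
            self.pending.popleft()

    def resume_payload(self):
        """Chunks to replay after reconnecting."""
        return [c for t, c in self.pending if t > self.last_acked]
```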
Interview Tips
- Clarify upfront whether the primary use case is streaming or batch; the architecture differs significantly
- Explain audio chunking and VAD before jumping into the ASR model; it shows you understand the preprocessing pipeline
- Mention RNN-T for streaming vs CTC/encoder-decoder for batch; this is a common follow-up question
- Diarization is often asked as a follow-up; know the basic pipeline (embedding, clustering, alignment)
- Discuss confidence calibration and how downstream consumers use confidence scores
- Post-processing is often overlooked; mentioning ITN, punctuation, and custom vocabulary shows production experience