Low Level Design: Speech-to-Text Service

What Is a Speech-to-Text Service?

A speech-to-text (STT) service, also called automatic speech recognition (ASR), converts audio input into a text transcript. Applications include voice assistants, meeting transcription, call center analytics, accessibility tooling, and real-time captioning. Designing a production STT service involves audio ingestion and chunking, streaming vs batch transcription architectures, speaker diarization, language detection, confidence scoring, and post-processing pipelines.

Requirements

Functional Requirements

  • Accept audio input as file upload (batch) or audio stream (real-time)
  • Return full transcript with word-level timestamps
  • Support speaker diarization: label which speaker said what
  • Detect language automatically or accept explicit language hint
  • Return confidence scores at word or segment level
  • Post-process output: punctuation, capitalization, number normalization, profanity filtering

Non-Functional Requirements

  • Streaming latency: first partial transcript within 300ms of speech onset
  • Batch throughput: process hours of audio faster than real-time (e.g., 1 hour of audio in under 5 minutes)
  • Word error rate (WER): competitive with human transcription for clean audio (<5% WER)
  • Multi-language support
  • Scalable: handle thousands of concurrent streams

High-Level Architecture

  • Audio Ingestion Layer: accepts file uploads or WebSocket/gRPC streams; validates format and sample rate
  • Audio Preprocessor: resample to 16kHz mono, apply VAD (voice activity detection) to strip silence, chunk long audio
  • ASR Engine: runs acoustic model and language model (e.g., Whisper, Wav2Vec2, or a custom CTC/RNN-T model)
  • Diarization Service: identifies and labels speaker turns
  • Language Detector: identifies spoken language if not specified
  • Post-Processor: adds punctuation, normalizes numbers and dates, applies domain vocabulary corrections
  • Results Store: stores transcripts for batch jobs; streams partial results for real-time jobs
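The components above can be sketched as a minimal job model. This is a sketch with hypothetical names (`TranscriptSegment`, `TranscriptionJob`); a real service would add job status tracking, persistence, and error handling:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TranscriptSegment:
    text: str
    start: float                     # seconds from audio start
    end: float
    speaker: Optional[str] = None    # filled in later by diarization
    confidence: float = 1.0

@dataclass
class TranscriptionJob:
    """One batch job flowing through the pipeline stages."""
    job_id: str
    language: Optional[str] = None   # None => run language detection
    segments: list = field(default_factory=list)

    def add_segment(self, seg: TranscriptSegment) -> None:
        self.segments.append(seg)

    def full_text(self) -> str:
        # Chunks may finish inference out of order; reassemble by start time.
        ordered = sorted(self.segments, key=lambda s: s.start)
        return " ".join(s.text for s in ordered)
```

Sorting by segment start time is what makes parallel chunk inference safe: workers can return results in any order.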

Audio Chunking

ASR models have a maximum context window. Long audio must be split into chunks before inference:

  • Use VAD (voice activity detection) to find natural silence boundaries for splitting; avoid cutting in the middle of words
  • Overlap adjacent chunks by 0.5-1 second to handle words that span a boundary; deduplicate overlapping output in post-processing
  • Typical chunk size: 30 seconds (Whisper's native context window); shorter chunks reduce latency but increase per-chunk overhead
  • For very long audio (hours), parallelize chunk inference across workers
  • Maintain chunk ordering metadata to reassemble the final transcript correctly
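The chunking rules above can be sketched over toy VAD output. Per-frame speech flags are assumed to come from an upstream VAD; a real implementation operates on sample buffers and tunes the silence search:

```python
def chunk_by_silence(is_speech, frame_dur=0.02, max_chunk=30.0, overlap=0.5):
    """Split audio into (start_sec, end_sec) chunks, preferring to cut at
    silence so words are not split, with `overlap` seconds shared between
    adjacent chunks for boundary deduplication.

    is_speech: per-frame booleans from a VAD (each frame is frame_dur sec).
    """
    max_frames = int(max_chunk / frame_dur)
    overlap_frames = int(overlap / frame_dur)
    n = len(is_speech)
    chunks = []
    start = 0
    while start < n:
        end = min(start + max_frames, n)
        if end < n:
            # Search backwards from the hard limit for a silent frame.
            cut = end
            while cut > start + 1 and is_speech[cut - 1]:
                cut -= 1
            if cut > start + 1:      # found a silence boundary; cut there
                end = cut
        chunks.append((start * frame_dur, end * frame_dur))
        if end >= n:
            break
        start = max(end - overlap_frames, start + 1)
    return chunks
```

If no silence exists inside the window (continuous speech), the chunk falls back to a hard cut at the maximum length, and the overlap region covers the word that spans the boundary.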

Streaming Transcription

Real-time use cases (voice assistants, live captioning) require partial results as audio is produced:

  • Accept audio over WebSocket or gRPC bidirectional stream
  • Run inference on rolling windows of audio (e.g., every 200-500ms)
  • Return partial (interim) transcripts quickly; mark them as unstable
  • Return final (stable) transcripts once enough context has accumulated to confirm a word
  • RNN-T (Recurrent Neural Network Transducer) models are preferred for streaming because they produce output tokens incrementally without needing a fixed-length context window
  • CTC models with bidirectional encoders need the full context window before decoding, so they fit batch workloads better; causal streaming CTC variants exist but typically give up some accuracy, and CTC decoding remains simpler to implement than RNN-T
  • Buffer audio on the server side; do not trust the client to send perfectly-timed chunks
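A server-side session for the loop above might look like the following sketch. The `decode` callback is a stand-in for a real streaming decoder (e.g., an RNN-T inference call) that returns a hypothesis and whether it is stable:

```python
class StreamingSession:
    """Server-side state for one audio stream (a sketch: a real session
    would also keep the model's hidden state and trim the buffer to a
    rolling window rather than holding the whole utterance)."""

    def __init__(self, decode, window_sec=0.5, sample_rate=16000):
        self.decode = decode
        self.buffer = bytearray()
        # 16-bit mono PCM => 2 bytes per sample.
        self.window_bytes = int(window_sec * sample_rate) * 2
        self.finalized = []          # confirmed (stable) transcripts

    def feed(self, pcm_bytes):
        """Buffer client audio; run inference once a full window arrived.
        Returns (text, is_final) or None if there is not enough audio yet."""
        self.buffer.extend(pcm_bytes)
        if len(self.buffer) < self.window_bytes:
            return None              # server-side buffering, per the note above
        text, stable = self.decode(bytes(self.buffer))
        if stable:
            self.finalized.append(text)
            self.buffer.clear()      # start accumulating the next utterance
        return text, stable
```

Partial results are returned with `is_final=False` so the client can display and later overwrite them, matching the interim/final distinction above.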

Speaker Diarization

Diarization answers the question: who spoke when? It is a separate pipeline from transcription:

  • Speaker embedding extraction: extract d-vector or x-vector embeddings from short audio segments to represent each speaker
  • Clustering: cluster segments by speaker embedding similarity (e.g., spectral clustering or agglomerative hierarchical clustering)
  • Segment alignment: align diarization output (speaker labels + timestamps) with ASR output (words + timestamps) to produce speaker-tagged words
  • Diarization is typically run as a post-processing step on batch audio; real-time diarization is much harder and often uses an online clustering approach
  • Number of speakers can be estimated automatically (using BIC or silhouette score) or specified by the caller
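The embedding-clustering step can be illustrated with a greedy online variant, which is roughly what real-time diarization uses; batch systems prefer spectral or agglomerative clustering over the full recording. Embeddings here are plain vectors standing in for d-vectors/x-vectors:

```python
import math

def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def assign_speakers(embeddings, threshold=0.7):
    """Greedy online clustering: each segment joins the most similar
    existing speaker centroid if similarity exceeds `threshold`,
    otherwise it starts a new speaker. Returns one label per segment."""
    centroids, labels = [], []
    for emb in embeddings:
        best, best_sim = None, threshold
        for i, c in enumerate(centroids):
            sim = _cosine(emb, c)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is None:
            centroids.append(list(emb))
            labels.append(len(centroids) - 1)
        else:
            # Update the centroid as a running mean of its segments.
            n = labels.count(best)
            centroids[best] = [(c * n + e) / (n + 1)
                               for c, e in zip(centroids[best], emb)]
            labels.append(best)
    return labels
```

The resulting per-segment labels are then joined against ASR word timestamps in the alignment step.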

Language Detection

When the caller does not specify a language, the service must detect it:

  • Run a lightweight language identification model on the first few seconds of audio (e.g., 3-5 seconds is usually sufficient)
  • Whisper includes built-in language detection as part of its encoder; this is essentially free if using Whisper
  • For mixed-language audio (code-switching), language detection per segment is more appropriate than per-file
  • Return detected language and confidence in the API response; allow the caller to override if detection is wrong
  • Load the appropriate acoustic model or language model weights based on detected language
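The selection logic above is simple enough to sketch directly. The `probs` dict stands in for the output of a language-ID model run on the first few seconds of audio:

```python
def detect_language(probs, override=None, min_conf=0.5):
    """Pick the language for a job.

    probs:    language -> score from a language-ID model (hypothetical).
    override: explicit caller hint, which always wins.
    Returns (language, confidence, needs_review) where needs_review flags
    low-confidence detections for the caller to double-check.
    """
    if override:
        return override, 1.0, False
    lang = max(probs, key=probs.get)
    conf = probs[lang]
    return lang, conf, conf < min_conf
```

For code-switched audio, the same function would be applied per segment rather than once per file.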

Confidence Scoring

Confidence scores tell downstream consumers how much to trust each word or segment:

  • CTC and RNN-T models produce per-token probabilities naturally; these can be aggregated to word-level confidence
  • Confidence is correlated with audio quality, speaker accent, and vocabulary familiarity
  • Low-confidence words can be flagged for human review (useful in call center analytics or medical transcription)
  • Calibrate confidence scores against actual accuracy on a validation set; raw model probabilities are often overconfident
  • Expose confidence at multiple granularities: per-word, per-segment, and overall transcript
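Token-to-word aggregation is often done with a geometric mean, which penalizes a single weak token more than an arithmetic mean would; this is a common heuristic, not the only choice:

```python
import math

def word_confidence(token_probs):
    """Aggregate per-token probabilities into one word-level score
    via the geometric mean (computed in log space for stability)."""
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(log_sum / len(token_probs))

def segment_confidence(word_scores):
    """Plain mean over words; a real system might weight by duration."""
    return sum(word_scores) / len(word_scores)
```

Note these are still raw model scores; per the calibration point above, they should be mapped against measured accuracy before being exposed to callers.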

Post-Processing Pipeline

Raw ASR output is often not directly usable. Post-processing improves readability and accuracy:

  • Punctuation and capitalization: ASR models trained on speech do not naturally produce punctuation; use a separate punctuation model or fine-tune on punctuated text
  • Number normalization: convert spoken forms to written forms (e.g., "forty two dollars" -> "$42")
  • Domain vocabulary: add custom vocabulary (product names, medical terms, proper nouns) that the base model is unlikely to know; inject via biasing/hotword boosting
  • Profanity filtering: replace or redact profanity based on caller preference
  • Inverse text normalization (ITN): the umbrella term for converting spoken-form numbers, dates, and abbreviations to their written canonical forms; commonly implemented with weighted finite-state transducer (WFST) grammars
  • PII redaction: detect and mask personal information (phone numbers, SSNs, credit card numbers) in the transcript
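A toy inverse-text-normalization pass for tens+units numbers shows the shape of the problem; real ITN covers ordinals, dates, and currency, and uses WFST grammars rather than regexes:

```python
import re

# Spoken-number vocabulary for the toy ITN pass below.
_TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
         "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}
_UNITS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
          "six": 6, "seven": 7, "eight": 8, "nine": 9}
_PATTERN = re.compile(
    r"\b(" + "|".join(_TENS) + r")(?:[ -](" + "|".join(_UNITS) + r"))?\b")

def itn_numbers(text):
    """Rewrite spoken tens+units ('forty two', 'twenty-one') to digits."""
    def repl(match):
        value = _TENS[match.group(1)]
        if match.group(2):
            value += _UNITS[match.group(2)]
        return str(value)
    return _PATTERN.sub(repl, text)
```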

Scaling Streaming Workloads

Streaming ASR is stateful and connection-oriented, which makes scaling harder than stateless batch inference:

  • Each active stream ties up a model server connection and its associated state (audio buffer, model hidden state)
  • Use a connection load balancer (e.g., an L4 load balancer with sticky sessions) to route all chunks of a stream to the same model server instance
  • Scale model server instances based on the number of active connections, not CPU/GPU utilization alone
  • Implement graceful connection migration for rolling deploys: drain existing connections before terminating an instance
  • For very large scale (millions of concurrent streams), shard streams across clusters by region or user cohort
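Sticky routing can be implemented with rendezvous (highest-random-weight) hashing, sketched below; server names are placeholders. Every chunk of a stream deterministically lands on the same server, and removing a server only remaps the streams that were on it:

```python
import hashlib

def route_stream(stream_id, servers):
    """Pick the model server for a stream via rendezvous hashing:
    score every (stream, server) pair and take the highest."""
    def score(server):
        digest = hashlib.sha256(f"{stream_id}:{server}".encode()).hexdigest()
        return int(digest, 16)
    return max(servers, key=score)
```

In practice this logic would live in the L4 load balancer or a routing sidecar, but the stability property is the same.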

Failure Modes and Mitigations

  • Stream disconnection: buffer the last N seconds on the client; reconnect and resume from last acknowledged timestamp
  • Model server overload: shed load with backpressure; return HTTP 429 with a Retry-After header; degrade gracefully by reducing partial-result frequency
  • Poor audio quality: return low confidence scores; surface warnings about background noise or clipping to the caller
  • Unsupported language: detect early and return a clear error rather than producing garbage output
  • Very long silence: detect via VAD and close the stream or send a keep-alive; do not run inference on silent audio
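The disconnect-and-resume mitigation above amounts to keeping unacknowledged chunks on the client; a minimal sketch (class and method names are illustrative):

```python
from collections import deque

class ResumableUploader:
    """Client-side buffer: keep recently sent chunks until the server
    acknowledges them, so a dropped stream can resume from the last
    acknowledged timestamp instead of losing audio."""

    def __init__(self):
        self.unacked = deque()       # (timestamp_sec, chunk) pairs in order

    def send(self, ts, chunk):
        self.unacked.append((ts, chunk))
        return ts, chunk             # would go over the wire here

    def ack(self, ts):
        """Server confirmed everything up to and including ts."""
        while self.unacked and self.unacked[0][0] <= ts:
            self.unacked.popleft()

    def resume_from(self):
        """Chunks to replay after reconnecting."""
        return list(self.unacked)
```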

Interview Tips

  • Clarify upfront whether the primary use case is streaming or batch; the architecture differs significantly
  • Explain audio chunking and VAD before jumping into the ASR model; it shows you understand the preprocessing pipeline
  • Mention RNN-T for streaming vs CTC/encoder-decoder for batch; this is a common follow-up question
  • Diarization is often asked as a follow-up; know the basic pipeline (embedding, clustering, alignment)
  • Discuss confidence calibration and how downstream consumers use confidence scores
  • Post-processing is often overlooked; mentioning ITN, punctuation, and custom vocabulary shows production experience
