Low Level Design: Speech-to-Text Service

What Is a Speech-to-Text Service?

A speech-to-text (STT) service, also called automatic speech recognition (ASR), converts audio input into a text transcript. Applications include voice assistants, meeting transcription, call center analytics, accessibility tooling, and real-time captioning. Designing a production STT service involves audio ingestion and chunking, streaming vs batch transcription architectures, speaker diarization, language detection, confidence scoring, and post-processing pipelines.

Requirements

Functional Requirements

  • Accept audio input as file upload (batch) or audio stream (real-time)
  • Return full transcript with word-level timestamps
  • Support speaker diarization: label which speaker said what
  • Detect language automatically or accept explicit language hint
  • Return confidence scores at word or segment level
  • Post-process output: punctuation, capitalization, number normalization, profanity filtering

Non-Functional Requirements

  • Streaming latency: first partial transcript within 300ms of speech onset
  • Batch throughput: process hours of audio faster than real-time (e.g., 1 hour of audio in under 5 minutes, a real-time factor of roughly 0.08 or lower)
  • Word error rate (WER): competitive with human transcription for clean audio (<5% WER)
  • Multi-language support
  • Scalable: handle thousands of concurrent streams
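
The accuracy and throughput targets above can be checked with two small calculations. The sketch below assumes the usual definitions: WER is word-level edit distance divided by the reference word count, and real-time factor is processing time divided by audio duration. The function names are illustrative, and production scoring tools normally apply text normalization first, which is skipped here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Values below 1.0 mean faster than real-time."""
    return processing_seconds / audio_seconds
```

For example, transcribing one hour of audio in five minutes gives real_time_factor(300, 3600), which is about 0.083.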

High-Level Architecture

  • Audio Ingestion Layer: accepts file uploads or WebSocket/gRPC streams; validates format and sample rate
  • Audio Preprocessor: resample to 16kHz mono, apply VAD (voice activity detection) to strip silence, chunk long audio
  • ASR Engine: runs the recognition model (e.g., Whisper, Wav2Vec2, or a custom CTC/RNN-T model), optionally combined with an external language model during decoding
  • Diarization Service: identifies and labels speaker turns
  • Language Detector: identifies spoken language if not specified
  • Post-Processor: adds punctuation, normalizes numbers and dates, applies domain vocabulary corrections
  • Results Store: stores transcripts for batch jobs; streams partial results for real-time jobs

Audio Chunking

ASR models have a maximum context window. Long audio must be split into chunks before inference:

  • Use VAD (voice activity detection) to find natural silence boundaries for splitting; avoid cutting in the middle of words
  • Overlap adjacent chunks by 0.5-1 second to handle words that span a boundary; deduplicate overlapping output in post-processing
  • Typical chunk size: 30 seconds (Whisper's native context window); shorter chunks reduce latency but increase per-chunk overhead
  • For very long audio (hours), parallelize chunk inference across workers
  • Maintain chunk ordering metadata to reassemble the final transcript correctly
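
A minimal sketch of the chunk planning described above, assuming the VAD has already produced a list of silence timestamps in seconds. The Chunk type, the plan_chunks name, and the min_len parameter are hypothetical. Chunks cut at a silence do not overlap; chunks cut at the hard limit overlap the next one so a word spanning the boundary appears in both.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    index: int    # ordering metadata used to reassemble the transcript
    start: float  # seconds
    end: float    # seconds


def plan_chunks(duration: float, silences: list[float], max_len: float = 30.0,
                min_len: float = 5.0, overlap: float = 0.5) -> list[Chunk]:
    """Split [0, duration] into chunks of at most max_len seconds.

    Assumes 0 < min_len and 0 <= overlap < max_len.
    """
    if duration <= 0:
        return []
    cuts = sorted(silences)
    chunks: list[Chunk] = []
    start = 0.0
    while True:
        limit = start + max_len
        if limit >= duration:
            # The remainder fits in one chunk.
            chunks.append(Chunk(len(chunks), start, duration))
            return chunks
        # Prefer the latest silence that still leaves a reasonably long chunk.
        candidates = [s for s in cuts if start + min_len <= s <= limit]
        if candidates:
            end = candidates[-1]
            next_start = end            # clean cut at silence, no overlap
        else:
            end = limit
            next_start = end - overlap  # hard cut, overlap the next chunk
        chunks.append(Chunk(len(chunks), start, end))
        start = next_start
```

Each planned chunk can then be dispatched to a separate worker, with the index field used to restore ordering.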

Streaming Transcription

Real-time use cases (voice assistants, live captioning) require partial results as audio is produced:

  • Accept audio over WebSocket or gRPC bidirectional stream
  • Run inference on rolling windows of audio (e.g., every 200-500ms)
  • Return partial (interim) transcripts quickly; mark them as unstable
  • Return final (stable) transcripts once enough context has accumulated to confirm a word
  • RNN-T (Recurrent Neural Network Transducer) models are preferred for streaming because they produce output tokens incrementally without needing a fixed-length context window
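  • CTC models emit per-frame outputs and can also stream when paired with a causal or chunked encoder; what blocks streaming is a full-context (bidirectional) encoder or an attention-based encoder-decoder such as Whisper, which needs the whole window before decoding. CTC is also simpler to implement than RNN-T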
  • Buffer audio on the server side; do not trust the client to send perfectly-timed chunks
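
One simple way to separate partial from final results is to finalize a word only after two consecutive hypotheses agree on it. The sketch below illustrates that heuristic; it is an assumption made for illustration, not the only stabilization policy, and the TranscriptStabilizer class is hypothetical. It expects the ASR engine to re-decode the current utterance on each rolling window and hand back the full word list.

```python
class TranscriptStabilizer:
    """Finalizes words that two consecutive hypotheses agree on."""

    def __init__(self) -> None:
        self.committed: list[str] = []  # words already sent as final
        self.previous: list[str] = []   # last hypothesis seen

    def update(self, hypothesis: list[str]) -> tuple[list[str], list[str]]:
        """Return (newly_final_words, unstable_words) for the latest hypothesis."""
        # Length of the common prefix between the last two hypotheses.
        agreed = 0
        for old, new in zip(self.previous, hypothesis):
            if old != new:
                break
            agreed += 1
        # Only words beyond what is already committed can become final.
        newly_final = hypothesis[len(self.committed):agreed]
        self.committed.extend(newly_final)
        self.previous = list(hypothesis)
        # Everything after the committed prefix is still subject to revision.
        unstable = hypothesis[len(self.committed):]
        return newly_final, unstable
```

Requiring agreement across more than two hypotheses trades extra latency for fewer retracted words.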

Speaker Diarization

Diarization answers the question: who spoke when? It is a separate pipeline from transcription:

  • Speaker embedding extraction: extract d-vector or x-vector embeddings from short audio segments to represent each speaker
  • Clustering: cluster segments by speaker embedding similarity (e.g., spectral clustering or agglomerative hierarchical clustering)
  • Segment alignment: align diarization output (speaker labels + timestamps) with ASR output (words + timestamps) to produce speaker-tagged words
  • Diarization is typically run as a post-processing step on batch audio; real-time diarization is much harder and often uses an online clustering approach
  • Number of speakers can be estimated automatically (using BIC or silhouette score) or specified by the caller
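
The segment alignment step can be sketched as assigning each word to the speaker turn it overlaps most in time. The Word and Turn types and the assign_speakers function are hypothetical names, and the quadratic scan is kept for clarity; sorted inputs would allow a linear merge.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    start: float  # seconds
    end: float    # seconds


class Turn(NamedTuple):
    speaker: str
    start: float
    end: float


def assign_speakers(words: list[Word], turns: list[Turn]) -> list[tuple[str, str]]:
    """Return (speaker, word) pairs using maximum temporal overlap."""
    tagged = []
    for word in words:
        best_speaker = "unknown"  # used when no turn overlaps the word
        best_overlap = 0.0
        for turn in turns:
            overlap = min(word.end, turn.end) - max(word.start, turn.start)
            if overlap > best_overlap:
                best_speaker, best_overlap = turn.speaker, overlap
        tagged.append((best_speaker, word.text))
    return tagged
```

A word that straddles a speaker change goes to whichever turn covers more of it, which is usually good enough for readable transcripts.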

Language Detection

When the caller does not specify a language, the service must detect it:

  • Run a lightweight language identification model on the first few seconds of audio (e.g., 3-5 seconds is usually sufficient)
  • Whisper includes built-in language detection: after encoding the first 30-second window, its decoder predicts a language token, so detection costs little extra when Whisper is already the ASR engine
  • For mixed-language audio (code-switching), language detection per segment is more appropriate than per-file
  • Return detected language and confidence in the API response; allow the caller to override if detection is wrong
  • Load the appropriate acoustic model or language model weights based on detected language
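
The decision logic around a language identification model might look like the sketch below. It assumes a hypothetical model that returns one probability dictionary per analysis window; pick_language, the min_confidence threshold, and the convention of reporting confidence 1.0 for a caller-supplied hint are all illustrative choices.

```python
from typing import Optional


def pick_language(window_probs: list[dict[str, float]],
                  hint: Optional[str] = None,
                  min_confidence: float = 0.6) -> tuple[Optional[str], float]:
    """Return (language, confidence); language is None when detection is unsure."""
    if hint is not None:
        # An explicit caller hint overrides detection.
        return hint, 1.0
    if not window_probs:
        raise ValueError("need at least one window of probabilities")
    # Average each language's probability across windows.
    totals: dict[str, float] = {}
    for probs in window_probs:
        for lang, p in probs.items():
            totals[lang] = totals.get(lang, 0.0) + p
    best = max(totals, key=lambda lang: totals[lang])
    confidence = totals[best] / len(window_probs)
    if confidence < min_confidence:
        return None, confidence
    return best, confidence
```

For code-switched audio the same function can be called once per segment instead of once per file.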

Confidence Scoring

Confidence scores tell downstream consumers how much to trust each word or segment:

  • CTC and RNN-T models produce per-token probabilities naturally; these can be aggregated to word-level confidence
  • Confidence is correlated with audio quality, speaker accent, and vocabulary familiarity
  • Low-confidence words can be flagged for human review (useful in call center analytics or medical transcription)
  • Calibrate confidence scores against actual accuracy on a validation set; raw model probabilities are often overconfident
  • Expose confidence at multiple granularities: per-word, per-segment, and overall transcript
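
A minimal sketch of two of the ideas above: aggregating token log-probabilities into a word-level score, and measuring how far confidence is from actual accuracy on a validation set. The geometric mean is one common aggregation choice (minimum or arithmetic mean are alternatives), and expected calibration error is used here as the calibration check; both function names are assumptions.

```python
import math


def word_confidence(token_logprobs: list[float]) -> float:
    """Geometric mean of the token probabilities that make up one word."""
    if not token_logprobs:
        raise ValueError("a word needs at least one token")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def expected_calibration_error(confidences: list[float], correct: list[bool],
                               bins: int = 10) -> float:
    """Weighted average gap between confidence and accuracy per bin."""
    buckets: list[list[tuple[float, bool]]] = [[] for _ in range(bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * bins), bins - 1)  # confidence 1.0 goes in the last bin
        buckets[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in buckets:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece
```

A large gap where average confidence exceeds accuracy is the overconfidence pattern mentioned above, and is a signal to fit a calibration mapping before exposing scores to callers.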

Post-Processing Pipeline

Raw ASR output is often not directly usable. Post-processing improves readability and accuracy:

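  • Punctuation and capitalization: many ASR models (especially CTC and RNN-T models trained on normalized transcripts) emit unpunctuated, lowercased text; use a separate punctuation model or fine-tune on punctuated text. Models trained on punctuated transcripts, such as Whisper, emit punctuation directly
  • Number normalization: convert spoken forms to written forms (e.g., "forty two dollars" -> "$42")
  • Domain vocabulary: add custom vocabulary (product names, medical terms, proper nouns) that the base model is unlikely to know; inject at decode time via biasing/hotword boosting, or correct known misrecognitions afterwards with a domain lexicon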
  • Profanity filtering: replace or redact profanity based on caller preference
  • Inverse text normalization (ITN): the general form of number normalization; converts spoken-form dates, times, currencies, and abbreviations to their written canonical forms
  • PII redaction: detect and mask personal information (phone numbers, SSNs, credit card numbers) in the transcript
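
As a toy illustration of number normalization, the sketch below rewrites spoken numbers from zero to ninety-nine and attaches a dollar sign when the next word is "dollar" or "dollars". It is deliberately naive: the normalize_numbers function is hypothetical, it handles unpunctuated English only, and it will wrongly rewrite phrases like "no one knows". Production ITN typically relies on context-aware grammars or learned models.

```python
UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}


def normalize_numbers(text: str) -> str:
    """Rewrite spoken numbers 0-99 (and simple dollar amounts) as digits."""
    words = text.split()
    out: list[str] = []
    i = 0
    while i < len(words):
        token = words[i].lower()
        value = None
        if token in TENS:
            value = TENS[token]
            i += 1
            # "forty two" -> 42: absorb a following unit between one and nine.
            if i < len(words) and 1 <= UNITS.get(words[i].lower(), 0) <= 9:
                value += UNITS[words[i].lower()]
                i += 1
        elif token in UNITS:
            value = UNITS[token]
            i += 1
        if value is None:
            out.append(words[i])  # not a number word, copy through unchanged
            i += 1
            continue
        if i < len(words) and words[i].lower() in ("dollar", "dollars"):
            out.append(f"${value}")
            i += 1
        else:
            out.append(str(value))
    return " ".join(out)
```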

Scaling Streaming Workloads

Streaming ASR is stateful and connection-oriented, which makes scaling harder than stateless batch inference:

  • Each active stream ties up a model server connection and its associated state (audio buffer, model hidden state)
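  • Route each stream to a single model server instance for its lifetime. A WebSocket or gRPC stream is one long-lived connection, so an L4 load balancer keeps it on one backend; affinity (sticky routing) matters mainly when a client reconnects and should resume on the instance that holds its state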
  • Scale model server instances based on the number of active connections, not CPU/GPU utilization alone
  • Implement graceful connection migration for rolling deploys: drain existing connections before terminating an instance
  • For very large scale (millions of concurrent streams), shard streams across clusters by region or user cohort
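
Scaling on active connections reduces to a small capacity calculation. The sketch below assumes each instance has a measured limit of concurrent streams and that the autoscaler aims to keep instances at a target fraction of that limit; desired_replicas and its parameters are hypothetical names.

```python
import math


def desired_replicas(active_streams: int, streams_per_replica: int,
                     target_utilization: float = 0.7,
                     min_replicas: int = 1) -> int:
    """Instance count needed to keep each replica at or under the target load."""
    # Usable capacity per replica leaves headroom for bursts and draining.
    usable_capacity = streams_per_replica * target_utilization
    needed = math.ceil(active_streams / usable_capacity)
    return max(min_replicas, needed)
```

For example, 1,000 active streams with 50 streams per instance and a 0.5 utilization target works out to 40 instances.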

Failure Modes and Mitigations

  • Stream disconnection: buffer the last N seconds on the client; reconnect and resume from last acknowledged timestamp
  • Model server overload: shed load with backpressure; reject new streams at connection time (HTTP 429 with a Retry-After header on the WebSocket handshake, or RESOURCE_EXHAUSTED for gRPC); degrade gracefully by reducing partial result frequency
  • Poor audio quality: return low confidence scores; surface warnings about background noise or clipping to the caller
  • Unsupported language: detect early and return a clear error rather than producing garbage output
  • Very long silence: detect via VAD and close the stream or send a keep-alive; do not run inference on silent audio
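
The stream disconnection mitigation can be sketched as a client-side replay buffer: keep recent audio frames, drop them as the server acknowledges finalized timestamps, and resend whatever remains after reconnecting. The ReplayBuffer class and its frame layout are assumptions made for illustration.

```python
from collections import deque


class ReplayBuffer:
    """Keeps the last max_seconds of unacknowledged audio frames."""

    def __init__(self, max_seconds: float = 10.0) -> None:
        self.max_seconds = max_seconds
        self.frames: deque[tuple[float, float, bytes]] = deque()  # (start, duration, payload)

    def append(self, start: float, duration: float, payload: bytes) -> None:
        self.frames.append((start, duration, payload))
        # Evict frames that ended before the retention window.
        cutoff = start + duration - self.max_seconds
        while self.frames and self.frames[0][0] + self.frames[0][1] < cutoff:
            self.frames.popleft()

    def ack(self, timestamp: float) -> None:
        """Drop frames the server has finalized up to timestamp."""
        while self.frames and self.frames[0][0] + self.frames[0][1] <= timestamp:
            self.frames.popleft()

    def frames_to_resend(self) -> list[tuple[float, float, bytes]]:
        """Frames to replay, oldest first, after a reconnect."""
        return list(self.frames)
```

On reconnect, the client sends the last acknowledged timestamp followed by frames_to_resend(), so the server can resume without re-transcribing audio it already finalized.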

Interview Tips

  • Clarify upfront whether the primary use case is streaming or batch; the architecture differs significantly
  • Explain audio chunking and VAD before jumping into the ASR model; it shows you understand the preprocessing pipeline
  • Mention RNN-T (or CTC with a streaming encoder) for streaming vs full-context encoder-decoder models for batch; this is a common follow-up question
  • Diarization is often asked as a follow-up; know the basic pipeline (embedding, clustering, alignment)
  • Discuss confidence calibration and how downstream consumers use confidence scores
  • Post-processing is often overlooked; mentioning ITN, punctuation, and custom vocabulary shows production experience
