Transcription Service Low-Level Design: Audio Chunking, ASR Integration, and Speaker Diarization

What Is a Transcription Service?

A transcription service converts audio and video recordings into accurate, searchable text with timestamps and speaker labels. It must handle files of varying length, integrate with one or more ASR (automatic speech recognition) engines, merge speaker diarization output with word-level transcripts, score confidence, and make results queryable. Designing it at the component level means deciding how audio is chunked, how ASR calls are parallelized, and how partial results are merged into a final coherent transcript.

Requirements

Functional

  • Accept audio and video files up to two hours in duration for batch transcription.
  • Split long files into overlapping chunks to enable parallel ASR processing.
  • Integrate with configurable ASR engines (Whisper, Google STT, AWS Transcribe) via a pluggable adapter interface.
  • Merge diarization output (speaker segments) with word-level ASR output to produce speaker-attributed transcripts.
  • Assign confidence scores at word and segment level and flag low-confidence regions for human review.
  • Index completed transcripts for full-text search with timestamp anchoring.

Non-Functional

  • Transcription latency: real-time factor below 0.3x (a 60-minute file completes in under 18 minutes).
  • Word error rate below 8 percent on clean speech with supported languages.
  • Horizontal scaling of ASR workers without re-architecting the orchestration layer.

Data Model

  • TranscriptionJob: job_id, media_asset_id, language, engine, status (PENDING, CHUNKING, TRANSCRIBING, MERGING, INDEXING, DONE, FAILED), created_at, finished_at.
  • AudioChunk: chunk_id, job_id, chunk_index, start_ms, end_ms, overlap_ms, storage_key, status (PENDING, SENT, DONE, FAILED).
  • ASRResult: result_id, chunk_id, engine_response_json, words[] (word, start_ms, end_ms, confidence), engine_latency_ms, received_at.
  • DiarizationResult: diarization_id, job_id, segments[] (speaker_label, start_ms, end_ms).
  • Transcript: transcript_id, job_id, segments[] (speaker_label, start_ms, end_ms, text, avg_confidence), full_text, word_count, created_at.
  • ReviewFlag: flag_id, transcript_id, segment_index, reason (LOW_CONFIDENCE, OVERLAP_MISMATCH), resolved, created_at.
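The records above can be sketched as Python dataclasses. This is a minimal sketch, not the service's actual schema: the status enum and the `Word` / `AudioChunk` / `TranscriptSegment` shapes follow the field lists above, and any defaults are assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class JobStatus(str, Enum):
    """Lifecycle states of a TranscriptionJob, as listed above."""
    PENDING = "PENDING"
    CHUNKING = "CHUNKING"
    TRANSCRIBING = "TRANSCRIBING"
    MERGING = "MERGING"
    INDEXING = "INDEXING"
    DONE = "DONE"
    FAILED = "FAILED"


@dataclass
class Word:
    """One word from an ASRResult, with millisecond timestamps."""
    word: str
    start_ms: int
    end_ms: int
    confidence: float


@dataclass
class AudioChunk:
    """One extracted chunk; status defaults to PENDING (an assumption)."""
    chunk_id: str
    job_id: str
    chunk_index: int
    start_ms: int
    end_ms: int
    overlap_ms: int
    storage_key: str
    status: str = "PENDING"


@dataclass
class TranscriptSegment:
    """One speaker-attributed segment of the final Transcript."""
    speaker_label: str
    start_ms: int
    end_ms: int
    text: str
    avg_confidence: float
```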

Core Algorithms

Audio Chunking with Overlap

The chunker reads the media file header to extract duration_ms, then divides it into chunks of configurable length (default 60 seconds) with a configurable overlap (default 3 seconds on each boundary). Overlap prevents word-boundary clipping at chunk edges. Chunks are extracted using FFmpeg via a subprocess call, written to object storage, and AudioChunk rows are inserted in a single bulk write. The overlap region is later used during merge to disambiguate which chunk owns each boundary word.
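The boundary arithmetic above can be sketched as follows. This is an illustrative sketch of the planning step only (the FFmpeg extraction and storage write are omitted); the function name and tuple shape are assumptions, while the 60-second chunk length and 3-second overlap match the defaults stated above.

```python
def plan_chunks(duration_ms: int, chunk_ms: int = 60_000, overlap_ms: int = 3_000):
    """Return (chunk_index, start_ms, end_ms) tuples covering the file.

    Each chunk is extended by overlap_ms on each interior boundary so that
    a word straddling a boundary is captured fully in at least one chunk.
    """
    if duration_ms <= 0:
        return []
    chunks = []
    index = 0
    pos = 0
    while pos < duration_ms:
        start = max(0, pos - overlap_ms)                      # extend left into previous chunk
        end = min(duration_ms, pos + chunk_ms + overlap_ms)   # extend right into next chunk
        chunks.append((index, start, end))
        index += 1
        pos += chunk_ms
    return chunks
```

A real chunker would then invoke FFmpeg per tuple (roughly `ffmpeg -ss <start> -to <end> -i <input> <output>`) and bulk-insert the corresponding AudioChunk rows.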

ASR Engine Adapter and Parallelism

Each ASR engine is wrapped in an adapter implementing a common interface: transcribe(storage_key, language) -> ASRResult. Workers pull AudioChunk tasks from a Kafka topic, call the adapter, and write ASRResult rows. Concurrency per job is capped by a semaphore stored in Redis (keyed by job_id) to avoid saturating the ASR engine quota. When all chunks for a job reach DONE status, a completion event triggers the merge phase.
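A sketch of the adapter interface and the per-job concurrency cap, under stated assumptions: `FakeEngine` is a hypothetical stand-in for a vendor adapter, and `JobSemaphore` is an in-process stand-in for the Redis counter (a real worker would use atomic INCR up to the cap and DECR on release, keyed by job_id).

```python
import abc
from dataclasses import dataclass


@dataclass
class ASRWord:
    word: str
    start_ms: int
    end_ms: int
    confidence: float


class ASREngineAdapter(abc.ABC):
    """Common interface; concrete adapters wrap Whisper, Google STT, AWS Transcribe."""

    @abc.abstractmethod
    def transcribe(self, storage_key: str, language: str) -> list:
        ...


class FakeEngine(ASREngineAdapter):
    """Hypothetical test double; a real adapter would call the vendor API or SDK."""

    def transcribe(self, storage_key: str, language: str) -> list:
        return [ASRWord("hello", 0, 400, 0.95)]


class JobSemaphore:
    """In-process model of the Redis per-job cap: acquire fails once a job
    already has `cap` chunks in flight, so other jobs are not starved."""

    def __init__(self, cap: int):
        self.cap = cap
        self.counts: dict = {}

    def try_acquire(self, job_id: str) -> bool:
        if self.counts.get(job_id, 0) >= self.cap:
            return False
        self.counts[job_id] = self.counts.get(job_id, 0) + 1
        return True

    def release(self, job_id: str) -> None:
        self.counts[job_id] -= 1
```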

Diarization and Transcript Merge

Diarization runs as a separate parallel job on the full audio file (not per chunk) using a speaker segmentation model. The merge algorithm iterates over ASR words in chronological order: for each word from an ASRResult, it finds the overlapping DiarizationResult segment by binary search on start_ms, assigns that speaker label, and groups consecutive words with the same speaker and chunk into a Transcript segment. Words in the overlap region between two chunks are deduplicated by preferring the copy with higher confidence. The resulting Transcript is written atomically and the job moves to INDEXING.
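The two core merge steps can be sketched as below: speaker lookup by binary search, and overlap deduplication by confidence. Function names, the tuple shapes, and the 150 ms duplicate-matching tolerance are assumptions for illustration.

```python
from bisect import bisect_right


def speaker_for(word_start_ms, segments):
    """Binary-search the diarization segment containing word_start_ms.

    segments: list of (speaker_label, start_ms, end_ms), sorted by start_ms.
    """
    starts = [seg[1] for seg in segments]
    i = bisect_right(starts, word_start_ms) - 1
    if i >= 0 and segments[i][1] <= word_start_ms < segments[i][2]:
        return segments[i][0]
    return "UNKNOWN"  # word falls outside every diarization segment


def dedupe_overlap(words_a, words_b, tolerance_ms=150):
    """Merge word lists from two adjacent chunks, deduplicating the overlap.

    Words are (word, start_ms, end_ms, confidence). Two words with the same
    text and near-equal start times are treated as the same boundary word,
    and the higher-confidence copy wins.
    """
    merged = list(words_a)
    for wb in words_b:
        dup = next((wa for wa in merged
                    if abs(wa[1] - wb[1]) <= tolerance_ms and wa[0] == wb[0]), None)
        if dup is None:
            merged.append(wb)
        elif wb[3] > dup[3]:
            merged[merged.index(dup)] = wb
    merged.sort(key=lambda w: w[1])
    return merged
```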

Scalability and Reliability

  • Engine failover: If the primary ASR engine returns an error after two retries, the adapter falls back to a secondary engine. Both results are stored; the primary is preferred for the merge unless its confidence is below threshold.
  • Chunk-level retry: Failed chunks are retried independently without restarting the entire job. Only the failed chunk re-enters the Kafka topic.
  • Search indexing: Completed transcripts are indexed into Elasticsearch with one document per segment, storing speaker_label, start_ms, end_ms, and text. Queries return matching segments with timestamp anchors that the client uses to seek the media player.
  • Cost control: Chunk deduplication by content hash prevents re-transcribing identical audio segments (common in broadcast content with repeated ad breaks).
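The engine-failover rule above (two retries on the primary, then fall back to the secondary) can be sketched as a wrapper around two adapter callables. This is a minimal sketch: the function name is an assumption, a real worker would catch vendor-specific error types rather than bare `Exception`, and storing both results for the confidence comparison is omitted.

```python
def transcribe_with_failover(primary, secondary, storage_key, language, retries=2):
    """Call primary up to 1 + retries times; on persistent failure, fall back.

    Returns (engine_name, result) so the caller can record which engine won.
    """
    last_err = None
    for _ in range(1 + retries):
        try:
            return "primary", primary(storage_key, language)
        except Exception as err:  # a real worker catches vendor-specific errors
            last_err = err
    try:
        return "secondary", secondary(storage_key, language)
    except Exception:
        raise last_err  # surface the primary's error if both engines fail
```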

API Design

  • POST /transcription-jobs — submit a job with media_asset_id, language, and optional engine preference.
  • GET /transcription-jobs/{id} — poll job status with chunk progress counts.
  • GET /transcription-jobs/{id}/transcript — fetch the completed Transcript with speaker-attributed segments.
  • GET /transcription-jobs/{id}/transcript/vtt — export as WebVTT for caption display.
  • GET /search?q={query}&job_id={id} — search within a transcript and return segments with timestamps.
  • POST /transcription-jobs/{id}/flags/{flag_id}/resolve — mark a review flag as resolved after human correction.
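The WebVTT export endpoint above amounts to serializing Transcript segments into caption cues. A minimal sketch, assuming segments arrive as (speaker_label, start_ms, end_ms, text) tuples; the `<v Speaker>` voice tag is standard WebVTT syntax for speaker attribution.

```python
def to_vtt(segments):
    """Render speaker-attributed segments as a WebVTT document string."""

    def ts(ms):
        # WebVTT timestamps are HH:MM:SS.mmm
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, millis = divmod(rem, 1_000)
        return f"{h:02d}:{m:02d}:{s:02d}.{millis:03d}"

    lines = ["WEBVTT", ""]
    for speaker, start, end, text in segments:
        lines.append(f"{ts(start)} --> {ts(end)}")
        lines.append(f"<v {speaker}>{text}")
        lines.append("")  # blank line terminates each cue
    return "\n".join(lines)
```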

Key Design Decisions

Running diarization on the full audio rather than per-chunk avoids speaker label fragmentation across chunk boundaries, which would require a cross-chunk speaker re-identification step. Storing the raw engine_response_json in ASRResult allows reprocessing with an updated merge algorithm without re-calling the ASR engine. Capping concurrency per job with a Redis semaphore rather than a global worker limit allows fair resource sharing between many simultaneous jobs without starving small files behind large ones.

Frequently Asked Questions

Why chunk audio with overlap for a transcription service?

ASR models have a maximum context window. Splitting audio into overlapping chunks (e.g., 30-second chunks with a 5-second overlap) ensures that words straddling a boundary are captured fully in at least one chunk. Duplicate words in the overlap region are removed during a stitching pass that aligns transcripts by timestamp.

How do you integrate an ASR engine in a scalable transcription service?

Audio chunks are placed on a work queue. Worker pods pull chunks, call the ASR engine (Whisper, Google Speech-to-Text, AWS Transcribe, etc.) via HTTP or SDK, and store the partial transcript with start/end timestamps. A coordinator reassembles ordered partials once all chunks for an audio file are processed.

How is speaker diarization handled in a transcription pipeline?

Speaker diarization runs either as a separate pass on the full audio (using a model like pyannote.audio) or is requested from the ASR vendor as an optional feature. The diarization output is a list of (speaker_id, start_ms, end_ms) segments. These segments are merged with word-level timestamps from the ASR output to annotate each word with a speaker label.

What is confidence scoring and how is it used in a transcription service?

Most ASR engines return a per-word or per-segment confidence value between 0 and 1. Low-confidence words are flagged in the transcript (e.g., wrapped in a span or marked with a flag field). Downstream systems can use these flags to route segments to human reviewers, highlight uncertain text in an editor UI, or filter words below a threshold from search indexes.
