What Is a Caption Service?
A caption service transforms transcripts and media files into timed text tracks that display synchronized subtitles during playback. It must generate accurate timing, support multiple output formats (WebVTT, SRT, TTML), handle multi-language delivery, and provide a workflow for detecting and correcting sync drift. The design must account for format-specific constraints, translation pipelines, and the reality that raw ASR output requires timing alignment before it is usable as captions.
Requirements
Functional
- Generate timed caption cues from a transcript with word-level timestamps.
- Export captions in WebVTT, SRT, and TTML formats.
- Support multiple language tracks per media asset via translation integration.
- Detect sync drift between caption cues and audio and surface corrections for review.
- Allow editors to apply global time offsets or per-cue adjustments and publish revised tracks.
Non-Functional
- Caption generation under 30 seconds for a 60-minute transcript.
- Rendered cue files served from CDN with cache headers; no dynamic generation per request.
- Support at least 20 language tracks per media asset.
Data Model
- CaptionTrack: track_id, media_asset_id, language_code, source (ASR, HUMAN, TRANSLATED), status (DRAFT, REVIEW, PUBLISHED), created_at, published_at.
- CaptionCue: cue_id, track_id, cue_index, start_ms, end_ms, text, confidence (nullable), flags[] (LOW_CONFIDENCE, SYNC_DRIFT, MANUAL_EDIT).
- CaptionRevision: revision_id, track_id, editor_id, change_type (OFFSET_SHIFT, CUE_EDIT, CUE_DELETE, CUE_INSERT), before_json, after_json, created_at.
- CaptionExport: export_id, track_id, format (WEBVTT, SRT, TTML), storage_key, cdn_url, generated_at.
- TranslationRequest: request_id, source_track_id, target_language_code, provider, status (PENDING, IN_PROGRESS, DONE, FAILED), created_at, finished_at.
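The rows above map naturally onto typed records. A minimal Python sketch of the CaptionCue shape (field names follow the data model; the enum, property, and sample values are illustrative, not a prescribed implementation):

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional

class CueFlag(Enum):
    LOW_CONFIDENCE = "LOW_CONFIDENCE"
    SYNC_DRIFT = "SYNC_DRIFT"
    MANUAL_EDIT = "MANUAL_EDIT"

@dataclass
class CaptionCue:
    cue_id: str
    track_id: str
    cue_index: int
    start_ms: int
    end_ms: int
    text: str
    confidence: Optional[float] = None          # nullable for HUMAN-edited cues
    flags: List[CueFlag] = field(default_factory=list)

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms

# Illustrative instance
cue = CaptionCue(cue_id="c1", track_id="t1", cue_index=0,
                 start_ms=0, end_ms=1100, text="Hello world.",
                 confidence=0.91)
```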
Core Algorithms
Timed Text Generation
The cue generator reads word-level timestamps from the TranscriptionService Transcript and applies a cue-splitting algorithm. Words are accumulated into a cue buffer until one of three conditions is met: the buffer duration exceeds the max cue duration (default seven seconds), the accumulated character count exceeds the max line width (default 42 characters per line, two lines), or a sentence boundary is detected via a punctuation heuristic. Each completed buffer becomes a CaptionCue with start_ms from the first word and end_ms from the last word plus a configurable tail gap (default 200 ms). The generator inserts a minimum inter-cue gap (50 ms) by trimming the previous cue end if necessary.
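The splitting rules above can be sketched in Python. This is a simplified illustration, not the production generator: words arrive as (text, start_ms, end_ms) tuples, `max_chars` approximates the two-line 42-character limit, and the sentence-boundary heuristic is a bare punctuation check:

```python
def generate_cues(words, max_chars=84, max_dur_ms=7000, tail_ms=200, min_gap_ms=50):
    """Group (text, start_ms, end_ms) word tuples into caption cues."""
    cues, buf = [], []

    def flush():
        if buf:
            cues.append({"start_ms": buf[0][1],
                         "end_ms": buf[-1][2] + tail_ms,   # configurable tail gap
                         "text": " ".join(t for t, _, _ in buf)})
            buf.clear()

    for text, start, end in words:
        # Flush first if adding this word would overflow width or duration.
        if buf and (len(" ".join(t for t, _, _ in buf)) + 1 + len(text) > max_chars
                    or end - buf[0][1] > max_dur_ms):
            flush()
        buf.append((text, start, end))
        if text[-1] in ".?!":          # sentence-boundary heuristic
            flush()
    flush()

    # Enforce the minimum inter-cue gap by trimming the previous cue's end.
    for prev, cur in zip(cues, cues[1:]):
        prev["end_ms"] = min(prev["end_ms"], cur["start_ms"] - min_gap_ms)
    return cues

cues = generate_cues([("Hello", 0, 400), ("world.", 450, 900),
                      ("Next", 1000, 1300), ("sentence", 1350, 1900),
                      ("ends", 1950, 2300), ("here.", 2350, 2700)])
```

In the sample input, the first cue's tail gap (900 + 200 = 1100 ms) would overlap the second cue, so the gap pass trims it back to 950 ms.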
Format Rendering
Format renderers are stateless functions that accept a list of CaptionCue rows and return a byte string. The WebVTT renderer outputs the WEBVTT header followed by cues in HH:MM:SS.mmm --> HH:MM:SS.mmm format. The SRT renderer uses a 1-based sequential index and HH:MM:SS,mmm separators. The TTML renderer wraps cues in an XML structure with begin and end attributes in SMPTE timecode. Rendered files are written to object storage under a key derived from track_id and format, then a CaptionExport row is inserted and the CDN origin is notified.
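A minimal sketch of the WebVTT and SRT renderers as stateless functions (TTML omitted for brevity; function names are assumptions). The only timing difference between the two formats is the millisecond separator, which the shared timestamp helper parameterizes:

```python
def _ts(ms, sep="."):
    """Format milliseconds as HH:MM:SS<sep>mmm."""
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, frac = divmod(rem, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d}{sep}{frac:03d}"

def render_webvtt(cues):
    # WEBVTT header, then cues separated by blank lines; dot separator.
    body = "\n\n".join(f"{_ts(c['start_ms'])} --> {_ts(c['end_ms'])}\n{c['text']}"
                       for c in cues)
    return f"WEBVTT\n\n{body}\n".encode("utf-8")

def render_srt(cues):
    # 1-based sequential index per cue; comma separator.
    body = "\n\n".join(
        f"{i}\n{_ts(c['start_ms'], ',')} --> {_ts(c['end_ms'], ',')}\n{c['text']}"
        for i, c in enumerate(cues, start=1))
    return (body + "\n").encode("utf-8")

sample = [{"start_ms": 0, "end_ms": 1100, "text": "Hello world."}]
vtt = render_webvtt(sample)
srt = render_srt(sample)
```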
Sync Drift Detection and Correction
Sync drift is measured by comparing the audio energy envelope at each cue boundary against the cue start_ms using a cross-correlation window of plus or minus 500 ms. Cues where the correlation peak falls more than 200 ms from the declared start_ms are flagged with SYNC_DRIFT. The drift detector reports a median drift value per track. Editors can apply a global offset (UPDATE all cue start_ms and end_ms by N ms in a single batch) or edit individual cues. Every change writes a CaptionRevision row and invalidates the corresponding CaptionExport CDN URLs via a cache purge call.
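The flagging logic can be approximated with a simpler onset-matching pass, sketched below. Note the substitution: this uses energy-onset detection in place of true cross-correlation, which is enough to illustrate the ±500 ms search window, the 200 ms flag threshold, and the per-track median. The envelope is assumed to be a per-frame energy list at 10 ms resolution; all names are illustrative:

```python
def detect_sync_drift(envelope, cue_starts_ms, frame_ms=10,
                      window_ms=500, flag_threshold_ms=200, energy_thr=0.5):
    """Flag cues whose declared start is far from the nearest audio onset."""
    # Onsets: frames where energy crosses the threshold upward.
    onsets_ms = [i * frame_ms for i in range(1, len(envelope))
                 if envelope[i] > energy_thr and envelope[i - 1] <= energy_thr]
    results = []
    for start in cue_starts_ms:
        # Search only within the +-window_ms correlation window.
        candidates = [o - start for o in onsets_ms if abs(o - start) <= window_ms]
        drift = min(candidates, key=abs) if candidates else None
        flagged = drift is not None and abs(drift) > flag_threshold_ms
        results.append({"start_ms": start, "drift_ms": drift, "sync_drift": flagged})
    drifts = sorted(r["drift_ms"] for r in results if r["drift_ms"] is not None)
    median = drifts[len(drifts) // 2] if drifts else 0   # lower median
    return results, median

# Synthetic envelope: speech bursts starting at 1000, 4000, 7000 ms,
# while the cues claim starts 300 ms later -- a uniform drift.
envelope = [0.0] * 1000
for f in (100, 400, 700):
    for i in range(f, f + 20):
        envelope[i] = 1.0
results, median = detect_sync_drift(envelope, [1300, 4300, 7300])
```

A median of -300 ms here would suggest a global offset correction rather than per-cue edits.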
Scalability and Reliability
- Translation pipeline: Translation requests fan out to a translation provider (DeepL, Google Translate) per target language in parallel. Each translated cue preserves the original start_ms and end_ms; only the text field changes. The translated track is created as a new CaptionTrack with source=TRANSLATED linked to the source track.
- CDN caching: Published caption files have long cache TTLs (24 hours). Revised tracks increment a version suffix in the storage key rather than overwriting, so old URLs remain valid during CDN propagation and clients receive new URLs through the media manifest update.
- Bulk cue operations: Offset shifts on large tracks use a single UPDATE statement with arithmetic on start_ms and end_ms rather than row-by-row updates, keeping operation time under one second for tracks with up to 10 000 cues.
- Export cache invalidation: When a track is re-published, the old CaptionExport rows are marked STALE and new exports are generated asynchronously; the API returns the last published CDN URL until the new export completes.
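The single-statement offset shift described above can be demonstrated against SQLite. The schema is reduced to the timing columns; table and column names follow the data model, but the schema itself is a hypothetical sketch:

```python
import sqlite3

def apply_global_offset(conn, track_id, offset_ms):
    # One arithmetic UPDATE shifts every cue in the track; no Python-side loop.
    cur = conn.execute(
        "UPDATE caption_cue "
        "SET start_ms = start_ms + ?, end_ms = end_ms + ? "
        "WHERE track_id = ?",
        (offset_ms, offset_ms, track_id),
    )
    conn.commit()
    return cur.rowcount  # number of cues shifted

# Demo against an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE caption_cue "
             "(cue_id TEXT, track_id TEXT, start_ms INT, end_ms INT)")
conn.executemany("INSERT INTO caption_cue VALUES (?, ?, ?, ?)",
                 [("c1", "t1", 1300, 2100), ("c2", "t1", 4300, 5200)])
shifted = apply_global_offset(conn, "t1", -300)
rows = conn.execute("SELECT start_ms, end_ms FROM caption_cue "
                    "ORDER BY start_ms").fetchall()
```

In a production system the same statement would also be wrapped with a CaptionRevision insert of type OFFSET_SHIFT, as the revision model requires.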
API Design
- POST /caption-tracks: create a track from a transcript_id and language_code; triggers cue generation.
- GET /caption-tracks/{id}/cues: paginated list of cues with flags for review.
- PATCH /caption-tracks/{id}/cues/{cue_id}: edit cue text or timing; writes a revision.
- POST /caption-tracks/{id}/offset: apply a global time offset in milliseconds to all cues.
- POST /caption-tracks/{id}/publish: render all formats, upload to CDN, update the media manifest.
- POST /caption-tracks/{id}/translate: request translation to one or more target language codes.
- GET /caption-tracks/{id}/exports/{format}: return the CDN URL for a specific format export.
Key Design Decisions
Storing cues as individual database rows rather than a single serialized file allows per-cue editing, flagging, and partial updates without reloading and re-serializing the entire track. Versioning export storage keys instead of overwriting prevents race conditions where a CDN edge node serves a partially overwritten file. Separating the translation track from the source track means sync corrections applied to the source can be propagated to translations as a batch timing copy without touching translated text.
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "How are timed text captions generated from a transcript?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Word-level timestamps from the ASR transcript are grouped into caption lines using heuristics: maximum line length (e.g., 42 characters), maximum display duration (e.g., 7 seconds), and natural sentence boundaries. Each group becomes a cue with a start time, end time, and text payload."
      }
    },
    {
      "@type": "Question",
      "name": "What are the differences between WebVTT and SRT output formats?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "SRT (SubRip) uses a sequential integer index, HH:MM:SS,mmm timestamp format, and plain text cues. WebVTT adds a 'WEBVTT' header, uses HH:MM:SS.mmm (dot not comma), supports CSS styling via cue settings, and is natively consumable by HTML5 video track elements. Both store cues sequentially separated by blank lines."
      }
    },
    {
      "@type": "Question",
      "name": "How does a caption service support multiple languages?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "The caption service stores a language code with each caption track. Source captions (usually in the original audio language) can be passed through a machine translation API (DeepL, Google Translate) to generate additional language tracks. Each language track is stored and served independently; the video player selects a track based on user locale or explicit selection."
      }
    },
    {
      "@type": "Question",
      "name": "How do you detect and correct sync drift in caption tracks?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Sync drift occurs when caption timestamps drift from the audio over time. A drift-correction pass aligns a reference audio fingerprint or scene-cut timestamps against caption cue times and computes a linear or piecewise offset. All cue timestamps in the affected range are shifted by the computed offset. For severe drift, re-transcription of the affected segment is triggered automatically."
      }
    }
  ]
}