Low Level Design: Video Processing Pipeline

Ingestion: Chunked Upload and Raw Storage

Large video files cannot reliably be uploaded in a single HTTP request. The ingestion layer accepts chunked uploads: the client splits the file into fixed-size chunks (e.g., 5 MB), uploads each with a chunk index and upload session ID, and signals completion when all chunks arrive. The server reassembles chunks into the original file and writes it to a raw storage bucket — an S3 prefix or equivalent that holds unprocessed source files.
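The reassembly step can be sketched as follows. This is a minimal in-memory illustration — a real ingestion service would stream chunk objects from the session's staging prefix rather than hold them in a dict — and the `CHUNK_SIZE` constant and function names are assumptions, not a prescribed API:

```python
import hashlib

CHUNK_SIZE = 5 * 1024 * 1024  # 5 MB, matching the example chunk size above

def reassemble(chunks: dict, total_chunks: int) -> bytes:
    """Join chunks in index order, refusing partial uploads.

    `chunks` maps chunk index -> chunk bytes for one upload session.
    Raising on a gap guarantees a half-uploaded file never reaches
    the raw storage bucket.
    """
    missing = [i for i in range(total_chunks) if i not in chunks]
    if missing:
        raise ValueError(f"upload incomplete, missing chunks: {missing}")
    return b"".join(chunks[i] for i in range(total_chunks))

def content_hash(data: bytes) -> str:
    """Checksum of the reassembled file so the client can verify integrity."""
    return hashlib.sha256(data).hexdigest()
```

A checksum exchanged at completion time lets client and server agree the reassembled bytes match the original file before the processing job is enqueued.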

A successful reassembly triggers a processing job. The job is written to a durable queue with the raw file path, upload metadata, and priority level. Live content (streams, time-sensitive uploads) gets elevated priority over background batch uploads.
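A hedged sketch of the enqueue side, using Python's `heapq` as a stand-in for the durable broker (SQS, Kafka, or Redis described below); the priority constants and function names are illustrative:

```python
import heapq
import itertools

PRIORITY_LIVE = 0    # lower value = dequeued first
PRIORITY_BATCH = 10

_counter = itertools.count()  # tie-breaker: FIFO ordering within a priority level
_queue = []

def enqueue(raw_path: str, metadata: dict, priority: int) -> None:
    """Push a processing job with its raw file path and upload metadata."""
    heapq.heappush(_queue, (priority, next(_counter), raw_path, metadata))

def dequeue():
    """Pop the highest-priority (lowest-numbered) job."""
    _priority, _seq, raw_path, metadata = heapq.heappop(_queue)
    return raw_path, metadata
```

The monotonic counter matters: without it, two jobs at the same priority would be compared by their payloads, and FIFO ordering within a priority band would not be guaranteed.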

Job Queue and Worker Pool

The job queue is a priority queue backed by a persistent broker (SQS, Kafka, or Redis Sorted Set). Workers pull jobs by priority. Each worker is a transcoding process that wraps FFmpeg. The worker pool scales horizontally: more workers handle peak load, idle workers drain down. Workers are stateless — all state lives in the job record and output storage.

A job is marked in-progress when claimed and has a heartbeat timeout. If the worker crashes without completing, the job becomes visible again after the timeout and is retried. After a maximum number of retries (e.g., 3), the job moves to a dead-letter queue for manual inspection.
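The claim/heartbeat/retry lifecycle can be sketched as below. `now` is passed explicitly to keep the logic testable; in practice a broker such as SQS implements the visibility timeout server-side, and the timeout and retry values here are illustrative:

```python
MAX_RETRIES = 3
HEARTBEAT_TIMEOUT = 60.0  # seconds without a heartbeat before re-delivery

class JobTracker:
    """Visibility-timeout bookkeeping for claimed jobs (in-memory sketch)."""

    def __init__(self):
        self.jobs = {}        # job_id -> {"retries": int, "claimed_at": float | None}
        self.dead_letter = []

    def add(self, job_id):
        self.jobs[job_id] = {"retries": 0, "claimed_at": None}

    def claim(self, job_id, now):
        self.jobs[job_id]["claimed_at"] = now

    def heartbeat(self, job_id, now):
        # A live worker refreshes the claim timestamp periodically.
        self.jobs[job_id]["claimed_at"] = now

    def reap(self, now):
        """Re-deliver jobs whose worker stopped heartbeating;
        dead-letter after MAX_RETRIES."""
        for job_id, job in list(self.jobs.items()):
            stale = (job["claimed_at"] is not None
                     and now - job["claimed_at"] > HEARTBEAT_TIMEOUT)
            if stale:
                job["claimed_at"] = None   # visible to other workers again
                job["retries"] += 1
                if job["retries"] >= MAX_RETRIES:
                    self.dead_letter.append(job_id)
                    del self.jobs[job_id]
```

Passing `now` in rather than calling the clock inside the class also makes the reaper deterministic under test.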

Transcoding: Codecs and Adaptive Bitrate

Each job produces multiple output renditions for adaptive bitrate streaming (HLS or DASH). Target resolutions: 360p, 720p, 1080p, 4K (if source supports it). Each resolution is encoded in one or more codecs:

  • H.264 (AVC) — universal support, highest compatibility
  • H.265 (HEVC) — ~40% smaller than H.264 at equivalent quality, requires hardware decode support
  • VP9 — royalty-free, good browser support
  • AV1 — best compression ratio, slow to encode, growing hardware support

HLS output consists of a master playlist (.m3u8), per-rendition playlists, and segment files (.ts or .fmp4). Segments are typically 6 seconds. The master playlist lists all renditions with bandwidth hints so the player selects appropriate quality based on network conditions.
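A master playlist with bandwidth hints can be generated as below. The rendition ladder values are illustrative, not the pipeline's actual encoding targets:

```python
# Illustrative rendition ladder: (name, width, height, peak bandwidth in bps).
RENDITIONS = [
    ("360p", 640, 360, 800_000),
    ("720p", 1280, 720, 2_500_000),
    ("1080p", 1920, 1080, 8_000_000),
]

def master_playlist(renditions) -> str:
    """Emit a minimal HLS master playlist listing every rendition."""
    lines = ["#EXTM3U", "#EXT-X-VERSION:3"]
    for name, width, height, bandwidth in renditions:
        # BANDWIDTH is the hint the player uses for quality selection.
        lines.append(
            f"#EXT-X-STREAM-INF:BANDWIDTH={bandwidth},RESOLUTION={width}x{height}"
        )
        lines.append(f"{name}/playlist.m3u8")  # per-rendition media playlist
    return "\n".join(lines) + "\n"
```

Each referenced media playlist in turn lists that rendition's 6-second segments; the player only ever downloads segments from the rendition it has currently selected.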

Thumbnail Extraction

Thumbnails are extracted from key timestamps: 0% (first non-black frame), 25%, 50%, 75%, and a heuristic best-frame selector that avoids scene transitions and low-contrast frames. FFmpeg handles extraction. Thumbnails are stored as JPEG and WebP, sized to standard dimensions (1280×720, 640×360, 320×180).
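A sketch of the timestamp math and the per-frame FFmpeg invocation, under the assumption that extraction is one FFmpeg call per timestamp (the best-frame and non-black-frame heuristics from the paragraph above are omitted here):

```python
def thumbnail_timestamps(duration: float):
    """Timestamps at 0%, 25%, 50%, and 75% of the video's duration."""
    return [duration * p for p in (0.0, 0.25, 0.5, 0.75)]

def thumbnail_cmd(src: str, ts: float, out: str,
                  width: int = 1280, height: int = 720):
    """Build an FFmpeg command extracting one scaled frame at `ts` seconds.

    Placing `-ss` before `-i` seeks in the input before decoding,
    which is fast even on long source files.
    """
    return [
        "ffmpeg", "-ss", f"{ts:.3f}", "-i", src,
        "-frames:v", "1", "-vf", f"scale={width}:{height}", "-y", out,
    ]
```

Running the same command with `.webp` and `.jpg` output paths, at each of the three standard dimensions, yields the full thumbnail set.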

Metadata Extraction

After transcoding, the worker extracts and stores technical metadata from the source file using ffprobe:

  • duration — total length in seconds
  • resolution — width and height of source
  • codec — source video and audio codec identifiers
  • fps — frames per second
  • audio_channels — mono, stereo, or surround
  • bitrate — overall and per-stream

Quality Validation

Before marking a job complete, the worker runs quality checks on transcoded output. VMAF (Video Multimethod Assessment Fusion) scores each rendition against the source — a score below threshold (e.g., 85 for 1080p) triggers a re-encode with adjusted quality settings. Black frame detection scans the first and last 10 seconds to catch encoding failures that produce silent black output. Audio loudness normalization is checked against EBU R128 targets.
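The VMAF gate and the "adjusted quality settings" retry can be sketched as a pure decision function. The thresholds and the CRF-step heuristic are illustrative assumptions; real values are tuned per catalog:

```python
# Illustrative per-rendition VMAF floors (the 1080p value matches the
# example threshold above).
VMAF_THRESHOLDS = {"360p": 70.0, "720p": 80.0, "1080p": 85.0}

def validate_rendition(rendition: str, vmaf_score: float, crf: int):
    """Return (passed, next_crf).

    On failure, lower the CRF for the re-encode: lower CRF means
    higher quality in x264/x265, clamped here at an assumed floor of 16.
    """
    if vmaf_score >= VMAF_THRESHOLDS[rendition]:
        return True, crf
    return False, max(crf - 2, 16)
```

Because the function is pure, the same logic can drive both the worker's retry loop and offline analysis of historical scores.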

Progress Events and Job State Machine

Clients subscribe to job progress via Server-Sent Events (SSE) or webhook callbacks. The job state machine has the following states: queued → ingesting → transcoding → validating → completed (or failed). Each state transition emits an event with a progress percentage and estimated time remaining.
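A sketch of the state machine as a transition table. The loop from validating back to transcoding covers the VMAF re-encode path described under Quality Validation; the progress percentages are illustrative:

```python
# Allowed transitions; any state can fail, and a failed quality check
# sends validating back to transcoding for a re-encode.
TRANSITIONS = {
    "queued":      {"ingesting", "failed"},
    "ingesting":   {"transcoding", "failed"},
    "transcoding": {"validating", "failed"},
    "validating":  {"completed", "transcoding", "failed"},
    "completed":   set(),
    "failed":      set(),
}

# Rough progress percentage reported with each state's events.
PROGRESS = {"queued": 0, "ingesting": 10, "transcoding": 40,
            "validating": 90, "completed": 100, "failed": 100}

def transition(state: str, new_state: str) -> str:
    """Validate and apply a state change, e.g. before emitting an SSE event."""
    if new_state not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state
```

Centralizing legality checks in one table means a buggy worker cannot, say, mark a job completed without passing through validation.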

On completion, the worker writes the HLS master manifest path, all segment paths, thumbnail paths, and extracted metadata to the video record. The raw source file is moved to cold storage or deleted per retention policy.

Frequently Asked Questions

What is a video processing pipeline in system design?

A video processing pipeline ingests raw uploaded video, validates and demuxes it, runs transcoding workers to produce multiple quality renditions, performs quality validation, packages the output into adaptive streaming formats (HLS, DASH), and publishes the result to an origin store fronted by a CDN. Each stage is decoupled via a job queue so workers can scale independently, retry on failure, and run expensive transcoding in parallel across many machines.

How does adaptive bitrate transcoding work?

Adaptive bitrate (ABR) transcoding produces multiple renditions of the same video at different resolutions and bitrates (e.g., 240p/400kbps, 720p/2.5Mbps, 1080p/8Mbps). The packager writes a manifest file (m3u8 for HLS, mpd for DASH) listing all renditions and their segment URLs. The client player begins playback at a conservative rendition, measures actual download throughput and buffer health every few seconds, and switches to a higher or lower rendition at segment boundaries — keeping the buffer full while maximizing quality for the available bandwidth.
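The player-side switching decision can be sketched as a throughput-based pick with a safety margin. The ladder values and the 0.8 safety factor are illustrative assumptions; production players also weigh buffer health, as noted above:

```python
# Rendition ladder: (name, peak bandwidth in bps), matching the example above.
LADDER = [("240p", 400_000), ("720p", 2_500_000), ("1080p", 8_000_000)]

def select_rendition(measured_bps: float, ladder=LADDER, safety: float = 0.8):
    """Pick the highest rendition that fits within a safety margin of
    measured throughput; fall back to the lowest rendition if none fit."""
    affordable = [r for r in ladder if r[1] <= measured_bps * safety]
    if affordable:
        return max(affordable, key=lambda r: r[1])
    return min(ladder, key=lambda r: r[1])
```

The safety factor keeps the chosen rendition's bitrate below measured throughput so the buffer keeps filling even if the network dips between measurements.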

How do you validate video quality after transcoding?

Quality validation runs automated metrics over the transcoded output before it is published. Structural checks verify container integrity, codec parameters, frame rate, and audio sync. Perceptual quality metrics such as VMAF (Video Multimethod Assessment Fusion) compare the transcoded rendition against the source and produce a score correlated with human perception; a score below a configurable threshold triggers a re-transcode or human review. Thumbnail spot-checks and black-frame detection catch encoder bugs. All scores are logged for trending and regression alerting.

How do you prioritize video processing jobs for live vs on-demand content?

Live content requires near-real-time segment transcoding with strict latency SLAs (seconds), while on-demand uploads tolerate minutes to hours of processing time. Priority queues assign live jobs to a dedicated high-priority worker pool with reserved capacity, bypassing any shared queue depth. On-demand jobs enter a normal or low-priority queue and are subject to worker autoscaling based on queue depth. Within on-demand, additional factors — creator tier, expected audience size, content age — can further rank jobs so recently uploaded popular content finishes before archival re-encodes.
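One way to sketch the on-demand ranking is a scoring heuristic over the factors listed above. The weights here are invented for illustration; a real system tunes them against its SLA targets:

```python
def job_priority(is_live: bool, creator_tier: int, expected_views: int,
                 age_hours: float) -> int:
    """Lower score = processed sooner (matching the priority-queue convention).

    Weights are illustrative assumptions, not tuned values.
    """
    if is_live:
        return 0  # live jobs bypass ranking and go to the reserved pool
    score = 100
    score -= creator_tier * 10                    # higher-tier creators jump ahead
    score -= min(expected_views // 10_000, 30)    # expected audience, capped
    score += int(min(age_hours, 48))              # older uploads drift toward batch work
    return max(score, 1)                          # never collide with the live band
```

Clamping the result above zero keeps every on-demand job strictly behind live work, no matter how the other factors combine.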

