Designing a media encoding pipeline requires coordinating upload handling, distributed transcoding, codec decisions, and delivery optimization into a coherent system. This deep dive covers each layer with implementation-level detail.
Upload Handling
Large video files require chunked multipart uploads to object storage (S3, GCS). The client initiates an upload session and receives an upload_id that tracks the entire session. The file is split client-side into fixed-size parts (typically 5–100 MB each) and each part is uploaded independently with a part number and MD5 checksum in the request header. The object storage validates the checksum on receipt; a mismatch triggers a retry for that part only. Once all parts are uploaded, the client sends a CompleteMultipartUpload request with the list of part ETags. The server assembles the parts server-side (no data transfer — just metadata stitching) and the final object becomes available. Resumability comes for free: if the upload is interrupted, the client queries which parts have been received and resumes from the first missing part. On completion, the storage service publishes an event (e.g., S3 ObjectCreated notification or a Pub/Sub message) to trigger downstream processing via a message queue.
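The part-splitting and resume logic above can be sketched in a few lines. This is an illustrative client-side model, not a real storage SDK call: `split_parts` and `missing_parts` are hypothetical helpers, and a production client would stream from disk rather than hold the payload in memory.

```python
import hashlib

PART_SIZE = 8 * 1024 * 1024  # 8 MB parts, within the typical 5-100 MB range

def split_parts(data: bytes, part_size: int = PART_SIZE):
    """Split a payload into numbered parts with MD5 checksums.

    Part numbers start at 1, matching multipart upload conventions;
    the checksum travels with each part so the server can validate it.
    """
    parts = []
    for i in range(0, len(data), part_size):
        chunk = data[i:i + part_size]
        parts.append((i // part_size + 1, hashlib.md5(chunk).hexdigest(), chunk))
    return parts

def missing_parts(all_parts, received_part_numbers):
    """Given the server's list of received part numbers, return the parts
    still to upload -- this is the resume path after an interruption."""
    received = set(received_part_numbers)
    return [p for p in all_parts if p[0] not in received]
```

On resume, the client asks the storage service which part numbers it already holds and feeds that list to `missing_parts`; only the gap is re-uploaded.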
Job Orchestration
Transcoding is modeled as a DAG (directed acyclic graph) workflow. A typical pipeline looks like: upload → validate → transcode[multiple resolutions in parallel] → thumbnail_generation → quality_check → publish. A job coordinator service (Apache Airflow, AWS Step Functions, or a custom orchestrator) persists the state of each task in a durable store. Per-task states include: PENDING, RUNNING, SUCCEEDED, FAILED. Failed tasks are retried up to a configurable max (e.g., 3 attempts) with exponential backoff. The job is marked complete only when all leaf tasks have reached SUCCEEDED state. Parallel transcoding tasks fan out to a worker fleet via a job queue (SQS, Kafka, RabbitMQ); each worker pulls one task, acquires a lock on the job record, and writes its result back. The coordinator watches task completions to determine which downstream tasks become unblocked. This DAG model cleanly handles branching (parallel resolution encodes) and merging (quality check waits for all encodes).
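The coordinator's core decision — which tasks become unblocked as upstream tasks succeed — can be sketched as a small pure function over the DAG. The task names mirror the pipeline above; the DAG literal and the backoff constants are illustrative assumptions, not a real orchestrator's API.

```python
# Hypothetical DAG mirroring the pipeline in the text: task -> upstream deps.
DAG = {
    "validate": {"upload"},
    "transcode_720p": {"validate"},
    "transcode_1080p": {"validate"},
    "thumbnails": {"validate"},
    "quality_check": {"transcode_720p", "transcode_1080p"},
    "publish": {"quality_check", "thumbnails"},
}

def unblocked(dag, succeeded, in_flight=frozenset()):
    """Tasks whose every dependency has SUCCEEDED and which are not
    already running or done -- these fan out to the worker queue."""
    done = set(succeeded)
    return sorted(
        t for t, deps in dag.items()
        if deps <= done and t not in done and t not in in_flight
    )

def backoff_seconds(attempt, base=2.0, cap=60.0):
    """Exponential backoff for a failed task's Nth retry: 2s, 4s, 8s... capped."""
    return min(cap, base * (2 ** (attempt - 1)))
```

Note how the merge point falls out naturally: `quality_check` only appears in the unblocked set once both parallel encodes are in the succeeded set.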
Codec Selection
Codec choice determines file size, decode compatibility, and licensing cost. H.264 (AVC) remains the default for maximum device and browser compatibility — supported everywhere including older hardware. H.265 (HEVC) delivers roughly 40% smaller files at equivalent perceptual quality, but requires licensing fees and has incomplete browser support (notably limited in Firefox). AV1 achieves the best compression ratio of any widely deployed codec, is royalty-free, and is now supported in Chrome, Firefox, and Edge; it’s the long-term successor to VP9. VP9 is YouTube’s primary codec for 1080p+ content, royalty-free and well-supported in Chrome. In practice, a pipeline generates multiple codec variants per video: H.264 for broad compatibility, AV1 for modern browsers to reduce CDN costs. Codec selection logic keys on content type (animation encodes differently than live action) and target platform capabilities reported via the Accept header or client hints.
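The selection logic at delivery time can be sketched as a preference-ordered match against the client's advertised capabilities. The substring matching here is a deliberate simplification — real clients advertise support via the Accept header's `codecs` parameter or the Media Capabilities API, and `pick_codec` is a hypothetical helper.

```python
def pick_codec(accept_header: str) -> str:
    """Choose the best codec variant a client can decode, preferring
    compression efficiency, with H.264 as the universal fallback.

    Codec tokens: av01 = AV1, vp9 = VP9, hvc1 = HEVC (RFC 6381 naming).
    """
    accept = accept_header.lower()
    for codec, token in [("av1", "av01"), ("vp9", "vp9"), ("hevc", "hvc1")]:
        if token in accept:
            return codec
    return "h264"
```

A client advertising nothing specific falls through to H.264, which matches the "broad compatibility" role it plays in the variant set above.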
Quality Ladder
A quality ladder defines the set of resolution/bitrate pairs produced for adaptive bitrate (ABR) streaming. A standard ladder: 240p@300 kbps, 360p@700 kbps, 480p@1200 kbps, 720p@2500 kbps, 1080p@5000 kbps, 4K@15000 kbps. Audio: AAC stereo at 128 kbps for standard content, 5.1 surround encoded at 384 kbps for supported content. A fixed ladder wastes bits on simple content (static talking-head video needs far fewer bits than fast-action sports). Per-title encoding solves this: analyze each video’s complexity (spatial detail, motion, scene changes) before encoding and generate an optimal ladder for that specific title. Netflix’s per-shot encoding takes this further — the ladder changes shot-by-shot within the same video. The complexity analysis pass runs a quick low-quality preview encode and measures metrics like DCT coefficient variance and motion vector magnitude to determine the optimal bitrate allocations.
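A minimal sketch of the per-title idea: take the fixed ladder from above and scale its bitrates by a complexity score from the analysis pass. The 0.5–1.5x scaling band is an illustrative assumption, not a published formula — production per-title systems fit a rate–quality curve from probe encodes rather than applying a single linear factor.

```python
# Fixed ladder from the text: (height, kbps).
BASE_LADDER = [(240, 300), (360, 700), (480, 1200),
               (720, 2500), (1080, 5000), (2160, 15000)]

def per_title_ladder(complexity: float, base=BASE_LADDER):
    """Scale the fixed ladder by a per-title complexity score in [0, 1],
    where 0 is a static talking head and 1 is fast-action sports.

    complexity 0.5 reproduces the base ladder; simpler content gets
    proportionally fewer bits at every rung.
    """
    factor = 0.5 + complexity  # maps [0, 1] -> [0.5x, 1.5x] of base bitrate
    return [(height, round(kbps * factor)) for height, kbps in base]
```

For a talking-head video with complexity 0.0, the 1080p rung drops from 5000 to 2500 kbps — the kind of saving that motivates the analysis pass in the first place.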
Perceptual Quality Metrics
PSNR (peak signal-to-noise ratio) is easy to compute but correlates poorly with human perception. SSIM (structural similarity index) is better — it measures luminance, contrast, and structural similarity. VMAF (Video Multi-method Assessment Fusion), developed by Netflix, uses a machine learning model trained on human opinion scores and is the industry standard for perceptual quality. Target VMAF >= 93 for high-quality delivery; scores below 85 are visibly degraded. The automated quality check step in the DAG runs VMAF scoring on each encode. Encodes below threshold are rejected and the task is retried with higher bitrate or different encoder settings. SSIM and PSNR are recorded as secondary metrics for trend analysis. VMAF computation is CPU-intensive (roughly 1x realtime on a single core); it runs on a subset of frames (every 5th frame) to reduce cost while maintaining accuracy.
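The quality-gate decision described above — accept at or above the VMAF target, otherwise retry with more bits up to the attempt cap — reduces to a small state function. The 25% bitrate bump per retry is an illustrative assumption; the VMAF 93 target and the 3-attempt cap come from the text.

```python
VMAF_TARGET = 93.0
MAX_ATTEMPTS = 3
BITRATE_BUMP = 1.25  # illustrative: raise bitrate 25% on each retry

def quality_gate(vmaf: float, bitrate_kbps: int, attempt: int):
    """Decide whether an encode passes the VMAF gate.

    Returns ("accept" | "retry" | "fail", next_bitrate_kbps).
    """
    if vmaf >= VMAF_TARGET:
        return ("accept", bitrate_kbps)
    if attempt < MAX_ATTEMPTS:
        return ("retry", round(bitrate_kbps * BITRATE_BUMP))
    return ("fail", bitrate_kbps)

def sampled_frames(total_frames: int, stride: int = 5):
    """Frame indices scored when subsampling every 5th frame for VMAF."""
    return list(range(0, total_frames, stride))
```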
Thumbnail Generation
Thumbnail quality significantly impacts click-through rate. The naive approach — extract a frame at a fixed time offset — often captures a motion-blurred or unrepresentative frame. Better approach: extract candidate keyframes at regular intervals (e.g., every 10 seconds) plus at detected scene boundaries (scene boundary detection via histogram difference between consecutive frames). A trained ML model scores each candidate frame on visual appeal signals: sharpness (Laplacian variance), face presence and size (face detection model), text overlay (OCR model detects burned-in captions), motion blur (frequency domain analysis), and aesthetic composition (pretrained aesthetic scoring model). Top-N candidates (typically 3–5) are stored in object storage. A/B testing infrastructure presents different thumbnails to user cohorts and measures click-through rate to select the winner. The winner thumbnail is promoted as the default; the process can run continuously as new viewing data arrives.
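The candidate-ranking step can be sketched as a weighted sum over the per-frame signals, keeping the top-N. The weights here are illustrative placeholders — as the text notes, the real selection pressure comes from learned models and A/B-tested click-through data, not hand-tuned constants.

```python
import heapq

# Illustrative weights -- real systems learn these from click-through data.
WEIGHTS = {
    "sharpness": 0.30,     # Laplacian variance, normalized to [0, 1]
    "face": 0.30,          # face presence/size signal
    "aesthetic": 0.25,     # pretrained aesthetic model score
    "motion_blur": -0.15,  # blur counts against a frame
}

def rank_thumbnails(candidates, top_n=3):
    """Score candidate frames and return the top-N.

    Each candidate is (frame_id, {signal_name: value in [0, 1]});
    missing signals default to 0.
    """
    def score(signals):
        return sum(w * signals.get(name, 0.0) for name, w in WEIGHTS.items())
    return heapq.nlargest(top_n, candidates, key=lambda c: score(c[1]))
```

The returned top-N (3–5 in practice) are what get stored and fed to the A/B test; the weighted-sum stand-in keeps the ranking interface identical to a learned scorer's.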
Segment Output for Adaptive Streaming
HLS (HTTP Live Streaming) and DASH (Dynamic Adaptive Streaming over HTTP) both require segmenting the encoded video into short chunks. HLS uses fixed-duration segments, typically 2–10 seconds, with a .m3u8 manifest listing segment URLs. DASH uses an MPD (media presentation description) manifest and supports both fixed and variable segment durations. Segments are stored in object storage with cache-friendly URL design: include the content hash or encode fingerprint in the path to produce immutable URLs (e.g., /encoded/{content_hash}/{resolution}/{segment_number}.ts). Immutable URLs allow infinite CDN cache TTLs — the file never changes, so it never needs invalidation. Only the manifest file has a short TTL since it references the latest segments. This design minimizes origin load: the CDN serves 99%+ of segment requests from cache with no origin hit.
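The immutable URL scheme and a minimal HLS media playlist can be sketched as follows. The path layout follows the template in the text; truncating the SHA-256 hash to 16 hex characters is an assumption for readability, and a production packager emits considerably more manifest metadata than shown.

```python
import hashlib

def segment_path(content: bytes, resolution: str, segment_number: int) -> str:
    """Build the immutable, cache-friendly segment URL:
    /encoded/{content_hash}/{resolution}/{segment_number}.ts
    The hash makes the URL content-addressed, so CDN TTLs can be infinite.
    """
    content_hash = hashlib.sha256(content).hexdigest()[:16]
    return f"/encoded/{content_hash}/{resolution}/{segment_number}.ts"

def hls_manifest(segment_urls, segment_duration=6.0):
    """Minimal .m3u8 media playlist listing the segments in order."""
    lines = [
        "#EXTM3U",
        "#EXT-X-VERSION:3",
        f"#EXT-X-TARGETDURATION:{int(segment_duration) + 1}",
        "#EXT-X-MEDIA-SEQUENCE:0",
    ]
    for url in segment_urls:
        lines.append(f"#EXTINF:{segment_duration:.3f},")
        lines.append(url)
    lines.append("#EXT-X-ENDLIST")
    return "\n".join(lines)
```

Note the asymmetry the text describes: the segment URLs are immutable and infinitely cacheable, while the manifest — the only thing that changes — is the one artifact needing a short TTL.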
GPU Acceleration
Software encoding with FFmpeg on CPU is the baseline but is slow and expensive at scale. NVIDIA’s NVENC hardware encoder (integrated into the GPU) achieves 10–20x faster encoding than CPU at comparable settings, though software encoders still win somewhat on compression efficiency at a given bitrate. A single GPU can replace 10–20 CPU cores for transcoding workloads. The transcoding worker fleet is GPU-accelerated (g4dn or p3 instances on AWS) and autoscales based on job queue depth: a CloudWatch alarm triggers an ASG scale-out when queue depth exceeds a threshold, and scale-in when the queue drains. Spot/preemptible instances reduce cost by 60–70% for transcoding — jobs are resumable (segment-level checkpointing), so instance interruption only loses the current segment. A small on-demand baseline fleet handles urgent jobs that cannot tolerate interruption delays.
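The scaling policy reduces to a target-tracking calculation on queue depth. The constants here (jobs per worker, fleet bounds) are illustrative assumptions standing in for tuned CloudWatch alarm thresholds and ASG limits.

```python
import math

JOBS_PER_WORKER = 4   # target backlog per GPU worker (illustrative)
MIN_ON_DEMAND = 2     # on-demand baseline fleet for urgent jobs
MAX_WORKERS = 50      # hard cap on fleet size

def desired_workers(queue_depth: int) -> int:
    """Target fleet size from queue depth -- the same signal a CloudWatch
    alarm would feed an ASG. Clamped between the on-demand baseline
    (never scale to zero) and the fleet cap."""
    wanted = math.ceil(queue_depth / JOBS_PER_WORKER)
    return max(MIN_ON_DEMAND, min(MAX_WORKERS, wanted))
```

An empty queue holds the fleet at the on-demand baseline rather than zero, which is what lets urgent jobs start without a cold-scale delay.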