Designing a media encoding pipeline requires coordinating upload handling, distributed transcoding, codec decisions, and delivery optimization into a coherent system. This deep dive covers each layer with implementation-level detail.
Upload Handling
Large video files require chunked multipart uploads to object storage (S3, GCS). The client initiates an upload session and receives an upload_id that tracks the entire session. The file is split client-side into fixed-size parts (typically 5–100 MB each) and each part is uploaded independently with a part number and MD5 checksum in the request header. The object storage validates the checksum on receipt; a mismatch triggers a retry for that part only. Once all parts are uploaded, the client sends a CompleteMultipartUpload request with the list of part ETags. The server assembles the parts server-side (no data transfer — just metadata stitching) and the final object becomes available. Resumability comes for free: if the upload is interrupted, the client queries which parts have been received and resumes from the first missing part. On completion, the storage service publishes an event (e.g., S3 ObjectCreated notification or a Pub/Sub message) to trigger downstream processing via a message queue.
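The part-splitting and resume logic above can be sketched in a few lines. This is an illustrative client-side model, not a real storage SDK call: `split_parts` and `missing_parts` are hypothetical helpers, and a production client would stream from disk rather than hold the payload in memory.

```python
import hashlib

PART_SIZE = 8 * 1024 * 1024  # 8 MB parts, within the typical 5-100 MB range

def split_parts(data: bytes, part_size: int = PART_SIZE):
    """Split a payload into numbered parts with MD5 checksums.

    Part numbers start at 1, matching multipart upload conventions;
    the checksum travels with each part so the server can validate it.
    """
    parts = []
    for i in range(0, len(data), part_size):
        chunk = data[i:i + part_size]
        parts.append((i // part_size + 1, hashlib.md5(chunk).hexdigest(), chunk))
    return parts

def missing_parts(all_parts, received_part_numbers):
    """Given the server's list of received part numbers, return the parts
    still to upload -- this is the resume path after an interruption."""
    received = set(received_part_numbers)
    return [p for p in all_parts if p[0] not in received]
```

On resume, the client asks the storage service which part numbers it already holds and feeds that list to `missing_parts`; only the gap is re-uploaded.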
Job Orchestration
Transcoding is modeled as a DAG (directed acyclic graph) workflow. A typical pipeline looks like: upload → validate → transcode[multiple resolutions in parallel] → thumbnail_generation → quality_check → publish. A job coordinator service (Apache Airflow, AWS Step Functions, or a custom orchestrator) persists the state of each task in a durable store. Per-task states include: PENDING, RUNNING, SUCCEEDED, FAILED. Failed tasks are retried up to a configurable max (e.g., 3 attempts) with exponential backoff. The job is marked complete only when all leaf tasks have reached SUCCEEDED state. Parallel transcoding tasks fan out to a worker fleet via a job queue (SQS, Kafka, RabbitMQ); each worker pulls one task, acquires a lock on the job record, and writes its result back. The coordinator watches task completions to determine which downstream tasks become unblocked. This DAG model cleanly handles branching (parallel resolution encodes) and merging (quality check waits for all encodes).
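The coordinator's core decision — which tasks become unblocked as upstream tasks succeed — can be sketched as a small pure function over the DAG. The task names mirror the pipeline above; the DAG literal and the backoff constants are illustrative assumptions, not a real orchestrator's API.

```python
# Hypothetical DAG mirroring the pipeline in the text: task -> upstream deps.
DAG = {
    "validate": {"upload"},
    "transcode_720p": {"validate"},
    "transcode_1080p": {"validate"},
    "thumbnails": {"validate"},
    "quality_check": {"transcode_720p", "transcode_1080p"},
    "publish": {"quality_check", "thumbnails"},
}

def unblocked(dag, succeeded, in_flight=frozenset()):
    """Tasks whose every dependency has SUCCEEDED and which are not
    already running or done -- these fan out to the worker queue."""
    done = set(succeeded)
    return sorted(
        t for t, deps in dag.items()
        if deps <= done and t not in done and t not in in_flight
    )

def backoff_seconds(attempt, base=2.0, cap=60.0):
    """Exponential backoff for a failed task's Nth retry: 2s, 4s, 8s... capped."""
    return min(cap, base * (2 ** (attempt - 1)))
```

Note how the merge point falls out naturally: `quality_check` only appears in the unblocked set once both parallel encodes are in the succeeded set.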
Codec Selection
Codec choice determines file size, decode compatibility, and licensing cost. H.264 (AVC) remains the default for maximum device and browser compatibility — supported everywhere including older hardware. H.265 (HEVC) delivers roughly 40% smaller files at equivalent perceptual quality, but requires licensing fees and has incomplete browser support (notably limited in Firefox). AV1 achieves the best compression ratio of any widely deployed codec, is royalty-free, and is now supported in Chrome, Firefox, and Edge; it’s the long-term successor to VP9. VP9 is YouTube’s primary codec for 1080p+ content, royalty-free and well-supported in Chrome. In practice, a pipeline generates multiple codec variants per video: H.264 for broad compatibility, AV1 for modern browsers to reduce CDN costs. Codec selection logic keys on content type (animation encodes differently than live action) and target platform capabilities reported via the Accept header or client hints.
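The selection logic at delivery time can be sketched as a preference-ordered match against the client's advertised capabilities. The substring matching here is a deliberate simplification — real clients advertise support via the Accept header's `codecs` parameter or the Media Capabilities API, and `pick_codec` is a hypothetical helper.

```python
def pick_codec(accept_header: str) -> str:
    """Choose the best codec variant a client can decode, preferring
    compression efficiency, with H.264 as the universal fallback.

    Codec tokens: av01 = AV1, vp9 = VP9, hvc1 = HEVC (RFC 6381 naming).
    """
    accept = accept_header.lower()
    for codec, token in [("av1", "av01"), ("vp9", "vp9"), ("hevc", "hvc1")]:
        if token in accept:
            return codec
    return "h264"
```

A client advertising nothing specific falls through to H.264, which matches the "broad compatibility" role it plays in the variant set above.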
Quality Ladder
A quality ladder defines the set of resolution/bitrate pairs produced for adaptive bitrate (ABR) streaming. A standard ladder: 240p@300 kbps, 360p@700 kbps, 480p@1200 kbps, 720p@2500 kbps, 1080p@5000 kbps, 4K@15000 kbps. Audio: AAC stereo at 128 kbps for standard content, 5.1 surround encoded at 384 kbps for supported content. A fixed ladder wastes bits on simple content (static talking-head video needs far fewer bits than fast-action sports). Per-title encoding solves this: analyze each video’s complexity (spatial detail, motion, scene changes) before encoding and generate an optimal ladder for that specific title. Netflix’s per-shot encoding takes this further — the ladder changes shot-by-shot within the same video. The complexity analysis pass runs a quick low-quality preview encode and measures metrics like DCT coefficient variance and motion vector magnitude to determine the optimal bitrate allocations.
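A minimal sketch of the per-title idea: take the fixed ladder from above and scale its bitrates by a complexity score from the analysis pass. The 0.5–1.5x scaling band is an illustrative assumption, not a published formula — production per-title systems fit a rate–quality curve from probe encodes rather than applying a single linear factor.

```python
# Fixed ladder from the text: (height, kbps).
BASE_LADDER = [(240, 300), (360, 700), (480, 1200),
               (720, 2500), (1080, 5000), (2160, 15000)]

def per_title_ladder(complexity: float, base=BASE_LADDER):
    """Scale the fixed ladder by a per-title complexity score in [0, 1],
    where 0 is a static talking head and 1 is fast-action sports.

    complexity 0.5 reproduces the base ladder; simpler content gets
    proportionally fewer bits at every rung.
    """
    factor = 0.5 + complexity  # maps [0, 1] -> [0.5x, 1.5x] of base bitrate
    return [(height, round(kbps * factor)) for height, kbps in base]
```

For a talking-head video with complexity 0.0, the 1080p rung drops from 5000 to 2500 kbps — the kind of saving that motivates the analysis pass in the first place.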
Perceptual Quality Metrics
PSNR (peak signal-to-noise ratio) is easy to compute but correlates poorly with human perception. SSIM (structural similarity index) is better — it measures luminance, contrast, and structural similarity. VMAF (Video Multi-method Assessment Fusion), developed by Netflix, uses a machine learning model trained on human opinion scores and is the industry standard for perceptual quality. Target VMAF >= 93 for high-quality delivery; scores below 85 are visibly degraded. The automated quality check step in the DAG runs VMAF scoring on each encode. Encodes below threshold are rejected and the task is retried with higher bitrate or different encoder settings. SSIM and PSNR are recorded as secondary metrics for trend analysis. VMAF computation is CPU-intensive (roughly 1x realtime on a single core); it runs on a subset of frames (every 5th frame) to reduce cost while maintaining accuracy.
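The quality-gate decision described above — accept at or above the VMAF target, otherwise retry with more bits up to the attempt cap — reduces to a small state function. The 25% bitrate bump per retry is an illustrative assumption; the VMAF 93 target and the 3-attempt cap come from the text.

```python
VMAF_TARGET = 93.0
MAX_ATTEMPTS = 3
BITRATE_BUMP = 1.25  # illustrative: raise bitrate 25% on each retry

def quality_gate(vmaf: float, bitrate_kbps: int, attempt: int):
    """Decide whether an encode passes the VMAF gate.

    Returns ("accept" | "retry" | "fail", next_bitrate_kbps).
    """
    if vmaf >= VMAF_TARGET:
        return ("accept", bitrate_kbps)
    if attempt < MAX_ATTEMPTS:
        return ("retry", round(bitrate_kbps * BITRATE_BUMP))
    return ("fail", bitrate_kbps)

def sampled_frames(total_frames: int, stride: int = 5):
    """Frame indices scored when subsampling every 5th frame for VMAF."""
    return list(range(0, total_frames, stride))
```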
Thumbnail Generation
Thumbnail quality significantly impacts click-through rate. The naive approach — extract a frame at a fixed time offset — often captures a motion-blurred or unrepresentative frame. Better approach: extract candidate keyframes at regular intervals (e.g., every 10 seconds) plus at detected scene boundaries (scene boundary detection via histogram difference between consecutive frames). A trained ML model scores each candidate frame on visual appeal signals: sharpness (Laplacian variance), face presence and size (face detection model), text overlay (OCR model detects burned-in captions), motion blur (frequency domain analysis), and aesthetic composition (pretrained aesthetic scoring model). Top-N candidates (typically 3–5) are stored in object storage. A/B testing infrastructure presents different thumbnails to user cohorts and measures click-through rate to select the winner. The winner thumbnail is promoted as the default; the process can run continuously as new viewing data arrives.
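The candidate-ranking step can be sketched as a weighted sum over the per-frame signals, keeping the top-N. The weights here are illustrative placeholders — as the text notes, the real selection pressure comes from learned models and A/B-tested click-through data, not hand-tuned constants.

```python
import heapq

# Illustrative weights -- real systems learn these from click-through data.
WEIGHTS = {
    "sharpness": 0.30,     # Laplacian variance, normalized to [0, 1]
    "face": 0.30,          # face presence/size signal
    "aesthetic": 0.25,     # pretrained aesthetic model score
    "motion_blur": -0.15,  # blur counts against a frame
}

def rank_thumbnails(candidates, top_n=3):
    """Score candidate frames and return the top-N.

    Each candidate is (frame_id, {signal_name: value in [0, 1]});
    missing signals default to 0.
    """
    def score(signals):
        return sum(w * signals.get(name, 0.0) for name, w in WEIGHTS.items())
    return heapq.nlargest(top_n, candidates, key=lambda c: score(c[1]))
```

The returned top-N (3–5 in practice) are what get stored and fed to the A/B test; the weighted-sum stand-in keeps the ranking interface identical to a learned scorer's.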
Segment Output for Adaptive Streaming
HLS (HTTP Live Streaming) and DASH (Dynamic Adaptive Streaming over HTTP) both require segmenting the encoded video into short chunks. HLS uses fixed-duration segments, typically 2–10 seconds, with a .m3u8 manifest listing segment URLs. DASH uses an MPD (media presentation description) manifest and supports both fixed and variable segment durations. Segments are stored in object storage with cache-friendly URL design: include the content hash or encode fingerprint in the path to produce immutable URLs (e.g., /encoded/{content_hash}/{resolution}/{segment_number}.ts). Immutable URLs allow infinite CDN cache TTLs — the file never changes, so it never needs invalidation. Only the manifest file has a short TTL since it references the latest segments. This design minimizes origin load: the CDN serves 99%+ of segment requests from cache with no origin hit.
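The immutable URL scheme and a minimal HLS media playlist can be sketched as follows. The path layout follows the template in the text; truncating the SHA-256 hash to 16 hex characters is an assumption for readability, and a production packager emits considerably more manifest metadata than shown.

```python
import hashlib

def segment_path(content: bytes, resolution: str, segment_number: int) -> str:
    """Build the immutable, cache-friendly segment URL:
    /encoded/{content_hash}/{resolution}/{segment_number}.ts
    The hash makes the URL content-addressed, so CDN TTLs can be infinite.
    """
    content_hash = hashlib.sha256(content).hexdigest()[:16]
    return f"/encoded/{content_hash}/{resolution}/{segment_number}.ts"

def hls_manifest(segment_urls, segment_duration=6.0):
    """Minimal .m3u8 media playlist listing the segments in order."""
    lines = [
        "#EXTM3U",
        "#EXT-X-VERSION:3",
        f"#EXT-X-TARGETDURATION:{int(segment_duration) + 1}",
        "#EXT-X-MEDIA-SEQUENCE:0",
    ]
    for url in segment_urls:
        lines.append(f"#EXTINF:{segment_duration:.3f},")
        lines.append(url)
    lines.append("#EXT-X-ENDLIST")
    return "\n".join(lines)
```

Note the asymmetry the text describes: the segment URLs are immutable and infinitely cacheable, while the manifest — the only thing that changes — is the one artifact needing a short TTL.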
GPU Acceleration
Software encoding with FFmpeg on CPU is the baseline but is slow and expensive at scale. NVIDIA’s NVENC hardware encoder (integrated into the GPU) achieves 10–20x faster encoding than CPU at comparable settings, though software encoders still win somewhat on compression efficiency at a given bitrate. A single GPU can replace 10–20 CPU cores for transcoding workloads. The transcoding worker fleet is GPU-accelerated (g4dn or p3 instances on AWS) and autoscales based on job queue depth: a CloudWatch alarm triggers an ASG scale-out when queue depth exceeds a threshold, and scale-in when the queue drains. Spot/preemptible instances reduce cost by 60–70% for transcoding — jobs are resumable (segment-level checkpointing), so instance interruption only loses the current segment. A small on-demand baseline fleet handles urgent jobs that cannot tolerate interruption delays.
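The scaling policy reduces to a target-tracking calculation on queue depth. The constants here (jobs per worker, fleet bounds) are illustrative assumptions standing in for tuned CloudWatch alarm thresholds and ASG limits.

```python
import math

JOBS_PER_WORKER = 4   # target backlog per GPU worker (illustrative)
MIN_ON_DEMAND = 2     # on-demand baseline fleet for urgent jobs
MAX_WORKERS = 50      # hard cap on fleet size

def desired_workers(queue_depth: int) -> int:
    """Target fleet size from queue depth -- the same signal a CloudWatch
    alarm would feed an ASG. Clamped between the on-demand baseline
    (never scale to zero) and the fleet cap."""
    wanted = math.ceil(queue_depth / JOBS_PER_WORKER)
    return max(MIN_ON_DEMAND, min(MAX_WORKERS, wanted))
```

An empty queue holds the fleet at the on-demand baseline rather than zero, which is what lets urgent jobs start without a cold-scale delay.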