Low Level Design: Video Streaming Service

Video streaming at scale involves multiple specialized subsystems working in tight coordination: upload pipelines, real-time transcoding, adaptive bitrate delivery, content protection, and client-side playback state. This guide covers the low-level design of each layer.

Video Upload Pipeline

Large video files cannot be uploaded as a single HTTP request — network interruptions would require starting over. Instead, clients use chunked (resumable) upload:

  1. Client requests an upload session: POST /uploads returns an upload_id and a presigned URL pattern.
  2. Client splits the file into 5–16 MB chunks and uploads each to a presigned object-storage URL (as S3 multipart upload parts, or with a Content-Range header against an equivalent resumable protocol).
  3. On network failure, the client queries the upload session for the last received byte offset and resumes from there.
  4. On final chunk receipt, object storage emits a completion event. The upload service marks the upload record as processing and publishes a transcoding_job message to Kafka: {video_id, source_path, upload_id, requested_quality_ladder}.
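
The chunking and resume arithmetic from the steps above can be sketched client-side; the 8 MB chunk size is an illustrative value within the 5–16 MB range, and the surrounding session API is assumed, not a real endpoint:

```python
# Sketch of resumable chunk planning (hypothetical client, 8 MB chunks).
CHUNK_SIZE = 8 * 1024 * 1024

def chunk_ranges(file_size: int, start_offset: int = 0, chunk_size: int = CHUNK_SIZE):
    """Yield (start, end_inclusive) byte ranges for Content-Range headers,
    resuming from start_offset after querying the upload session."""
    offset = start_offset
    while offset < file_size:
        end = min(offset + chunk_size, file_size) - 1
        yield offset, end
        offset = end + 1

# Resuming a 20 MB upload after the first 12 MB were acknowledged:
ranges = list(chunk_ranges(20 * 1024 * 1024, start_offset=12 * 1024 * 1024))
```

Each range maps to a `Content-Range: bytes {start}-{end}/{total}` header on the chunk's PUT request.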

Source files are stored in a raw bucket separate from the transcoded output bucket. Lifecycle policies delete source files after transcoding completes successfully (or after 30 days if transcoding fails repeatedly).

Transcoding Pipeline

Transcoding workers consume jobs from the Kafka transcoding_jobs topic (partitioned by video_id to avoid parallel transcoding of the same video). Each worker runs FFmpeg to produce the quality ladder:

Quality ladder (H.264 + AAC):
  240p  — 400 kbps video, 64 kbps audio
  480p  — 1000 kbps video, 128 kbps audio
  720p  — 2500 kbps video, 128 kbps audio
  1080p — 5000 kbps video, 192 kbps audio
  4K    — 15000 kbps video, 192 kbps audio
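
As one hedged illustration, the ladder can drive per-rendition FFmpeg invocations. A production transcoder would add flags this sketch omits (keyframe alignment across renditions, H.264 profiles/levels, segment naming), so treat the argument list as a sketch rather than a reference command:

```python
# Build FFmpeg arguments for one rung of the quality ladder (sketch).
LADDER = {
    "240p":  {"height": 240,  "v_kbps": 400,   "a_kbps": 64},
    "480p":  {"height": 480,  "v_kbps": 1000,  "a_kbps": 128},
    "720p":  {"height": 720,  "v_kbps": 2500,  "a_kbps": 128},
    "1080p": {"height": 1080, "v_kbps": 5000,  "a_kbps": 192},
    "4k":    {"height": 2160, "v_kbps": 15000, "a_kbps": 192},
}

def ffmpeg_args(source: str, quality: str, out_dir: str) -> list:
    r = LADDER[quality]
    return [
        "ffmpeg", "-i", source,
        "-c:v", "libx264", "-b:v", f"{r['v_kbps']}k",
        "-vf", f"scale=-2:{r['height']}",   # keep aspect ratio, even width
        "-c:a", "aac", "-b:a", f"{r['a_kbps']}k",
        "-f", "hls", "-hls_time", "6",      # 6-second segments
        f"{out_dir}/index.m3u8",
    ]
```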

Each quality level is segmented into 6-second chunks. Workers also generate:

  • Multiple audio tracks (original + dubbed languages) as separate streams.
  • Subtitle/caption tracks (WebVTT format) from uploaded SRT files or automated speech recognition output.
  • Thumbnail sprites: a single JPEG mosaic of keyframes used by the player scrubber.

Transcoding is CPU-bound. Workers run on spot/preemptible instances in an auto-scaling group. Job failures are retried up to 3 times with exponential backoff; persistent failures move to a dead-letter queue for manual inspection. Output segments are written to the CDN origin bucket with a path structure of /videos/{video_id}/{quality}/seg{N}.ts.
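
The retry policy can be sketched as a small wrapper around the transcode call; the 2-second backoff base is an assumption, and the queue client and transcode function are injected stand-ins:

```python
import time

MAX_ATTEMPTS = 3
BASE_DELAY_S = 2.0

def run_with_retries(job: dict, transcode_fn, dead_letter: list, sleep=time.sleep) -> bool:
    """Run transcode_fn(job) with exponential backoff between attempts;
    on persistent failure, forward the job to the dead-letter queue."""
    for attempt in range(MAX_ATTEMPTS):
        try:
            transcode_fn(job)
            return True
        except Exception:
            if attempt < MAX_ATTEMPTS - 1:
                sleep(BASE_DELAY_S * (2 ** attempt))  # 2s, 4s, ...
    dead_letter.append(job)
    return False
```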

HLS and DASH Adaptive Bitrate

HLS (HTTP Live Streaming): The transcoder generates a master playlist (master.m3u8) that references per-bitrate playlists. Each per-bitrate playlist lists the segment URLs with their durations. The client player downloads the master playlist, selects an initial bitrate based on current bandwidth estimate, and begins fetching segments sequentially.

#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=2500000,RESOLUTION=1280x720
720p/index.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=5000000,RESOLUTION=1920x1080
1080p/index.m3u8
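
Initial variant selection from a master playlist like the one above can be sketched as follows; a real player would use a full m3u8 parser rather than a regex over two tags, and the 0.8 safety margin is an illustrative value:

```python
import re

def parse_master(m3u8_text: str) -> list:
    """Return (bandwidth, uri) pairs from #EXT-X-STREAM-INF entries."""
    variants = []
    lines = m3u8_text.strip().splitlines()
    for i, line in enumerate(lines):
        m = re.search(r"BANDWIDTH=(\d+)", line)
        if line.startswith("#EXT-X-STREAM-INF") and m:
            variants.append((int(m.group(1)), lines[i + 1]))
    return variants

def pick_initial(variants, estimated_bps: int, margin: float = 0.8):
    """Highest variant whose bandwidth fits within estimate * margin;
    fall back to the lowest variant if none fits."""
    fitting = [v for v in variants if v[0] <= estimated_bps * margin]
    return max(fitting) if fitting else min(variants)
```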

DASH (Dynamic Adaptive Streaming over HTTP): Uses an XML manifest (.mpd) with AdaptationSet elements for video, audio, and subtitles. The structure is functionally equivalent to HLS but with a more flexible schema that handles multi-period content (e.g., mid-roll ad insertion) better.

ABR algorithm: The client player measures the download speed of each segment. If measured bandwidth drops below the current bitrate’s requirement (with a safety margin), the player switches down one quality level at the next segment boundary. Switching up requires sustained higher bandwidth across several segments to avoid oscillation; many players also weigh current buffer occupancy in the decision (hybrid throughput/buffer-based adaptation). Segment boundaries are the only switch points; there are no mid-segment quality changes.
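
The switching rule can be written as a per-segment decision; the 0.8 safety margin and 3-segment sustain window are illustrative values, not a specific player's tuning:

```python
# Ladder bitrates (bps) matching the quality ladder above.
LADDER_BPS = [400_000, 1_000_000, 2_500_000, 5_000_000, 15_000_000]
SAFETY = 0.8    # switch down when throughput * SAFETY < current bitrate
SUSTAIN = 3     # segments of headroom required before an up-switch

def next_quality(current: int, recent_bps: list) -> int:
    """Return the ladder index to use for the next segment.
    current: index into LADDER_BPS; recent_bps: per-segment throughput samples."""
    throughput = recent_bps[-1]
    if throughput * SAFETY < LADDER_BPS[current] and current > 0:
        return current - 1                      # immediate down-switch
    if current < len(LADDER_BPS) - 1:
        target = LADDER_BPS[current + 1]
        window = recent_bps[-SUSTAIN:]
        if len(window) == SUSTAIN and all(s * SAFETY >= target for s in window):
            return current + 1                  # sustained headroom: up-switch
    return current                              # hold
```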

CDN Delivery

Video segments are immutable once written: a given segment URL always returns the same bytes. This makes them ideal CDN cache objects. Cache-control headers are set to max-age=31536000, immutable for segments. Manifest files (.m3u8, .mpd) are mutable during live transcoding but immutable for VOD — VOD manifests also get long TTLs after the transcoding job completes.
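
The header policy above can be written down as a small rule, assuming the serving layer knows whether a manifest belongs to a live or VOD asset (the 2-second live-manifest TTL is an illustrative value):

```python
def cache_headers(path: str, is_live: bool) -> dict:
    """Cache-Control policy sketch: immutable segments, short-lived live
    manifests, long-lived VOD manifests."""
    if path.endswith((".ts", ".m4s", ".mp4")):
        # Segments never change once written.
        return {"Cache-Control": "public, max-age=31536000, immutable"}
    if path.endswith((".m3u8", ".mpd")):
        if is_live:
            # Live manifests grow every segment duration; cache briefly.
            return {"Cache-Control": "public, max-age=2"}
        return {"Cache-Control": "public, max-age=31536000, immutable"}
    return {"Cache-Control": "no-store"}
```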

CDN architecture: a multi-tier setup with a small number of origin shield nodes sitting between the CDN edge POPs and the object storage origin. The shield absorbs cache misses from many edge nodes, reducing origin load. Edge POP selection uses either anycast (every POP announces the same IP, and BGP routes each viewer to the topologically nearest one) or DNS-based geo-steering (the authoritative DNS returns the IP of the POP closest to the viewer’s resolver); large CDNs commonly combine both.

For popular videos, segments are proactively pushed to edge caches at publish time (cache warming) to avoid a cold-start miss storm when a video goes viral. The warming job reads the master playlist, enumerates all segment URLs, and issues GET requests through each edge POP, forcing population (full GETs rather than HEADs, since a HEAD response may not pull the segment body into cache).
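
A sketch of the warming job's enumeration step, assuming simple single-level media playlists (a real job would walk the master → media playlist hierarchy with a proper parser, and fetch is an injected HTTP client):

```python
def segment_urls(media_playlist: str, base_url: str) -> list:
    """Enumerate absolute segment URLs from a media playlist:
    every non-blank, non-comment line is a segment URI."""
    urls = []
    for line in media_playlist.strip().splitlines():
        if line and not line.startswith("#"):
            urls.append(f"{base_url.rstrip('/')}/{line}")
    return urls

def warm(urls: list, pops: list, fetch) -> int:
    """Issue one fetch per (POP, URL) pair; returns the request count."""
    count = 0
    for pop in pops:
        for url in urls:
            fetch(pop, url)
            count += 1
    return count
```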

DRM Content Protection

Premium content requires DRM. The industry-standard approach is Common Encryption: the content is encrypted once and can be decrypted by any supported DRM system using a common key ID. Two encryption schemes exist, CENC (AES-CTR) and CBCS (AES-CBC with pattern encryption); FairPlay requires CBCS, so a single encode serving all three DRM systems typically uses CBCS, which modern Widevine and PlayReady clients also support.

  • Widevine (Google): used by Chrome, Android, Chromecast. License server issues a Widevine license containing the content key to authenticated players.
  • FairPlay (Apple): used by Safari, iOS, tvOS. Requires a separate FairPlay license server; Apple mandates the FPS (FairPlay Streaming) protocol.
  • PlayReady (Microsoft): used by Edge, Xbox, Windows. License server issues PlayReady licenses.

License server flow: the player requests a license, including a device certificate and a license challenge generated by the DRM client. The license server verifies the user’s entitlement (authenticated session, active subscription), then issues the license containing the content decryption key wrapped for the specific device. Keys are never transmitted in the clear. The license server logs all key issuances for audit purposes. License TTL is short (e.g., 24 hours) to limit exposure if a device is compromised.
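
A heavily simplified sketch of that flow. The entitlement store, key store, and key wrapping are all mocks: real DRM systems wrap keys inside device-specific, often hardware-backed secure sessions, not the XOR stand-in used here.

```python
import hashlib
import time

LICENSE_TTL_S = 24 * 3600  # short TTL limits exposure if a device is compromised

def issue_license(user, video_id, device_cert: bytes, entitlements,
                  key_store, audit_log, now=time.time):
    """Mock license issuance: verify entitlement, wrap the content key for
    the device, log the issuance. Returns None if the user is not entitled."""
    if video_id not in entitlements.get(user, set()):
        return None
    content_key = key_store[video_id]
    # Stand-in for device-specific key wrapping: XOR against a key derived
    # from the device certificate (NOT real cryptographic key wrapping).
    device_key = hashlib.sha256(device_cert).digest()[:len(content_key)]
    wrapped = bytes(a ^ b for a, b in zip(content_key, device_key)).hex()
    issued_at = now()
    audit_log.append({"user": user, "video_id": video_id, "at": issued_at})
    return {"wrapped_key": wrapped, "expires_at": issued_at + LICENSE_TTL_S}
```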

Resume Playback

Playback position is stored server-side so users can resume on any device:

playback_positions (
  user_id     UUID,
  video_id    UUID,
  position_ms BIGINT,   -- milliseconds from start
  updated_at  TIMESTAMP,
  PRIMARY KEY (user_id, video_id)
)

The client writes position updates at two points: on pause events and on a periodic 10-second timer while playing. Writing every second would generate excessive traffic; 10 seconds means at most 10 seconds of progress is lost on a crash. On starting playback, the player fetches the stored position and seeks to it before beginning segment download. If position is within 5 seconds of the end, playback starts from the beginning (treating the video as "rewatched").
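
The resume rules can be sketched as pure functions; the 10-second save cadence and 5-second end window come directly from the text above:

```python
NEAR_END_MS = 5_000       # within 5 s of the end counts as "finished"
SAVE_INTERVAL_MS = 10_000  # periodic save cadence while playing

def resume_position(stored_ms, duration_ms: int) -> int:
    """Where playback should start: the stored position, or 0 when nothing
    is stored or the user effectively finished the video."""
    if stored_ms is None or stored_ms >= duration_ms - NEAR_END_MS:
        return 0
    return stored_ms

def should_save(last_saved_ms: int, position_ms: int, paused: bool) -> bool:
    """Save on pause events, or after every 10 s of playback progress."""
    return paused or position_ms - last_saved_ms >= SAVE_INTERVAL_MS
```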

The position table uses a simple upsert (INSERT … ON CONFLICT DO UPDATE). The last-writer-wins semantics are acceptable here: if the same user has two devices playing simultaneously (unusual), whichever writes last wins — position conflicts are not worth a distributed coordination protocol.

Video Metadata and Thumbnails

Video metadata is stored in a relational database (PostgreSQL) and cached in Redis:

videos (
  video_id        UUID PRIMARY KEY,
  title           TEXT,
  description     TEXT,
  duration_ms     BIGINT,
  tags            TEXT[],
  category_id     INT,
  content_rating  TEXT,   -- G | PG | PG-13 | R
  thumbnail_url   TEXT,
  upload_user_id  UUID,
  published_at    TIMESTAMP,
  status          TEXT    -- processing | published | unlisted | deleted
)

Thumbnail generation runs as part of the transcoding pipeline. FFmpeg extracts keyframes at regular intervals (every 10 seconds). A lightweight ML model (MobileNet-based classifier trained on click-through rate data) scores each candidate frame for visual quality, face presence, motion blur, and brightness. The highest-scoring frame is selected as the default thumbnail. Creators can override with a custom upload. Thumbnails are stored in object storage and served via CDN.

Player Metrics and Quality of Experience

The player client reports telemetry events to a metrics ingestion endpoint:

  • startup_time_ms: time from play() call to first frame rendered.
  • bitrate_switch: {from_quality, to_quality, reason, position_ms}.
  • buffer_empty: {duration_ms, position_ms} — rebuffering event.
  • error: {error_code, position_ms, cdn_pop, segment_url}.
  • heartbeat: every 30 seconds while playing — {current_quality, buffer_length_ms, position_ms}.

Events are batched by the client and sent as JSON arrays every 30 seconds to reduce request overhead. The ingestion service writes to Kafka, which feeds a real-time aggregation pipeline (Flink or Spark Streaming) computing per-CDN-POP quality metrics, per-ISP rebuffering rates, and per-device-type error rates. Dashboards on these aggregates allow the infrastructure team to detect CDN issues, bad segment encodes, or DRM license server outages within minutes of onset.
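
The client-side batching can be sketched as a buffer flushed on the 30-second cadence; the transport is injected so the class stays self-contained (a real player would also flush on page unload and cap buffer size, which this sketch omits):

```python
import json

FLUSH_INTERVAL_MS = 30_000

class EventBatcher:
    """Buffer telemetry events and flush them as one JSON array
    roughly every 30 seconds."""

    def __init__(self, send):
        self.send = send         # injected transport, e.g. an HTTP POST
        self.buffer = []
        self.last_flush_ms = 0

    def record(self, event: dict, now_ms: int):
        self.buffer.append(event)
        if now_ms - self.last_flush_ms >= FLUSH_INTERVAL_MS:
            self.flush(now_ms)

    def flush(self, now_ms: int):
        if self.buffer:
            self.send(json.dumps(self.buffer))
            self.buffer = []
        self.last_flush_ms = now_ms
```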

