Live Video Streaming System Low-Level Design

Requirements

Design a live video streaming platform comparable to Twitch or YouTube Live. Key requirements: support millions of concurrent viewers per stream, end-to-end latency under 10 seconds for live content, adaptive quality from 360p up to 1080p and 4K, graceful degradation for viewers on slow connections, and live chat co-located with the stream.

Scale targets: Twitch peaks at ~8 million concurrent viewers, ~30,000 concurrent streams. Each stream generates 6-8 transcoded variants. The CDN must absorb the vast majority of delivery bandwidth – the origin cannot serve viewers directly at this scale.

Video Ingest

The streamer runs broadcasting software (OBS, Streamlabs, or a native app) that encodes video locally and sends a high-bitrate RTMP stream to an ingest server. RTMP (Real-Time Messaging Protocol) is the standard for live ingest because it is low latency and widely supported by encoder software.

Ingest servers are geographically distributed (closest ingest point minimizes the streamer’s upload latency and packet loss). A stream key authenticates the streamer. The ingest server accepts the incoming RTMP connection, validates the stream key, and receives the raw high-bitrate stream (typically 6-15 Mbps for 1080p60). It buffers and segments the incoming stream into 2-6 second chunks and passes those chunks to the transcoding pipeline.
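Stream-key checks can be made stateless if the key embeds a signature the ingest tier can verify without a database lookup. A minimal sketch, assuming a hypothetical HMAC-based key format (`channel_id.signature`) and a shared server secret – a real platform would also support key rotation and revocation:

```python
import hashlib
import hmac

SERVER_SECRET = b"demo-secret"  # hypothetical shared secret, rotated in production

def issue_stream_key(channel_id: str) -> str:
    """Derive a stream key any ingest server can verify statelessly."""
    sig = hmac.new(SERVER_SECRET, channel_id.encode(), hashlib.sha256).hexdigest()[:16]
    return f"{channel_id}.{sig}"

def validate_stream_key(key: str) -> bool:
    """Check the key on RTMP connect, before accepting any video bytes."""
    channel_id, _, sig = key.rpartition(".")
    expected = hmac.new(SERVER_SECRET, channel_id.encode(), hashlib.sha256).hexdigest()[:16]
    return bool(channel_id) and hmac.compare_digest(sig, expected)
```

`hmac.compare_digest` avoids timing side channels when comparing the signature.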

The ingest path is latency-critical for the streamer but not for viewers – the viewer always lags behind live by at least a few segment durations. Keep ingest servers close to streamers, not close to viewers.

Transcoding Pipeline

Each 2-6 second chunk from the ingest server is transcoded into multiple quality tiers in parallel: 360p (400 Kbps), 480p (800 Kbps), 720p (2.5 Mbps), 1080p (5 Mbps), and optionally 4K (15+ Mbps). This is the most compute-intensive part of the system.

Transcoding is GPU-accelerated (NVENC on NVIDIA GPUs, or cloud-based GPU instances). The pipeline scales horizontally – each chunk can be dispatched to a separate transcoding worker. A job queue (SQS, Kafka, or internal) distributes chunks to available workers. Workers output HLS or DASH segments for each quality tier and upload them to object storage or push them to CDN edge caches.
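The chunk-to-tiers fan-out amounts to constructing one transcode job per quality tier and enqueueing them. A minimal Python sketch; the ladder values, paths, and choice of the NVENC encoder (`h264_nvenc`) are illustrative assumptions, not a production configuration:

```python
# Hypothetical bitrate ladder matching the tiers described above.
LADDER = [
    ("360p",  640,  360,  "400k"),
    ("480p",  854,  480,  "800k"),
    ("720p",  1280, 720,  "2500k"),
    ("1080p", 1920, 1080, "5000k"),
]

def transcode_cmd(chunk_path, tier):
    """ffmpeg invocation for one chunk at one quality tier."""
    name, w, h, bitrate = tier
    return [
        "ffmpeg", "-i", chunk_path,
        "-c:v", "h264_nvenc",          # GPU encoder; use libx264 without a GPU
        "-vf", f"scale={w}:{h}",
        "-b:v", bitrate,
        "-c:a", "aac", "-b:a", "128k",
        f"{chunk_path}.{name}.ts",
    ]

# One incoming chunk fans out to one job per tier on the worker queue.
jobs = [transcode_cmd("/chunks/stream7/000123.ts", t) for t in LADDER]
```

Each job is independent, which is what lets the fleet scale horizontally: any worker can pick up any (chunk, tier) pair.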

At 30,000 concurrent streams with 6 quality tiers each, the transcoding fleet is substantial. Cloud autoscaling or a dedicated fleet with capacity headroom handles peak load. Cost is significant – transcoding is typically the largest infrastructure cost for a live streaming platform.

Adaptive Bitrate Streaming (HLS / DASH)

HLS (HTTP Live Streaming) is the dominant delivery format for live video. The transcoding pipeline produces:

  • Segment files (.ts or .mp4 fragments), 2-6 seconds each, one set per quality tier
  • A variant playlist (.m3u8) listing available quality tiers with their bitrates and resolutions
  • A per-quality media playlist (.m3u8) listing the URLs of the most recent segments
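For illustration, a hand-written pair of playlists for this layout (filenames, sequence numbers, and durations made up; in m3u8, lines starting with `#` that are not tags are comments):

```
# Variant playlist (master.m3u8) – one entry per quality tier
#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=2500000,RESOLUTION=1280x720
720p/live.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=5000000,RESOLUTION=1920x1080
1080p/live.m3u8

# Media playlist (720p/live.m3u8) – a sliding window of recent segments
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:4
#EXT-X-MEDIA-SEQUENCE:1042
#EXTINF:4.0,
seg1042.ts
#EXTINF:4.0,
seg1043.ts
#EXTINF:4.0,
seg1044.ts
```

Note there is no `#EXT-X-ENDLIST` tag: its absence is what tells the player the stream is live and the playlist will keep growing.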

The video player fetches the variant playlist, chooses an initial quality tier, then repeatedly fetches the media playlist for that tier to discover new segments and download them before the playback buffer runs out.

ABR logic in the player: monitor download throughput and buffer occupancy. If segments download faster than playback, buffer grows – player can switch up to higher quality. If download falls behind, buffer shrinks – player switches down to avoid stall. The playlist is updated every segment duration (2-6 seconds) with the latest live segments appended and old segments removed.
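The switching rule above can be sketched as a small decision function. The thresholds and safety factor here are illustrative assumptions, not a production ABR algorithm (real players blend throughput estimation, buffer models, and switch-frequency damping):

```python
def choose_tier(tiers_kbps, throughput_kbps, buffer_s,
                low_buffer_s=4.0, safety=0.8):
    """Pick the highest tier the connection can sustain.

    tiers_kbps:      bitrate ladder sorted ascending, e.g. [400, 800, 2500, 5000]
    throughput_kbps: measured download throughput (recent moving average)
    buffer_s:        seconds of video currently buffered
    """
    if buffer_s < low_buffer_s:
        return tiers_kbps[0]           # near a stall: drop to the lowest tier
    budget = throughput_kbps * safety  # headroom for throughput variance
    viable = [t for t in tiers_kbps if t <= budget]
    return viable[-1] if viable else tiers_kbps[0]
```

The safety factor matters: switching up the moment measured throughput exceeds a tier's bitrate causes oscillation, since live throughput estimates are noisy.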

CDN for Delivery

HLS segments are small static files (a few hundred KB each). They are perfect for CDN delivery – cacheable, immutable once created, served over standard HTTP. Each viewer fetches segments from the nearest CDN edge node rather than the origin.

CDN configuration for live HLS: segments are immutable once written, so they can carry a generous TTL and simply age out of the cache after the live window passes. The media playlist changes every segment, so its TTL is short (1-3 seconds) to ensure players see new segments quickly. An origin shield (a mid-tier CDN layer) aggregates cache misses from edge nodes so the origin sees at most one request per segment per PoP rather than one per viewer.
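One way this split could look as edge configuration, sketched in nginx syntax (hostnames and cache-zone name hypothetical; short TTL for the mutable playlist, long TTL for immutable segments):

```nginx
# Illustrative nginx edge cache rules for live HLS
location ~ \.m3u8$ {
    proxy_pass http://origin-shield.internal;
    proxy_cache live;
    proxy_cache_valid 200 1s;   # playlist changes every segment
}
location ~ \.(ts|mp4|m4s)$ {
    proxy_pass http://origin-shield.internal;
    proxy_cache live;
    proxy_cache_valid 200 1h;   # segments are immutable once written
    proxy_cache_lock on;        # collapse concurrent misses into one origin fetch
}
```

`proxy_cache_lock` is the request-collapsing piece: when thousands of viewers miss on a brand-new segment simultaneously, only one request goes upstream.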

For a stream with 100,000 concurrent viewers pulling the 720p tier at 2.5 Mbps each, total bandwidth is 250 Gbps. The CDN handles this. The origin handles only the initial cache miss for each segment (once per PoP per 2-6 seconds), a tiny fraction of total requests.
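The arithmetic, as a quick check (using the 720p tier from the ladder above):

```python
viewers = 100_000
tier_bps = 2_500_000              # 720p tier at 2.5 Mbps
total_gbps = viewers * tier_bps / 1e9
print(total_gbps)                 # 250.0 Gbps of aggregate egress
```

At the 1080p tier (5 Mbps) the same audience doubles this to 500 Gbps, which is why ABR distribution across tiers matters for capacity planning.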

Low Latency Techniques

Standard HLS with 6-second segments produces 20-30 seconds of end-to-end latency (ingest + transcoding + 3 segments in buffer). For interactive live streaming, this is too high.

Low-Latency HLS (LL-HLS): Apple’s extension to HLS. Segments are divided into “parts” of 200ms-1s. Players can download parts before the full segment is complete. The playlist includes hints for the next part so players can pre-fetch. Achieves 2-5 second latency over HTTP.
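A hand-written LL-HLS media playlist fragment illustrating parts and preload hints (filenames and part durations made up):

```
#EXTM3U
#EXT-X-TARGETDURATION:4
#EXT-X-SERVER-CONTROL:CAN-BLOCK-RELOAD=YES,PART-HOLD-BACK=1.0
#EXT-X-PART-INF:PART-TARGET=0.333
#EXTINF:4.0,
seg1042.mp4
#EXT-X-PART:DURATION=0.333,URI="seg1043.part0.mp4"
#EXT-X-PART:DURATION=0.333,URI="seg1043.part1.mp4"
#EXT-X-PRELOAD-HINT:TYPE=PART,URI="seg1043.part2.mp4"
```

The player can fetch and play `seg1043`'s parts while that segment is still being produced, and the preload hint lets it open the request for the next part before it exists (the server blocks until the part is ready).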

CMAF (Common Media Application Format): a container format that enables “chunked transfer” delivery – the server starts sending a segment over an open HTTP connection while it is still being written, rather than waiting for the full segment. Works with both HLS and DASH. Reduces latency by up to one segment duration.

WebRTC: achieves sub-500ms latency but uses peer-to-peer or SFU (selective forwarding unit) architecture that does not scale to millions of viewers per stream. Used for interactive scenarios (video calls, small watch parties) but not for broadcast-scale live streaming.

Chat System Design

Live chat is a fan-out problem: one viewer sends a message that must be delivered to all other viewers of the same stream within a second or two. At 100,000 concurrent viewers, this is 100,000 WebSocket connections that need to receive each message.

Architecture: viewers hold persistent WebSocket connections to chat servers. Chat servers are stateless – any server can accept incoming messages. When a viewer sends a message, the receiving chat server publishes it to a Redis pub/sub channel keyed by stream ID. All chat servers subscribed to that channel (those with viewers watching that stream) receive the message and push it to their connected WebSocket clients.
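The fan-out path can be sketched with an in-memory stand-in for Redis pub/sub. Real chat servers would hold WebSocket connections and a Redis client; here outboxes are plain lists, but the routing logic has the same shape:

```python
from collections import defaultdict

class Broker:
    """In-memory stand-in for Redis pub/sub: channel -> subscriber callbacks."""
    def __init__(self):
        self.subs = defaultdict(list)
    def subscribe(self, channel, callback):
        self.subs[channel].append(callback)
    def publish(self, channel, message):
        for cb in self.subs[channel]:
            cb(message)

class ChatServer:
    """One chat server; an 'outbox' list stands in for a WebSocket connection."""
    def __init__(self, broker):
        self.broker = broker
        self.viewers = defaultdict(list)   # stream_id -> connected outboxes

    def connect(self, stream_id, outbox):
        if stream_id not in self.viewers:  # first local viewer of this stream
            self.broker.subscribe(stream_id, lambda m: self._fan_out(stream_id, m))
        self.viewers[stream_id].append(outbox)

    def send(self, stream_id, message):
        # Publish once; every server with viewers of this stream delivers it.
        self.broker.publish(stream_id, message)

    def _fan_out(self, stream_id, message):
        for outbox in self.viewers[stream_id]:
            outbox.append(message)
```

Note the statelessness: the sender's server does not know (or care) which other servers hold viewers of the stream; the channel keyed by stream ID does the routing.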

For very large streams (100K+ viewers), Redis pub/sub fan-out to chat servers can bottleneck. Partition chat servers by stream and use consistent routing so all viewers of a stream connect to the same subset of servers, reducing pub/sub load.

Moderation: rule-based filters (banned words, regex patterns) applied synchronously before message delivery. ML-based classifiers for hate speech and harassment run asynchronously – flag messages for human review rather than blocking in the hot path. Slow mode rate-limits each user to one message per N seconds, enforced with a Redis key per (user, stream) with a TTL.
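Slow mode reduces to one timestamp per (user, stream). A sketch with an in-process dict standing in for the Redis key with a TTL; the injectable clock is just to make the logic testable:

```python
import time

class SlowMode:
    """Per-(user, stream) rate limit. In production the timestamp would live
    in a Redis key with a TTL so it expires automatically; here it is a dict."""
    def __init__(self, interval_s=10.0, clock=time.monotonic):
        self.interval = interval_s
        self.clock = clock
        self.last_sent = {}

    def allow(self, user_id, stream_id):
        key = (user_id, stream_id)
        now = self.clock()
        last = self.last_sent.get(key)
        if last is not None and now - last < self.interval:
            return False               # still inside the slow-mode window
        self.last_sent[key] = now
        return True
```

With Redis the equivalent is a single `SET key 1 NX EX n`: the write succeeds only if no unexpired key exists, which gives the check-and-set atomically across chat servers.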

Storage and VOD

During a live stream, transcoded segments are stored in object storage (S3) as they are produced. When the stream ends, the platform stitches the segments into full video files and creates a VOD (video on demand) asset. Additional transcoding runs may produce formats not generated during the live stream (for example, higher quality encodes that take longer than real-time).

Thumbnail generation runs as an async job: sample frames from the segment files, run through a thumbnail selection model (sharpness, face detection, text avoidance), store thumbnails in object storage. VOD files are served through the same CDN as live segments.

Storage costs are significant: 1 hour of 1080p at 5 Mbps is 2.25 GB for that tier alone; treating all 6 tiers as similar gives roughly 13 GB per stream-hour (an overestimate, since lower tiers are cheaper). Sustained at 30,000 concurrent streams around the clock, that is ~720,000 stream-hours per day – on the order of 10 PB of daily storage growth. A lifecycle policy moves older VODs to cheaper storage tiers (S3 Glacier) and deletes them after a retention period unless the creator opts into permanent storage.
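The arithmetic, made explicit. Treating every tier as 1080p-sized is a deliberate simplification that overestimates the low tiers; the concurrency assumption (30,000 streams live around the clock) is what drives the total:

```python
# Back-of-envelope storage growth, using the figures above.
bitrate_bps = 5e6                      # 1080p tier
gb_per_tier_hour = bitrate_bps * 3600 / 8 / 1e9
assert gb_per_tier_hour == 2.25        # GB per quality tier per hour

tiers = 6
gb_per_stream_hour = gb_per_tier_hour * tiers      # 13.5 GB/stream-hour
stream_hours_per_day = 30_000 * 24                 # 30k concurrent, all day
pb_per_day = gb_per_stream_hour * stream_hours_per_day / 1e6
print(round(pb_per_day, 1))            # 9.7 PB/day before lifecycle policies
```

The assumption matters: reading "30,000" as a daily stream count instead (30,000 two-hour streams per day) gives only ~780 TB/day, more than an order of magnitude less.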

Scale Numbers

Reference numbers for interview context:

  • Twitch peak: ~8 million concurrent viewers, ~30,000 concurrent streams
  • Each stream: 6-8 transcoded quality variants produced in real time
  • Segment cadence: one new HLS segment every 2-6 seconds per stream per quality tier – on the order of 1-3M segment files created per minute across all streams
  • CDN handles 99%+ of viewer bandwidth; origin sees only cache-miss traffic
  • Ingest bandwidth per stream: 6-15 Mbps (streamer upload)
  • Aggregate viewer bandwidth: tens of Tbps across the CDN at peak
  • Chat at scale: a single popular stream can generate 50,000+ messages per minute

