Requirements
Design a live video streaming platform comparable to Twitch or YouTube Live. Key requirements: support millions of concurrent viewers per stream, end-to-end latency under 10 seconds for live content, adaptive quality from 360p up to 1080p and 4K, graceful degradation for viewers on slow connections, and live chat co-located with the stream.
Scale targets: Twitch peaks at ~8 million concurrent viewers, ~30,000 concurrent streams. Each stream generates 6-8 transcoded variants. The CDN must absorb the vast majority of delivery bandwidth – the origin cannot serve viewers directly at this scale.
Video Ingest
The streamer runs broadcasting software (OBS, Streamlabs, or a native app) that encodes video locally and sends a high-bitrate RTMP stream to an ingest server. RTMP (Real-Time Messaging Protocol) is the de facto standard for live ingest: its latency is low (a few seconds) and virtually every encoder supports it.
Ingest servers are geographically distributed (closest ingest point minimizes the streamer’s upload latency and packet loss). A stream key authenticates the streamer. The ingest server accepts the incoming RTMP connection, validates the stream key, and receives the raw high-bitrate stream (typically 6-15 Mbps for 1080p60). It buffers and segments the incoming stream into 2-6 second chunks and passes those chunks to the transcoding pipeline.
The ingest path is latency-critical for the streamer but not for viewers – the viewer always lags behind live by at least a few segment durations. Keep ingest servers close to streamers, not close to viewers.
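The buffer-and-segment step can be sketched as follows (a minimal sketch; the 4-second target, function names, and frame representation are assumptions, not any real ingest server's API):

```python
# Sketch: cut an incoming frame stream into fixed-duration chunks.
# Frame timestamps are in seconds; SEGMENT_SECONDS is an assumed target.

SEGMENT_SECONDS = 4.0

def segment_frames(frames):
    """Group (timestamp, payload) frames into ~SEGMENT_SECONDS chunks.

    Yields lists of frames; each chunk starts at a multiple of the
    segment duration so downstream transcoders get aligned boundaries.
    """
    chunk, chunk_index = [], 0
    for ts, payload in frames:
        if ts >= (chunk_index + 1) * SEGMENT_SECONDS and chunk:
            yield chunk
            chunk, chunk_index = [], int(ts // SEGMENT_SECONDS)
        chunk.append((ts, payload))
    if chunk:
        yield chunk

# 60fps frames for 9 seconds -> three chunks (4s + 4s + 1s remainder)
frames = [(i / 60, b"frame") for i in range(9 * 60)]
chunks = list(segment_frames(frames))
```

A real ingest server works on encoded video packets and aligns chunk boundaries to keyframes so each chunk is independently decodable; the timestamp bookkeeping is the same idea.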
Transcoding Pipeline
Each 2-6 second chunk from the ingest server is transcoded into multiple quality tiers in parallel: 360p (400 Kbps), 480p (800 Kbps), 720p (2.5 Mbps), 1080p (5 Mbps), and optionally 4K (15+ Mbps). This is the most compute-intensive part of the system.
Transcoding is GPU-accelerated (NVENC on NVIDIA GPUs, or cloud-based GPU instances). The pipeline scales horizontally – each chunk can be dispatched to a separate transcoding worker. A job queue (SQS, Kafka, or internal) distributes chunks to available workers. Workers output HLS or DASH segments for each quality tier and upload them to object storage or push them to CDN edge caches.
At 30,000 concurrent streams with 6 quality tiers each, the transcoding fleet is substantial. Cloud autoscaling or a dedicated fleet with capacity headroom handles peak load. Cost is significant – transcoding is typically the largest infrastructure cost for a live streaming platform.
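The fan-out of one chunk into per-tier jobs can be sketched as below (the tier ladder mirrors the text; the job shape, field names, and output key layout are assumptions standing in for real SQS/Kafka payloads):

```python
# Sketch: fan one ingest chunk out into per-tier transcode jobs.

TIERS = [
    ("360p", 400), ("480p", 800), ("720p", 2500),
    ("1080p", 5000),  # optional 4K tier omitted here
]

def make_jobs(stream_id, chunk_index):
    """One transcode job per quality tier for a single chunk."""
    return [
        {
            "stream": stream_id,
            "chunk": chunk_index,
            "rendition": name,
            "bitrate_kbps": kbps,
            # Workers write the segment here, then upload to object storage.
            "output_key": f"{stream_id}/{name}/{chunk_index:08d}.ts",
        }
        for name, kbps in TIERS
    ]

jobs = make_jobs("stream-42", 7)
```

Because each job is independent, any idle worker can pick it up; the chunk index in the output key lets playlists be assembled without coordination between workers.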
Adaptive Bitrate Streaming (HLS / DASH)
HLS (HTTP Live Streaming) is the dominant delivery format for live video. The transcoding pipeline produces:
- Segment files (.ts or .mp4 fragments), 2-6 seconds each, one set per quality tier
- A master playlist (.m3u8) listing the available variant streams with their bitrates and resolutions
- A per-quality media playlist (.m3u8) listing the URLs of the most recent segments
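The two playlist types can be sketched as plain text generation (a simplified illustration of the tags involved, not a spec-complete implementation; function names and URIs are assumptions):

```python
def master_playlist(tiers):
    """Master playlist: one EXT-X-STREAM-INF entry per quality tier."""
    lines = ["#EXTM3U"]
    for name, bandwidth, resolution in tiers:
        lines.append(f"#EXT-X-STREAM-INF:BANDWIDTH={bandwidth},RESOLUTION={resolution}")
        lines.append(f"{name}/playlist.m3u8")
    return "\n".join(lines)

def media_playlist(seq, segment_names, target_duration=4):
    """Live media playlist: a sliding window of the newest segments."""
    lines = [
        "#EXTM3U",
        f"#EXT-X-TARGETDURATION:{target_duration}",
        f"#EXT-X-MEDIA-SEQUENCE:{seq}",  # sequence number of the first segment
    ]
    for name in segment_names:
        lines.append(f"#EXTINF:{target_duration:.1f},")
        lines.append(name)
    return "\n".join(lines)  # no EXT-X-ENDLIST while the stream is live

m = master_playlist([("720p", 2500000, "1280x720"), ("1080p", 5000000, "1920x1080")])
p = media_playlist(120, ["seg00120.ts", "seg00121.ts", "seg00122.ts"])
```

The absence of `EXT-X-ENDLIST` is what tells the player the stream is live and it should keep polling; `EXT-X-MEDIA-SEQUENCE` advances as old segments drop out of the window.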
The video player fetches the master playlist, chooses an initial quality tier, then repeatedly refetches the media playlist for that tier to discover new segments and download them before the playback buffer runs out.
ABR logic in the player: monitor download throughput and buffer occupancy. If segments download faster than real time, the buffer grows and the player can switch up to a higher tier. If downloads fall behind, the buffer shrinks and the player switches down to avoid a stall. The media playlist is refreshed every segment duration (2-6 seconds), with the newest live segments appended and old segments dropped.
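The switching heuristic can be sketched as below (all thresholds here are illustrative assumptions; production players use more sophisticated throughput estimators):

```python
# Sketch of buffer-and-throughput ABR: switch up only when measured
# throughput comfortably exceeds the next tier's bitrate AND the buffer
# is healthy; switch down when the buffer runs thin.

LADDER_KBPS = [400, 800, 2500, 5000]  # 360p..1080p, matching the text

def next_tier(current, throughput_kbps, buffer_s,
              low_buffer_s=4.0, high_buffer_s=12.0, headroom=1.5):
    if buffer_s < low_buffer_s and current > 0:
        return current - 1                      # about to stall: step down
    if (current + 1 < len(LADDER_KBPS)
            and buffer_s > high_buffer_s
            and throughput_kbps > headroom * LADDER_KBPS[current + 1]):
        return current + 1                      # safe to step up
    return current
```

The asymmetry is deliberate: stepping down is cheap (a brief quality dip) while stepping up too eagerly risks a stall, so upswitches demand both headroom and a full buffer.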
CDN for Delivery
HLS segments are small static files (a few hundred KB each). They are perfect for CDN delivery – cacheable, immutable once created, served over standard HTTP. Each viewer fetches segments from the nearest CDN edge node rather than the origin.
CDN configuration for live HLS: segments are immutable once written, so they can carry a long TTL – when a segment falls out of the media playlist, players simply stop requesting it. The media playlist itself needs a short TTL (1-3 seconds) so players see new segments quickly. An origin shield (a mid-tier CDN layer) aggregates cache misses from edge nodes so the origin sees on the order of one request per segment rather than one per edge PoP or per viewer.
For a stream with 100,000 concurrent viewers streaming at 2 Mbps each, total delivery bandwidth is 200 Gbps. The CDN absorbs this. The origin handles only the initial cache miss for each segment (once per PoP per 2-6 seconds), a tiny fraction of total requests.
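The back-of-envelope math, assuming a 50-PoP CDN and 4-second segments (both illustrative numbers):

```python
# Edge vs origin load for one popular stream.
viewers = 100_000
bitrate_mbps = 2.0                     # per-viewer average (roughly 720p)
edge_gbps = viewers * bitrate_mbps / 1000   # 200.0 Gbps, all served at edges

pops = 50                              # assumed number of CDN PoPs
segment_s = 4
edge_reqs_per_s = viewers / segment_s       # ~25,000 segment requests/s
origin_reqs_per_s = pops / segment_s        # one miss per segment per PoP
```

The edge absorbs four orders of magnitude more request volume than the origin, which is exactly why live HLS scales: the origin's load depends on PoP count and segment cadence, not viewer count.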
Low Latency Techniques
Standard HLS with 6-second segments produces 20-30 seconds of end-to-end latency (ingest + transcoding + 3 segments in buffer). For interactive live streaming, this is too high.
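That figure decomposes into a rough latency budget (the individual numbers are illustrative assumptions, not measurements):

```python
# Why standard HLS with 6-second segments lands near 20-30 s glass-to-glass.
segment_s = 6
budget = {
    "ingest + encode": 2,               # streamer side, before the CDN
    "transcode": segment_s,             # at least one full segment behind live
    "playlist discovery": segment_s / 2,  # average wait for a playlist refresh
    "player buffer": 3 * segment_s,     # HLS players buffer ~3 segments
}
total_s = sum(budget.values())          # 29.0 seconds
```

The player buffer dominates, which is why the low-latency techniques below all attack segment granularity rather than transcoding speed.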
Low-Latency HLS (LL-HLS): Apple’s extension to HLS. Segments are divided into “parts” of 200ms-1s. Players can download parts before the full segment is complete. The playlist includes hints for the next part so players can pre-fetch. Achieves 2-5 second latency over HTTP.
CMAF (Common Media Application Format): a container format that enables “chunked transfer” delivery – the server starts sending the segment over an open HTTP connection as it is being written, rather than waiting for the full segment. Works with both HLS and DASH. Reduces latency by the segment duration.
WebRTC: achieves sub-500ms latency but uses peer-to-peer or SFU (selective forwarding unit) architecture that does not scale to millions of viewers per stream. Used for interactive scenarios (video calls, small watch parties) but not for broadcast-scale live streaming.
Chat System Design
Live chat is a fan-out problem: one viewer sends a message that must be delivered to all other viewers of the same stream within a second or two. At 100,000 concurrent viewers, this is 100,000 WebSocket connections that need to receive each message.
Architecture: viewers hold persistent WebSocket connections to chat servers. Chat servers are stateless – any server can accept incoming messages. When a viewer sends a message, the receiving chat server publishes it to a Redis pub/sub channel keyed by stream ID. All chat servers subscribed to that channel (those with viewers watching that stream) receive the message and push it to their connected WebSocket clients.
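The fan-out path can be sketched in-process (an in-memory stand-in for Redis pub/sub; the class and method names are assumptions):

```python
from collections import defaultdict

class ChatHub:
    """One 'chat server': routes each message to every local subscriber
    of the same stream. A real deployment replaces the in-process dict
    with a Redis pub/sub channel keyed by stream ID."""

    def __init__(self):
        self.subscribers = defaultdict(list)  # stream_id -> [inbox, ...]

    def subscribe(self, stream_id):
        inbox = []            # stands in for a WebSocket send queue
        self.subscribers[stream_id].append(inbox)
        return inbox

    def publish(self, stream_id, user, text):
        for inbox in self.subscribers[stream_id]:
            inbox.append((user, text))

hub = ChatHub()
a = hub.subscribe("stream-1")
b = hub.subscribe("stream-1")
c = hub.subscribe("stream-2")   # different stream: must not receive
hub.publish("stream-1", "alice", "hello chat")
```

Channel-per-stream is the key property: a message fans out only to servers (and sockets) watching that stream, so idle streams cost nothing.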
For very large streams (100K+ viewers), Redis pub/sub fan-out to chat servers can bottleneck. Partition chat servers by stream and use consistent routing so all viewers of a stream connect to the same subset of servers, reducing pub/sub load.
Moderation: rule-based filters (banned words, regex patterns) applied synchronously before message delivery. ML-based classifiers for hate speech and harassment run asynchronously – flag messages for human review rather than blocking in the hot path. Slow mode rate-limits each user to one message per N seconds, enforced with a Redis key per (user, stream) with a TTL.
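The slow-mode check can be sketched as follows (an in-memory stand-in for the Redis `SET key value NX EX n` pattern; names and the 10-second default are assumptions):

```python
import time

class SlowMode:
    """One message per interval_s seconds per (user, stream)."""

    def __init__(self, interval_s=10):
        self.interval_s = interval_s
        self.expiry = {}  # (user, stream) -> unix time when allowed again

    def allow(self, user, stream, now=None):
        now = time.time() if now is None else now
        key = (user, stream)
        if self.expiry.get(key, 0) > now:
            return False                 # still inside the cooldown window
        self.expiry[key] = now + self.interval_s
        return True

sm = SlowMode(interval_s=10)
```

With Redis, the dict lookup and write collapse into a single atomic `SET ... NX EX` call, and the TTL handles cleanup so the key count stays bounded by active chatters.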
Storage and VOD
During a live stream, transcoded segments are stored in object storage (S3) as they are produced. When the stream ends, the platform stitches the segments into full video files and creates a VOD (video on demand) asset. Additional transcoding runs may produce formats not generated during the live stream (for example, higher quality encodes that take longer than real-time).
Thumbnail generation runs as an async job: sample frames from the segment files, run through a thumbnail selection model (sharpness, face detection, text avoidance), store thumbnails in object storage. VOD files are served through the same CDN as live segments.
Storage costs are significant: 1 hour of 1080p live stream at 5 Mbps = 2.25 GB per quality tier. With 6 tiers, ~13 GB per stream-hour (an upper bound, since the lower tiers cost less). At 60,000 stream-hours per day – for example, 30,000 streams averaging 2 hours each – daily storage growth is ~780 TB. A lifecycle policy moves older VODs to cheaper storage tiers (S3 Glacier) and deletes them after a retention period unless the creator opts into permanent storage.
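The storage arithmetic, made explicit (figures mirror the text; treating every tier at the 1080p rate is a deliberate upper bound):

```python
# Back-of-envelope daily storage growth for VOD retention.
bitrate_mbps = 5                                     # 1080p tier
gb_per_tier_hour = bitrate_mbps * 3600 / 8 / 1000    # 2.25 GB per tier-hour
tiers = 6
gb_per_stream_hour = gb_per_tier_hour * tiers        # 13.5 GB (upper bound;
                                                     # lower tiers are cheaper)
streams_per_day = 30_000
hours_per_stream = 2
daily_tb = gb_per_stream_hour * streams_per_day * hours_per_stream / 1000
# ~810 TB/day with the 13.5 GB upper bound (~780 TB with the text's ~13 GB)
```

At this growth rate the lifecycle policy is not optional: a year of hot-tier retention would sit near 300 PB before tiering or deletion.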
Scale Numbers
Reference numbers for interview context:
- Twitch peak: ~8 million concurrent viewers, ~30,000 concurrent streams
- Each stream: 6-8 transcoded quality variants produced in real time
- Segment cadence: one new HLS segment every 2-6 seconds per stream per quality = ~1.5M segment files created per minute across all streams
- CDN handles 99%+ of viewer bandwidth; origin sees only cache-miss traffic
- Ingest bandwidth per stream: 6-15 Mbps (streamer upload)
- Aggregate viewer bandwidth: tens of Tbps across the CDN at peak
- Chat at scale: a single popular stream can generate 50,000+ messages per minute