System Design Interview: Design a Live Streaming Platform (Twitch)
Designing a live streaming platform like Twitch differs significantly from on-demand video (YouTube/Netflix). The key challenges are ultra-low latency ingest, real-time transcoding, ephemeral content delivery, and massive concurrent viewership for popular streams.
Requirements Clarification
Functional Requirements
- Streamers broadcast live video from desktop/mobile
- Viewers watch streams with low latency (<10 seconds)
- Live chat alongside the stream
- Stream discovery: browse by game, category, viewer count
- Stream recording and VOD (video on demand) replay
Non-Functional Requirements
- Streamers: 100K concurrent live streams
- Viewers: 10M concurrent viewers (100 viewers/stream avg, 100K for top streams)
- End-to-end (glass-to-glass) latency: <5 seconds from streamer's camera to viewer's screen
- Availability: 99.99% for ingest; 99.9% for playback
Key Difference: Live vs On-Demand Video
- Content is ephemeral: generated in real-time, cannot be pre-transcoded
- Low latency required: viewers expect near-real-time delivery (chat reactions match stream)
- CDN prefetching not possible: content not known in advance
- Segment sizes smaller: 1-2 seconds vs 4-10 seconds for VOD to reduce latency
- Ingest infrastructure critical: failure = stream goes down for broadcaster
High-Level Architecture
Streamer
| (RTMP/SRT push)
Ingest Edge Server (closest to streamer)
|
Ingest Backend (transcode in real-time)
- FFmpeg: 160p, 360p, 480p, 720p, 1080p
- Output: HLS segments (1-2 sec each)
|
Segment Storage (object store, short TTL)
|
CDN (live origin per stream, pull from storage)
|
Viewers (HLS player, adaptive bitrate)
Video Ingest: RTMP / SRT
Streamers push video using RTMP (Real-Time Messaging Protocol) or SRT (Secure Reliable Transport). RTMP is legacy but still widely supported by streaming software (OBS, Streamlabs); SRT is newer and handles packet loss better on poor networks. The streamer connects to the nearest ingest edge server (via anycast or DNS-based geolocation). The ingest server receives the stream and forwards it to the transcoding cluster.
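The "nearest ingest edge" choice above can be sketched as a distance lookup. This is a minimal illustration of DNS-style geolocation routing; the POP names and coordinates are made up for the example, and real systems would also weigh load and network health, not just geography.

```python
import math

# Hypothetical ingest POP catalogue: name -> (latitude, longitude).
EDGE_POPS = {
    "us-east": (39.0, -77.5),
    "eu-west": (53.3, -6.3),
    "ap-northeast": (35.7, 139.7),
}

def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points in kilometers."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * 6371 * math.asin(math.sqrt(h))

def nearest_ingest_edge(streamer_location):
    """Pick the geographically closest ingest POP for an RTMP/SRT push."""
    return min(EDGE_POPS, key=lambda pop: haversine_km(streamer_location, EDGE_POPS[pop]))
```

A streamer in New York would be routed to `us-east`; one in Tokyo to `ap-northeast`.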
Real-Time Transcoding
Unlike VOD, live video must be transcoded as it arrives. Each incoming stream spawns transcoding workers:
- GPU-accelerated transcoding (NVIDIA NVENC, AMD AMF)
- Segment size: 1-2 seconds for low latency (vs 4-10s for VOD)
- Output: HLS segments + updated .m3u8 manifest with new segment
- Segment stored to fast object store (S3 with short TTL, or local NVMe cache)
Transcoding capacity planning: 100K streams x 5 renditions x ~1 CPU core per rendition = 500K CPU cores. GPU transcoding packs many renditions onto one card, cutting the hardware footprint roughly 10x.
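The capacity estimate above as explicit arithmetic (the 10x GPU speedup is the rough assumption stated in the text, not a measured figure):

```python
# Back-of-envelope transcoding capacity, using the numbers from the text.
streams = 100_000
renditions = 5            # 160p / 360p / 480p / 720p / 1080p ladder
cores_per_rendition = 1   # rough CPU cost of one real-time software encode

cpu_cores = streams * renditions * cores_per_rendition  # 500,000 cores
gpu_speedup = 10          # assumed NVENC/AMF advantage over CPU encoding
gpu_equiv = cpu_cores // gpu_speedup                    # ~50,000 core-equivalents
```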
Low-Latency HLS (LL-HLS)
Standard HLS with 4-second segments yields 15-30 seconds of latency (players buffer 3-5 segments before playback, on top of encode and CDN delivery delay). Low-Latency HLS (Apple LL-HLS) and Low-Latency DASH reduce this to 2-5s:
- Partial segments (0.5-1s chunks within a 2s segment)
- Push delivery (server pushes new chunks as available) vs polling
- Playlist preloading (client fetches next playlist before current expires)
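To make the partial-segment idea concrete, here is a simplified generator for an LL-HLS-style media playlist: completed 2-second segments appear as ordinary entries, while the segment still being encoded is advertised chunk-by-chunk via EXT-X-PART tags. This is illustrative only; a spec-compliant LL-HLS playlist needs additional tags (EXT-X-SERVER-CONTROL, preload hints, etc.), and the file names are invented.

```python
def ll_hls_playlist(seg_index, part_duration=0.5, parts_per_segment=4):
    """Build a simplified LL-HLS media playlist string.

    Segments 0..seg_index-1 are complete; segment seg_index is in progress
    and exposed as partial chunks so players can fetch them early.
    """
    lines = [
        "#EXTM3U",
        "#EXT-X-VERSION:9",
        f"#EXT-X-TARGETDURATION:{int(part_duration * parts_per_segment)}",
        f"#EXT-X-PART-INF:PART-TARGET={part_duration}",
    ]
    # Completed segments (2 seconds each with the defaults).
    for i in range(seg_index):
        lines.append(f"#EXTINF:{part_duration * parts_per_segment},")
        lines.append(f"seg{i}.ts")
    # Partial chunks of the in-progress segment, published as they land.
    for p in range(parts_per_segment):
        lines.append(f'#EXT-X-PART:DURATION={part_duration},URI="seg{seg_index}.part{p}.ts"')
    return "\n".join(lines)
```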
CDN for Live Streams
Live streams have different CDN characteristics than VOD:
- Cannot pre-warm cache (content unknown)
- Every segment is brand new, so the first viewer request at each edge always triggers an origin fill
- For popular streams (100K viewers): CDN edge serves most requests; origin only handles cache fill
- Segment TTL: 30-60 seconds (old segments useless to viewers)
- Use CDN with live-optimized origin shield to prevent origin overload
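The short-TTL behavior described above can be sketched as a toy edge cache: the first viewer of each segment pays the origin fill, everyone else within the TTL window hits the edge, and expired segments are refetched (in practice they would simply be evicted). The class and its interface are invented for illustration.

```python
import time

class SegmentCache:
    """Toy CDN edge cache for live segments.

    Entries expire after `ttl` seconds, reflecting that stale live
    segments are useless to viewers. `clock` is injectable for testing.
    """
    def __init__(self, ttl=45, clock=time.monotonic):
        self.ttl, self.clock = ttl, clock
        self.store = {}          # key -> (data, stored_at)
        self.origin_fills = 0    # how often we had to go to origin

    def get(self, key, fetch_from_origin):
        entry = self.store.get(key)
        if entry and self.clock() - entry[1] < self.ttl:
            return entry[0]                  # edge hit
        self.origin_fills += 1               # first viewer pays the fill
        data = fetch_from_origin(key)
        self.store[key] = (data, self.clock())
        return data
```

An origin shield would sit between many such edges and storage, collapsing their simultaneous fills for the same fresh segment into one origin request.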
Live Chat
Live chat requires real-time bidirectional messaging for thousands of concurrent users per stream:
- WebSocket connections: each viewer holds a WebSocket connection to chat server
- Pub/Sub: stream_id as channel, chat servers subscribe via Redis Pub/Sub or Kafka
- Rate limiting: per-user message rate (1 msg/sec), per-stream aggregate (1000 msg/sec for top streams)
- Moderation: regex filter + ML model for hate speech, run async
- Scale: 100K streams x 100 viewers avg = 10M WebSocket connections
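The chat path above can be sketched as an in-process stand-in: per-stream channels (stream_id to subscriber callbacks) plus the per-user rate limit. In production the subscribers would be WebSocket connections spread across chat servers coordinated via Redis Pub/Sub or Kafka; this single-process class only illustrates the fanout and limiting logic.

```python
import time
from collections import defaultdict

class ChatHub:
    """In-process sketch of the chat fanout path with a per-user rate limit."""
    def __init__(self, per_user_rate=1.0, clock=time.monotonic):
        self.channels = defaultdict(list)   # stream_id -> subscriber callbacks
        self.last_sent = {}                 # user -> timestamp of last accepted msg
        self.min_interval = 1.0 / per_user_rate
        self.clock = clock                  # injectable for testing

    def subscribe(self, stream_id, callback):
        self.channels[stream_id].append(callback)

    def publish(self, stream_id, user, text):
        now = self.clock()
        if now - self.last_sent.get(user, float("-inf")) < self.min_interval:
            return False                    # dropped by the 1 msg/sec limit
        self.last_sent[user] = now
        for cb in self.channels[stream_id]:  # fan out to all viewers
            cb(user, text)
        return True
```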
Stream Discovery
Viewers browse streams by: game/category, viewer count, language, tags. Stream metadata stored in Elasticsearch for full-text search. Viewer counts aggregated in real-time via Flink (viewers connect/disconnect events). Cache top streams per category in Redis (refresh every 30s).
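The viewer-count aggregation delegated to Flink above reduces to folding connect/disconnect events into live counts per stream. A minimal single-process sketch of that reduction (real Flink jobs would do this incrementally over windows, sharded by stream_id):

```python
from collections import Counter

def top_streams(events, k=3):
    """Fold (stream_id, +1/-1) connect/disconnect events into live viewer
    counts and return the k most-watched streams with viewers remaining."""
    counts = Counter()
    for stream_id, delta in events:
        counts[stream_id] += delta
    return [s for s, c in counts.most_common(k) if c > 0]
```

The output of this aggregation is what would be cached in Redis per category and refreshed every 30s.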
VOD Recording
Record stream segments to S3 as they are generated. After stream ends, concatenate segments into full VOD file, trigger transcoding for additional quality levels, generate thumbnail timeline. VOD served via standard CDN (same as YouTube architecture).
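One way to drive the concatenation step is to emit an ffmpeg concat-demuxer list over the recorded segments; a sketch, assuming zero-padded segment names so lexicographic sort matches broadcast order (the file-naming scheme is an assumption, not something the text specifies):

```python
def vod_concat_manifest(segment_names):
    """Emit an ffmpeg concat-demuxer file listing recorded live segments
    in order. Assumes zero-padded names (seg0001.ts, ...) so that
    lexicographic sort equals broadcast order.
    Usable as: ffmpeg -f concat -i list.txt -c copy vod.mp4
    """
    lines = ["# auto-generated after stream end"]
    lines += [f"file '{name}'" for name in sorted(segment_names)]
    return "\n".join(lines)
```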
Interview Tips
- Emphasize live vs VOD differences – this shows deep understanding
- Explain RTMP ingest and real-time transcoding pipeline
- Discuss LL-HLS for low latency and why 1-2s segments matter
- Address chat as a separate real-time system (WebSocket + pub/sub)
- Know CDN cache characteristics for live vs on-demand content
- GPU transcoding for cost-effective live encoding at scale