Designing a live streaming platform is fundamentally different from designing video-on-demand (YouTube). The challenge is latency — live content must be delivered with seconds of delay, not minutes, while simultaneously supporting millions of concurrent viewers. Twitch, YouTube Live, and Facebook Live all solve this problem with variations of the same architecture.
Key Differences: Live vs VOD
| Dimension | Live Streaming | VOD (YouTube) |
|---|---|---|
| Content source | Real-time encoder (streamer) | Pre-uploaded file |
| Latency requirement | 2-30 seconds (LL-HLS: 2-5s) | N/A (viewer controls playback) |
| CDN caching | Segments expire in seconds | Segments cached for hours/days |
| Scalability spike | Sudden: 0 → 1M viewers in seconds | Gradual: search/recommendation driven |
| Storage | Grows in real time; VOD archive after | Fixed at upload time |
High-Level Architecture
INGESTION (1 stream from streamer):
[OBS / Streaming Software] → RTMP → [Edge Ingest Server (nearest PoP)]
↓
[Ingest Processing Cluster]
- Transcode to HLS
- Multiple bitrates (1080p, 720p, 480p, 360p)
- Generate .m3u8 playlist + .ts segments
DISTRIBUTION (1M viewers):
[HLS Segments stored in S3 + CDN origin]
↓ (CDN pull on first request)
[CDN Edge Nodes (CloudFront, Fastly, Akamai)]
↓
[Viewer browsers/apps — fetch new segments every 2-10s]
Video Ingestion: RTMP Protocol
Streamers use Open Broadcaster Software (OBS) or streaming apps that push video via RTMP (Real-Time Messaging Protocol) to the nearest ingest point-of-presence (PoP). RTMP is a TCP-based protocol designed for low-latency push streaming.
Ingest flow:
1. Streamer pushes RTMP stream to closest ingest server
URL: rtmp://live.twitch.tv/app/{stream_key}
2. Ingest server authenticates stream_key against auth service
3. Ingest server relays stream via private network to
transcoding cluster in primary datacenter
4. Transcoding cluster:
- Decodes incoming video (H.264 or H.265)
- Re-encodes at 4-5 bitrate levels:
1080p60: 6 Mbps
720p60: 4.5 Mbps
480p30: 1.5 Mbps
360p30: 0.8 Mbps
160p30: 0.3 Mbps
- Segments each stream into 2-6 second .ts chunks
- Generates HLS master playlist (.m3u8)
- Uploads chunks to S3 (origin) continuously
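The transcoding step above can be sketched as an ffmpeg invocation built in Python. This is a hedged sketch, not Twitch's actual pipeline: the function name and output layout are made up for illustration, the ladder mirrors the bitrates listed above, and audio handling, keyframe alignment, and the master playlist are omitted. The ffmpeg flags (`-hls_time`, `-hls_list_size`, `-hls_flags delete_segments`) are standard HLS muxer options.

```python
def build_hls_transcode_cmd(input_url: str, out_dir: str) -> list[str]:
    """Build an ffmpeg command producing one live HLS rendition per ladder rung.

    Sketch only: assumes the per-rendition output directories already exist,
    and omits audio, keyframe alignment, and master-playlist generation.
    """
    # (height, fps, video bitrate) — mirrors the ladder in the text
    ladder = [
        (1080, 60, "6000k"),
        (720, 60, "4500k"),
        (480, 30, "1500k"),
        (360, 30, "800k"),
        (160, 30, "300k"),
    ]
    cmd = ["ffmpeg", "-i", input_url]
    for height, fps, bitrate in ladder:
        cmd += [
            "-map", "0:v:0",                      # take the incoming video track
            "-vf", f"scale=-2:{height},fps={fps}",  # rescale and set frame rate
            "-c:v", "libx264", "-b:v", bitrate,
            "-f", "hls",
            "-hls_time", "2",                     # 2-second segments
            "-hls_list_size", "5",                # sliding window of 5 segments
            "-hls_flags", "delete_segments",      # drop segments that leave the window
            f"{out_dir}/{height}p{fps}/index.m3u8",
        ]
    return cmd
```

Each rendition gets its own per-output options block before its output path, which is how ffmpeg maps one input to several independently encoded HLS outputs.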
HLS (HTTP Live Streaming)
HLS works by breaking the stream into small video segments that viewers download sequentially. The master playlist tells the player which quality levels are available; the media playlist for each quality level lists the segment files.
# Master playlist (m3u8) — quality selection
#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=6000000,RESOLUTION=1920x1080
https://cdn.twitch.tv/stream/abc123/1080p60/index.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=4500000,RESOLUTION=1280x720
https://cdn.twitch.tv/stream/abc123/720p60/index.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=640x360
https://cdn.twitch.tv/stream/abc123/360p30/index.m3u8
# Media playlist (updates every ~2s with a new segment)
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:2
#EXT-X-MEDIA-SEQUENCE:1
#EXTINF:2.0,
segment_001.ts  (oldest in the window; already in CDN cache)
#EXTINF:2.0,
segment_002.ts
#EXTINF:2.0,
segment_003.ts  (most recent; may not be in CDN cache yet)
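A live media playlist is a sliding window: each time the transcoder finishes a segment, it appends the new entry, drops the oldest, bumps the media sequence number, and rewrites the playlist object on the origin. A minimal sketch of that writer, with a hypothetical function name:

```python
def render_live_playlist(seq: int, segments: list[tuple[str, float]],
                         target_duration: int = 2) -> str:
    """Render an HLS media playlist for the current sliding window.

    seq is the media sequence number of the first listed segment. It increases
    as old segments fall out of the window, which is how players keep their
    position in a live playlist that has no #EXT-X-ENDLIST.
    """
    lines = [
        "#EXTM3U",
        "#EXT-X-VERSION:3",
        f"#EXT-X-TARGETDURATION:{target_duration}",
        f"#EXT-X-MEDIA-SEQUENCE:{seq}",
    ]
    for name, duration in segments:
        lines.append(f"#EXTINF:{duration:.1f},")
        lines.append(name)
    return "\n".join(lines) + "\n"
```

Uploading the rendered string to S3 under a fixed key overwrites the previous playlist, so viewers polling the same URL always see the latest window.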
Live Latency vs Buffer Trade-off
| Mode | Latency | Buffering Risk | Use Case |
|---|---|---|---|
| Normal HLS | 15-30 seconds | Low | Most live streams |
| Low-latency HLS (LL-HLS) | 2-5 seconds | Medium | Sports, gaming, interactive |
| WebRTC | < 500ms | High | Two-way communication, small audiences |
CDN Scaling for Live Streams
A popular Twitch stream might go from 0 to 500,000 viewers in minutes. Each viewer fetches a new segment every 2 seconds. That is 250,000 requests per second for one stream — CDN caching is essential.
CDN strategy for live segments:
- Segment cache TTL: 2-4x segment duration (e.g., 6 seconds for 2s segments)
- After TTL: CDN fetches new segment from origin (S3)
- Origin shield: one regional CDN node fetches from S3, all edge nodes in
that region serve from the shield — reduces origin load by 99%
Cache hit rate for popular streams: >99%
(500,000 viewers share 1 cache, not 500,000 origin requests)
For unpopular streams: origin can serve directly (few viewers)
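The back-of-envelope math behind "origin shield reduces origin load by 99%" can be written down directly. A hedged sketch, with illustrative parameters (the shield cap assumes at most one origin fetch per segment per regional PoP):

```python
def origin_requests_per_sec(viewers: int, segment_seconds: float,
                            cache_hit_rate: float, edge_pops: int) -> float:
    """Estimate origin (S3) request rate behind a CDN with an origin shield.

    Each viewer requests one segment per segment duration. Cache hits are
    absorbed at the edge; with an origin shield, regional misses collapse
    to at most one origin fetch per segment per PoP.
    """
    viewer_rps = viewers / segment_seconds            # requests hitting the edge
    edge_miss_rps = viewer_rps * (1 - cache_hit_rate) # requests that miss the edge
    shield_cap = edge_pops / segment_seconds          # one fetch per segment per PoP
    return min(edge_miss_rps, shield_cap)
```

For 500,000 viewers on 2-second segments with a 99% edge hit rate and 10 regional PoPs, the shield cap dominates: the origin sees about 5 requests per second, not 250,000.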
Chat System at Scale
Twitch chat is a separate system from the video stream. Popular channels have 100,000+ concurrent chat users and messages arrive at thousands per second.
Architecture:
[Viewer browser] ←→ WebSocket ←→ [Chat Server]
↓
[Message Fan-out]
(Redis pub/sub or Kafka)
↓
[All connected Chat Servers
for this channel]
↓
[Each pushes to connected viewers]
Chat message flow (simplified):
1. Viewer sends chat message via WebSocket
2. Chat server validates (authentication, rate limit, ban check)
3. Message published to Redis pub/sub channel: chat:{channel_id}
4. All chat servers subscribed to that channel receive it
5. Each server broadcasts to its connected viewers
At 100,000 concurrent viewers: ~100 chat servers, each handles ~1,000 connections
Redis pub/sub handles fan-out — 1 publish → 100 subscribers
Each subscriber pushes to 1,000 WebSocket connections
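The fan-out shape above can be modeled in a few lines. This sketch uses an in-process stand-in for Redis pub/sub (so it runs without a Redis server) and represents each viewer's WebSocket as a plain list; the class names are hypothetical.

```python
from collections import defaultdict


class PubSub:
    """In-process stand-in for Redis pub/sub, just to show the fan-out shape."""

    def __init__(self):
        self.subscribers = defaultdict(list)  # channel -> [callback]

    def subscribe(self, channel: str, callback) -> None:
        self.subscribers[channel].append(callback)

    def publish(self, channel: str, message: str) -> int:
        # One publish fans out to every subscribed chat server
        for callback in self.subscribers[channel]:
            callback(message)
        return len(self.subscribers[channel])


class ChatServer:
    """Holds the connections for a subset of viewers (modeled as lists)."""

    def __init__(self, pubsub: PubSub, channel_id: str):
        self.inboxes: list[list[str]] = []  # one inbox per connected viewer
        pubsub.subscribe(f"chat:{channel_id}", self.broadcast)

    def connect_viewer(self) -> list[str]:
        inbox: list[str] = []
        self.inboxes.append(inbox)
        return inbox

    def broadcast(self, message: str) -> None:
        # In production: push over each viewer's WebSocket
        for inbox in self.inboxes:
            inbox.append(message)
```

One `publish` reaches every server subscribed to `chat:{channel_id}`, and each server then delivers to its own connections, exactly the two-hop fan-out described above.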
Adaptive Bitrate Streaming (ABR)
The player automatically switches quality levels based on available bandwidth. The ABR algorithm tracks download speed for recent segments and adjusts quality to maximize quality without stalling.
class ABRController:
    """Picks a quality level from a sliding window of recent download speeds."""

    def __init__(self, quality_levels: list[tuple[int, int]]):
        # quality_levels: [(bitrate_kbps, height_p), ...]
        self.levels = sorted(quality_levels)        # ascending by bitrate
        self.current_level = len(self.levels) // 2  # start at middle quality
        self.download_speeds: list[float] = []      # sliding window of recent speeds

    def update_speed(self, bytes_downloaded: int, time_seconds: float) -> None:
        if time_seconds <= 0:
            return  # ignore degenerate measurements
        speed_kbps = (bytes_downloaded * 8) / (time_seconds * 1000)
        self.download_speeds.append(speed_kbps)
        if len(self.download_speeds) > 5:
            self.download_speeds.pop(0)

    def choose_quality(self) -> int:
        if not self.download_speeds:
            return self.current_level
        avg_speed = sum(self.download_speeds) / len(self.download_speeds)
        # Use 80% of estimated bandwidth to leave headroom
        available = avg_speed * 0.8
        # Find the highest quality that fits in the available bandwidth
        best = 0
        for i, (bitrate, _) in enumerate(self.levels):
            if bitrate <= available:
                best = i
        # Hysteresis: upgrade at most 1 level per decision to avoid flapping;
        # downgrades apply immediately so the player never risks a stall
        self.current_level = min(best, self.current_level + 1)
        return self.current_level
Stream Recording and VOD Archive
Live stream ends:
1. Transcoder signals end-of-stream
2. All segments already uploaded to S3
3. Concatenation job merges segments into full video files
4. Generates thumbnail, chapter markers
5. Creates permanent VOD entry in database
6. CDN cache for old segments extended from 6s TTL to 24h TTL
7. Available as VOD within ~2 minutes of stream ending
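Step 3, the concatenation job, is often just a remux. A hedged sketch of the manifest builder for ffmpeg's concat demuxer (the function name is made up; `-c copy` copies streams without re-encoding, so merging hours of segments takes seconds):

```python
def ffmpeg_concat_manifest(segment_names: list[str]) -> str:
    """Build the input file for ffmpeg's concat demuxer to merge .ts segments.

    The VOD job would then run something like:
      ffmpeg -f concat -safe 0 -i manifest.txt -c copy vod.mp4
    """
    # Segment names sort lexicographically into playback order
    return "".join(f"file '{name}'\n" for name in sorted(segment_names))
```

Because the segments were already encoded at upload time, the VOD is mostly a metadata operation, which is why it can be available within minutes of the stream ending.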
Interview Discussion Points
- How do you handle a streamer losing internet mid-stream? Ingest server holds a buffer; if connection resumes within 5 seconds, stream continues. If gap is longer, stream ends and VOD is finalized
- How do you scale the transcoding cluster for 10,000 simultaneous streams? Auto-scaling EC2/GKE spot instances triggered by stream start events; pre-warm capacity during known peak hours (US evening)
- How do you prevent stream key sharing/piracy? The auth service validates the stream key on connect and flags anomalies such as simultaneous connections from different IPs; repeated auth failures block the stream; keys can be rotated or revoked; playback can be protected with signed, expiring segment URLs; viewers can report stolen streams
- What is the difference between Twitch and YouTube Live architecturally? Both use RTMP ingest + HLS distribution. Twitch optimizes for low latency (a few seconds) so chat stays interactive during gameplay; YouTube Live optimizes for massive global audiences and tolerates higher latency in exchange for deeper buffering and CDN reach
Frequently Asked Questions
What is HLS and how does it enable live streaming?
HLS (HTTP Live Streaming) works by breaking a continuous video stream into small segment files (typically 2-6 seconds) uploaded to an HTTP server or CDN. A manifest file (m3u8 playlist) lists the available segments and is updated continuously as new segments are generated. Viewers download the playlist, then fetch segments in order, buffering a few seconds ahead. This design is ideal for CDN delivery because standard HTTP caching can be applied to segments. The tradeoff is latency — normal HLS adds 15-30 seconds of delay. Low-Latency HLS (LL-HLS) reduces this to 2-5 seconds by using partial segments and optimistic loading. WebRTC achieves under 500ms but does not scale to millions of viewers.
How does adaptive bitrate streaming work in a live platform like Twitch?
The transcoding cluster encodes the live stream at multiple quality levels simultaneously (1080p, 720p, 480p, 360p) and generates separate HLS playlists for each. The master playlist lists all available qualities. The video player monitors download speed for each segment: if a 2-second segment downloads in 0.5 seconds, the player has 3x headroom and can upgrade quality. If a segment takes longer to download than its duration, the player must reduce quality to avoid buffering. The ABR algorithm (Adaptive Bitrate) makes these decisions — choosing the highest quality level that fits in the available bandwidth with a safety margin, applying hysteresis to avoid flapping between qualities.
How do you scale live streaming chat to 100,000 concurrent viewers?
Live chat is a fan-out problem: one message must be delivered to 100,000 viewers simultaneously. The architecture uses multiple chat servers, each holding WebSocket connections to a subset of viewers. When a viewer sends a message, their chat server publishes it to a Redis pub/sub channel for that stream. All chat servers subscribe to that channel and receive every message. Each server then pushes the message to its connected viewers over WebSocket. At 100,000 viewers with 1,000 connections per server, you need 100 chat servers. Redis pub/sub handles the fan-out efficiently — one publish triggers 100 deliveries. Rate limiting per user (messages per second) prevents spam floods.