Live streaming is fundamentally harder than video-on-demand: there is no pre-encoded content to cache, latency between broadcaster and viewer must be minimized, and the system must handle sudden viewer spikes around popular streams. This guide covers the low-level design of each component.
RTMP Ingest
Broadcasters use streaming software (OBS Studio, Streamlabs, hardware encoders) to push a live stream via RTMP (Real-Time Messaging Protocol) to an ingest server. RTMP runs over TCP port 1935 and delivers an interleaved H.264 video + AAC audio bitstream in small chunks called RTMP chunks (default 128 bytes, negotiated in handshake).
The RTMP ingest server:
- Accepts the TCP connection and completes the RTMP handshake (C0/S0 version byte, C1/S1/C2/S2 timestamp + random bytes).
- Receives the connect and publish commands. The publish command includes the stream key in the stream name field.
- Validates the stream key against the database (described below).
- Demuxes the RTMP chunk stream into raw H.264 NAL units and AAC frames.
- Forwards the raw bitstream to the transcoding pipeline via a local socket or shared memory segment.
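The handshake step above can be sketched server-side. This is a simplified illustration of the plain (non-digest) handshake only; a production ingest server must also handle the FP9 digest variant, and the function name is illustrative, not from any particular library:

```python
import os
import struct
import time

RTMP_VERSION = 3
HANDSHAKE_SIZE = 1536  # C1/S1/C2/S2 are each 1536 bytes

def build_server_handshake(c0: bytes, c1: bytes) -> bytes:
    """Build S0 + S1 + S2 in response to the client's C0 + C1."""
    assert c0[0] == RTMP_VERSION, "unsupported RTMP version"
    assert len(c1) == HANDSHAKE_SIZE

    s0 = bytes([RTMP_VERSION])
    # S1: 4-byte timestamp, 4 zero bytes, then 1528 random bytes
    s1 = struct.pack(">II", int(time.time()), 0) + os.urandom(HANDSHAKE_SIZE - 8)
    # S2: echo the client's C1 back, proving the server received it
    s2 = c1
    return s0 + s1 + s2
```

After the server's S0/S1/S2, the client replies with C2 (an echo of S1) and the session proceeds to the connect and publish commands.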
Ingest servers are deployed in multiple regions. DNS-based geo-routing (or a custom RTMP load balancer) directs broadcasters to the nearest ingest point to minimize upload latency and packet loss. Each ingest server handles hundreds of concurrent streams; vertical scaling is limited by network bandwidth (a single 1080p60 stream is ~6 Mbps, so a 10 Gbps uplink handles ~1600 streams with headroom).
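The capacity estimate above is straightforward arithmetic; a quick sketch using the same figures (the 5% headroom factor is an assumption for illustration):

```python
# Back-of-envelope ingest capacity per server.
STREAM_BITRATE_MBPS = 6   # single 1080p60 RTMP stream
UPLINK_GBPS = 10
HEADROOM = 0.95           # assumed: keep ~5% spare for bursts and retransmits

max_streams = int(UPLINK_GBPS * 1000 * HEADROOM / STREAM_BITRATE_MBPS)
print(max_streams)  # 1583, consistent with the ~1600-stream estimate
```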
Transcoding Pipeline
The ingest server passes the raw H.264/AAC bitstream to a transcoding process (FFmpeg or a custom transcoder). Unlike VOD transcoding, live transcoding operates under a hard real-time constraint: output segments must be produced faster than they are consumed by viewers, or the stream falls behind and latency grows unboundedly.
The transcoder produces multiple output bitrates in parallel using GPU-accelerated encoding (NVENC on NVIDIA, VideoToolbox on Apple, AMF on AMD) or highly optimized software encoders (x264 with tune=zerolatency):
Live quality ladder:
360p — 600 kbps (fallback for poor connections)
720p — 2500 kbps (standard quality)
1080p — 5000 kbps (high quality; requires the broadcaster to send at least this bitrate)
The transcoder segments output into short HLS chunks (2 seconds for LL-HLS, 6 seconds for standard HLS) and writes them to a local disk buffer before pushing to the CDN origin. Keyframe interval must be aligned to segment boundaries — the broadcaster is instructed to set keyframe interval = segment duration to enable clean segment cuts without re-encoding.
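One way to picture the keyframe/segment alignment is how the transcoder command line might be assembled. This is a hedged sketch, not a verified production invocation: FFmpeg flags (-g, -hls_time, h264_nvenc) are real, but the input URL, output paths, and the simplified per-output argument layout are illustrative assumptions:

```python
# Build FFmpeg args for each rung of the live quality ladder.
LADDER = [
    ("360p",  "640x360",   "600k"),
    ("720p",  "1280x720",  "2500k"),
    ("1080p", "1920x1080", "5000k"),
]
SEGMENT_SECONDS = 2   # LL-HLS segment duration
INPUT_FPS = 60

def hls_args(name: str, resolution: str, bitrate: str) -> list[str]:
    return [
        "-s", resolution,
        "-c:v", "h264_nvenc",   # GPU encode; x264 with -tune zerolatency also works
        "-b:v", bitrate,
        # GOP size in frames = segment duration, so every segment starts
        # on a keyframe and can be cut cleanly
        "-g", str(SEGMENT_SECONDS * INPUT_FPS),
        "-f", "hls",
        "-hls_time", str(SEGMENT_SECONDS),
        f"/var/hls/{name}/index.m3u8",  # placeholder output path
    ]

cmd = ["ffmpeg", "-i", "rtmp://localhost/live/stream"]  # placeholder input
for rung in LADDER:
    cmd += hls_args(*rung)
```

A real multi-rendition pipeline would also need explicit -map options per output and audio encoding settings; the sketch focuses only on the keyframe-to-segment alignment.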
CPU budget: real-time 1080p transcoding to three output bitrates requires approximately 4–8 CPU cores per stream depending on encoder preset. GPU encoding reduces this dramatically. Live transcoding is 3–5x more CPU-intensive per output minute than VOD transcoding of the same content because the encoder has no look-ahead (it cannot buffer future frames to make better encoding decisions).
HLS Low-Latency Delivery
Standard HLS has 20–30 seconds of latency (3 segments × 6-second duration + player buffer). LL-HLS (Low-Latency HLS, introduced by Apple and since folded into the HLS specification) reduces this to 3–5 seconds using two mechanisms:
Partial segments: The transcoder publishes partial segment files (typically 200 ms duration) before the full segment is complete. The playlist is updated with #EXT-X-PART tags referencing each partial. The player can start downloading partial segments immediately without waiting for the full segment.
Playlist push / blocking playlist reload: Instead of polling the playlist every segment duration, the player sends a playlist request with a _HLS_msn and _HLS_part query parameter specifying the next expected sequence number. The server holds the response open (long-poll) until that segment/part is available, then responds immediately. This eliminates polling delay.
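The blocking reload is driven entirely by two query parameters on the playlist request. A minimal sketch of building such a request (the base URL is a placeholder; _HLS_msn and _HLS_part are the parameter names from the LL-HLS spec):

```python
from urllib.parse import urlencode

def blocking_playlist_url(base: str, next_msn: int, next_part: int) -> str:
    """Build an LL-HLS blocking playlist request.

    The server holds this request open until media sequence `next_msn`,
    part `next_part` exists, then returns the updated playlist immediately.
    """
    query = urlencode({"_HLS_msn": next_msn, "_HLS_part": next_part})
    return f"{base}?{query}"

url = blocking_playlist_url("https://cdn.example.com/live/720p/index.m3u8", 1042, 3)
# → https://cdn.example.com/live/720p/index.m3u8?_HLS_msn=1042&_HLS_part=3
```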
DASH-LL (DASH Low Latency) achieves similar results using chunked transfer encoding to stream segments as they are produced, combined with short segment durations. WebRTC-based streaming (e.g., Millicast) achieves sub-second latency but requires specialized infrastructure and does not scale to millions of viewers as cost-effectively as HLS/DASH over CDN.
Stream Key Authentication
Each channel has a unique stream key used to authenticate RTMP publishes:
stream_keys (
channel_id UUID PRIMARY KEY,
key_hash TEXT, -- bcrypt hash of the stream key
created_at TIMESTAMP,
last_used_at TIMESTAMP,
active BOOLEAN
)
The stream key is generated as a cryptographically random 32-byte value, base64url-encoded to a 43-character string. It is shown to the user once in the dashboard and never stored in plaintext — only the bcrypt hash is stored. On RTMP connect, the ingest server extracts the key from the stream name, hashes it, and compares against the stored hash.
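Key generation is a one-liner with the standard library; `secrets.token_urlsafe(32)` produces exactly the 43-character base64url string described above:

```python
import secrets

def generate_stream_key() -> str:
    # 32 cryptographically random bytes, base64url-encoded without
    # padding -> a 43-character string
    return secrets.token_urlsafe(32)

key = generate_stream_key()
assert len(key) == 43
# Only a slow hash of this value is stored (the design above uses bcrypt,
# a third-party library); the plaintext is shown to the user once and
# then discarded.
```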
Security considerations:
- Keys are rate-limited at validation: 5 failed attempts per minute per source IP triggers a temporary block to prevent key scanning.
- On Terms of Service violation, the key is immediately set active = false. The ingest server checks key validity on every RTMP publish command, so a banned stream is dropped within seconds.
- Users can regenerate their stream key from the dashboard at any time (e.g., if it is accidentally exposed). The old key is immediately invalidated.
- RTMP connections should be over RTMPS (RTMP over TLS) to prevent key interception in transit, but many encoders only support plain RTMP — this is a known ecosystem limitation.
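The per-IP rate limit on failed validations can be implemented as a sliding window of failure timestamps. A minimal in-process sketch (a real deployment would back this with Redis so the count is shared across ingest servers; the function name is illustrative):

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_FAILURES = 5  # 5 failed attempts per minute triggers a block

_failures = defaultdict(deque)  # source IP -> timestamps of recent failures

def record_failure_and_check_block(ip, now=None):
    """Record a failed key validation; return True if the IP should be blocked."""
    now = time.monotonic() if now is None else now
    window = _failures[ip]
    window.append(now)
    # Drop failures that have aged out of the one-minute window
    while window and window[0] <= now - WINDOW_SECONDS:
        window.popleft()
    return len(window) >= MAX_FAILURES
```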
Viewer Chat at Scale
Live stream chat is high-volume, low-persistence: messages are relevant for seconds, not years. The architecture prioritizes throughput and low delivery latency over durability.
Message flow:
- Viewer sends a chat message via WebSocket to a chat server.
- Chat server publishes the message to Kafka topic chat_messages, partitioned by channel_id. This provides ordering within a channel.
- A consumer reads each Kafka partition and republishes every message to Redis pub/sub (one pub/sub channel per stream channel). All chat servers subscribed to that Redis channel receive the message and fan it out to their connected viewers over WebSocket.
- Moderation bot subscribes to the same Kafka partition, runs heuristic and ML-based toxicity detection, and publishes delete events for violating messages.
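The Kafka-to-Redis hop in the flow above can be sketched as follows. The client objects are stand-ins (placeholders for a confluent-kafka consumer and a redis-py client), and the channel naming scheme is an assumption for illustration:

```python
def redis_channel(channel_id: str) -> str:
    # One Redis pub/sub channel per stream channel
    return f"chat:{channel_id}"

def fan_out(kafka_consumer, redis_client) -> None:
    """Bridge a Kafka partition into Redis pub/sub.

    Messages arrive ordered within a partition (partition key = channel_id),
    so per-channel ordering is preserved through the fan-out.
    """
    for msg in kafka_consumer:
        redis_client.publish(redis_channel(msg.key), msg.value)
```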
At peak load (a major esports event may have 500k+ concurrent viewers all chatting), chat servers must handle enormous fan-out. Chat servers are scaled horizontally and load-balanced by channel_id hash so that all connections watching a given channel are served by the same set of servers — reducing the Redis pub/sub fan-out hop count. Chat history is persisted to Cassandra for replay (paginated scrollback), but the hot path does not touch the database.
CDN Edge Delivery
The transcoding server pushes completed HLS segments to a CDN origin server via HTTP PUT. The origin stores segments in memory or on fast local SSD for the duration of the DVR window (typically 3 hours). Edge POPs cache-pull segments on first viewer request and serve subsequent requests from cache.
Cache TTL for live segments must match the segment duration — a 2-second LL-HLS partial segment has a 2-second TTL. Setting TTL longer would cause viewers to receive stale playlists and fall behind the live edge. Manifest files (.m3u8) have TTL = segment duration / 2 to ensure rapid propagation of new segments.
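The TTL rule above maps directly to Cache-Control headers set by the origin. A minimal sketch (header values follow standard HTTP caching semantics; the file-extension dispatch is an assumption for illustration):

```python
SEGMENT_SECONDS = 2.0  # LL-HLS partial segment duration from above

def cache_control(path: str) -> str:
    if path.endswith(".m3u8"):
        # Manifests expire at half the segment duration so newly
        # published segments propagate to edges quickly
        ttl = SEGMENT_SECONDS / 2
    else:
        # Media segments live exactly one segment duration; longer TTLs
        # would let viewers fall behind the live edge
        ttl = SEGMENT_SECONDS
    return f"public, max-age={ttl:g}"

print(cache_control("chunk_001.m4s"))  # public, max-age=2
print(cache_control("index.m3u8"))     # public, max-age=1
```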
Edge pre-warming: when a stream starts, the ingest system notifies a CDN control plane API. The CDN pre-warms the initial playlist and first few segments to high-traffic edges in the broadcaster’s primary viewer regions. This eliminates the cold-start latency spike that would otherwise occur when thousands of viewers click "Watch" simultaneously at stream start.
Stream Health Monitoring
The ingest server emits a telemetry event every 1 second per active stream to a monitoring Kafka topic:
{
stream_id: UUID,
channel_id: UUID,
ts: UNIX_MS,
input_bitrate_kbps: INT,
dropped_frames: INT, -- frames dropped by encoder in this interval
keyframe_interval_ms: INT, -- measured distance between keyframes
av_sync_delta_ms: INT, -- audio/video sync offset
encoder_fps: FLOAT,
buffer_depth_ms: INT -- transcoder input buffer fullness
}
Alerting thresholds:
- Dropped frames > 1% of expected frames → warn broadcaster via dashboard overlay.
- Input bitrate drops below 50% of expected → alert ops, prepare for stream stall.
- Buffer depth = 0 for 3 consecutive seconds → stream stall detected, attempt auto-restart of transcoding pipeline.
- AV sync delta > 200 ms → flag for manual review (encoder misconfiguration).
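The stateless thresholds above can be evaluated per telemetry sample; a sketch, with illustrative alert names (the buffer-depth stall check needs state across three consecutive samples and is omitted here):

```python
def evaluate_health(sample: dict, expected_fps: int,
                    expected_bitrate_kbps: int) -> list[str]:
    """Apply the alerting thresholds to one 1-second telemetry sample."""
    alerts = []
    # Dropped frames > 1% of expected frames in this interval
    if sample["dropped_frames"] > 0.01 * expected_fps:
        alerts.append("warn_broadcaster:dropped_frames")
    # Input bitrate below 50% of the expected bitrate
    if sample["input_bitrate_kbps"] < 0.5 * expected_bitrate_kbps:
        alerts.append("alert_ops:low_bitrate")
    # AV sync drift beyond 200 ms suggests encoder misconfiguration
    if sample["av_sync_delta_ms"] > 200:
        alerts.append("manual_review:av_sync")
    return alerts
```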
Stream health data is also surfaced to viewers: the player reports rebuffering events to the same monitoring pipeline, enabling correlation between ingest-side issues and viewer-side impact in real time.
Concurrent Viewer Handling
The HLS pull model is inherently horizontally scalable on the CDN side. Viewers fetch segments from CDN edge POPs; the origin only sees cache miss traffic. For a popular stream with 1 million concurrent viewers fetching a new segment every 2 seconds, and a CDN cache hit rate of 99.9%, the origin sees only ~500 requests/second — manageable by a small cluster of origin servers.
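The origin-load figure follows directly from the numbers in the paragraph above:

```python
viewers = 1_000_000
segment_interval_s = 2
cache_hit_rate = 0.999

edge_rps = viewers / segment_interval_s        # 500,000 requests/s at the edges
origin_rps = edge_rps * (1 - cache_hit_rate)   # only cache misses reach origin
print(round(origin_rps))  # 500
```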
WebSocket chat servers scale separately from the video delivery path. A channel with 500k concurrent viewers requires roughly 500 chat servers at 1000 connections each (though viewer-to-chatter ratio is typically 50:1, so far fewer viewers send messages). Chat servers are load balanced by channel_id hash — all connections for the same channel are assigned to the same server group, reducing pub/sub fan-out. Kubernetes HPA scales chat server pods based on connection count and CPU metrics. When a stream ends, connections are drained gracefully and pods scale down within minutes.