Introduction
Live streaming requires low-latency ingest, real-time transcoding to multiple bitrates, and global CDN distribution to millions of concurrent viewers. The system must handle unpredictable stream starts, bursty viewer spikes, and sub-second segment availability while remaining cost-efficient for the long tail of low-viewer streams.
Ingest Pipeline
Streamers push an RTMP stream to the ingest edge server geographically closest to them. The ingest server validates the stream key and maps it to a channel_id. It then splits the incoming stream into 2-second GOPs (Groups of Pictures), where each GOP begins with a keyframe to allow independent decoding. Completed GOPs are published to a stream broker — either an internal Kafka cluster or a media server cluster such as SRS — for downstream consumption by transcoding workers.
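The GOP-cutting step above can be sketched as follows. This is a minimal illustration, assuming decoded frames arrive as simple dicts with a timestamp and keyframe flag; `publish` stands in for the Kafka/SRS producer, and the 2-second target comes from the design above.

```python
# Sketch of GOP splitting on the ingest server: cut the frame stream
# into groups that start on a keyframe and span roughly GOP_SECONDS,
# handing each completed GOP to publish() (a broker stand-in here).
GOP_SECONDS = 2.0

def split_gops(frames, publish):
    current, gop_start = [], None
    for frame in frames:
        if frame["keyframe"] and current and frame["ts"] - gop_start >= GOP_SECONDS:
            publish(current)          # completed GOP goes to the broker
            current, gop_start = [], None
        if gop_start is None:
            gop_start = frame["ts"]
        current.append(frame)
    if current:
        publish(current)              # flush the tail when the stream ends
```

Because each GOP opens with a keyframe, a transcoding worker can pick up any GOP from the broker and decode it independently, with no state from earlier GOPs.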
Transcoding
A dedicated transcoding cluster picks up each GOP and re-encodes it to multiple renditions: 1080p at 6 Mbps, 720p at 3 Mbps, 480p at 1.5 Mbps, and 360p at 0.8 Mbps. Encoding is GPU-accelerated using NVENC (NVIDIA) or AMD VCE to minimize latency and cost per stream. Output segments are written to S3 and the HLS manifest (.m3u8) is updated every 2 seconds. End-to-end latency from capture to viewer is 10–30 seconds with standard HLS, or 2–4 seconds with Low-Latency HLS or a WebRTC relay path.
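The rendition ladder above can be expressed as data that drives one encoder invocation per rung. The sketch below builds illustrative ffmpeg command lines; the exact flags (the `h264_nvenc` encoder, HLS muxer options, keyframe expression) are assumptions, and a production pipeline would additionally tune rate control, GOP alignment across renditions, and audio.

```python
# Sketch: one ffmpeg command per rendition in the ladder described above.
LADDER = [  # (name, width, height, bitrate)
    ("1080p", 1920, 1080, "6M"),
    ("720p", 1280, 720, "3M"),
    ("480p", 854, 480, "1.5M"),
    ("360p", 640, 360, "0.8M"),
]

def ffmpeg_cmd(input_url, out_dir, name, w, h, bitrate):
    return [
        "ffmpeg", "-i", input_url,
        "-c:v", "h264_nvenc",                            # GPU encode via NVENC
        "-b:v", bitrate,
        "-vf", f"scale={w}:{h}",
        "-force_key_frames", "expr:gte(t,n_forced*2)",   # 2 s keyframe cadence
        "-f", "hls", "-hls_time", "2",
        f"{out_dir}/{name}/index.m3u8",
    ]

cmds = [ffmpeg_cmd("rtmp://ingest/live/ch1", "/tmp/out", *r) for r in LADDER]
```

Keeping the keyframe cadence identical across renditions is what makes segment boundaries line up, which ABR switching (below) depends on.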
HLS Adaptive Bitrate
The viewer’s player requests the master .m3u8, which lists all available renditions. The player selects a rendition based on measured available bandwidth, then requests 2-second .ts media segments sequentially. If bandwidth degrades, the player switches to a lower-bitrate rendition automatically without interrupting playback. Media segments are served from CDN edge nodes; the CDN pulls each segment from the S3 origin on first request. Because a completed segment is immutable and never changes, it can be cached with a long TTL; only the frequently updated media playlist needs a short one.
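The client-side selection logic can be sketched as picking the highest-bitrate rendition that fits under measured throughput times a safety factor. The ladder mirrors the one above; the 0.8 factor is an illustrative choice, not a spec value.

```python
# Sketch of ABR rendition selection, sorted high to low bitrate.
RENDITIONS = [("1080p", 6_000_000), ("720p", 3_000_000),
              ("480p", 1_500_000), ("360p", 800_000)]

def pick_rendition(measured_bps, safety=0.8):
    budget = measured_bps * safety        # headroom against bandwidth jitter
    for name, bitrate in RENDITIONS:
        if bitrate <= budget:
            return name
    return RENDITIONS[-1][0]              # floor at the lowest rung
```

Real players also factor in buffer occupancy and recent throughput variance, but the bitrate-budget comparison is the core of the decision.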
Chat and Interactions
Live chat messages are delivered via WebSocket connections to chat servers. Each channel has a logical chat room. When a user sends a message, it is published to a Redis pub/sub channel for that room. All chat servers subscribed to that channel receive the message and fan it out to their connected viewers. Per-user rate limiting (e.g., 2 messages per second) is enforced at the chat server using a sliding window counter in Redis. Incoming messages pass through an ML-based toxicity filter before being broadcast. Emotes and cheers are handled as distinct structured event types alongside plain text messages.
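The per-user sliding-window limiter can be sketched with an in-memory deque standing in for the Redis counter mentioned above (the real implementation would use Redis so all chat servers share state). The 2-messages-per-second limit is taken from the text.

```python
# Sketch of the sliding-window rate limiter (in-memory Redis stand-in).
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    def __init__(self, limit=2, window=1.0):
        self.limit, self.window = limit, window
        self.sent = defaultdict(deque)    # user_id -> recent send timestamps

    def allow(self, user_id, now=None):
        now = time.monotonic() if now is None else now
        q = self.sent[user_id]
        while q and now - q[0] >= self.window:
            q.popleft()                   # evict events outside the window
        if len(q) < self.limit:
            q.append(now)
            return True
        return False
```

Unlike a fixed-window counter, the sliding window prevents a user from sending a double burst straddling a window boundary.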
Viewer Count Estimation
Exact counting of concurrent viewers is expensive at scale due to the cardinality of active sessions. Instead, use Redis HyperLogLog: viewer players send a heartbeat every 30 seconds, and each heartbeat calls PFADD channel:{id}:viewers {viewer_id}. PFCOUNT returns the estimated unique viewer count with 0.81% standard error. Stale viewers (no heartbeat in the last 60 seconds) are handled by rolling HyperLogLog keys with a time bucket (e.g., per-minute keys merged for the display count).
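The rolling-bucket scheme can be sketched as below. Python sets stand in for Redis HyperLogLogs here (`PFADD` ≈ `set.add`, `PFMERGE` + `PFCOUNT` ≈ union size); with real Redis each key would cost at most 12 KB regardless of audience size.

```python
# Sketch of per-minute viewer buckets merged for the display count.
from collections import defaultdict

buckets = defaultdict(set)                # key "channel:{id}:viewers:{minute}"

def heartbeat(channel_id, viewer_id, ts):
    minute = int(ts // 60)
    buckets[f"channel:{channel_id}:viewers:{minute}"].add(viewer_id)

def viewer_count(channel_id, ts, lookback_minutes=2):
    # Merge the current and previous minute, so a viewer stays counted
    # for up to ~60 s after their last heartbeat, then ages out.
    minute = int(ts // 60)
    merged = set()
    for m in range(minute - lookback_minutes + 1, minute + 1):
        merged |= buckets.get(f"channel:{channel_id}:viewers:{m}", set())
    return len(merged)
```

Expiring old bucket keys (e.g. via Redis TTLs) keeps memory bounded without ever needing to delete individual viewers from a counter.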
VOD Replay
After a stream ends, a post-processing job concatenates all stored segments into a single VOD file and generates a complete HLS manifest for replay. The VOD is stored in S3 Standard for the first 30 days, then transitioned to S3 Glacier for cost efficiency. A thumbnail generation job extracts keyframes at 10-second intervals from the VOD to produce a sprite sheet used by the player’s scrubbing preview feature.
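The sprite-sheet layout can be sketched as mapping the VOD duration to thumbnail timestamps at 10-second intervals and grid coordinates on the sheet. The 10-column grid and 160×90 thumbnail size are illustrative assumptions.

```python
# Sketch: timestamps and sprite-sheet coordinates for scrub previews.
def sprite_layout(duration_s, interval=10, cols=10, thumb_w=160, thumb_h=90):
    times = list(range(0, int(duration_s), interval))
    return [
        {"t": t, "x": (i % cols) * thumb_w, "y": (i // cols) * thumb_h}
        for i, t in enumerate(times)
    ]
```

The player maps a hover position on the seek bar to the nearest timestamp, then crops the matching `(x, y)` tile out of the single sprite image, avoiding one HTTP request per thumbnail.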
Frequently Asked Questions: Live Video Streaming Platform
What are the latency tradeoffs between HLS, DASH, and LL-HLS for live video streaming?
Standard HLS targets 15-30 seconds of end-to-end latency because the player must buffer multiple full segments (typically 6-10 seconds each) before playback. DASH (Dynamic Adaptive Streaming over HTTP) achieves similar latency but is codec-agnostic and offers more flexible segment templates. Low-Latency HLS (LL-HLS, specified in the second edition of the HLS spec, draft-pantos-hls-rfc8216bis) reduces latency to 2-4 seconds by splitting segments into partial segments of a few hundred milliseconds and delivering them via blocking playlist requests and preload hints (early drafts relied on HTTP/2 push, which was later dropped). The tradeoff is increased origin load and more complex CDN configuration; LL-HLS also requires all cache layers to support partial-segment caching to avoid defeating the low-latency guarantee.
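For concreteness, an LL-HLS media playlist advertises partial segments alongside full ones. The fragment below is illustrative (segment names and part durations are invented); the `EXT-X-PART` and `EXT-X-PRELOAD-HINT` tags are the mechanism that lets a player fetch media before the full 2-second segment is finalized.

```
#EXTM3U
#EXT-X-VERSION:9
#EXT-X-TARGETDURATION:2
#EXT-X-PART-INF:PART-TARGET=0.2
#EXT-X-MEDIA-SEQUENCE:100
#EXTINF:2.0,
seg100.mp4
#EXT-X-PART:DURATION=0.2,URI="seg101.part0.mp4"
#EXT-X-PART:DURATION=0.2,URI="seg101.part1.mp4"
#EXT-X-PRELOAD-HINT:TYPE=PART,URI="seg101.part2.mp4"
```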
How does GPU acceleration improve live video transcoding throughput?
Software transcoding with x264 or x265 on CPU cores is compute-bound and typically yields 1-4x real-time speed per core for 1080p. GPU-based encoders such as NVIDIA NVENC or AMD AMF offload the motion estimation and entropy coding stages to dedicated silicon, achieving 8-30x real-time speed at a fraction of the CPU cost. This matters for live streams where the encode must complete faster than real time — a single NVENC instance can handle multiple concurrent 1080p streams that would otherwise require an entire CPU server. The quality-per-bit is slightly lower than a slow CPU encode, but for live streaming the latency constraint makes the quality tradeoff acceptable.
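A back-of-envelope version of the throughput comparison above, using the quoted speed ranges; the specific per-device numbers are illustrative assumptions, not benchmarks.

```python
# Capacity sketch: an encoder running at Nx real time sustains ~N
# concurrent live 1080p streams.
def concurrent_1080p_streams(realtime_speed_x, devices=1):
    return int(realtime_speed_x * devices)

cpu_core = concurrent_1080p_streams(2)    # ~2x real time per x264 core
nvenc    = concurrent_1080p_streams(15)   # mid-range of the 8-30x figure
```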
How do you implement WebSocket chat with Redis pub/sub fan-out at scale?
Each chat server maintains long-lived WebSocket connections to its local clients and subscribes to a Redis channel per stream (e.g. chat:stream:{stream_id}). When a viewer sends a message, their chat server publishes it to Redis. Redis broadcasts the message to all subscribed chat servers, each of which fans it out to their local WebSocket connections. This decouples horizontal scaling of the WebSocket tier from the fan-out logic. At very high subscriber counts (millions of concurrent viewers), replace the per-stream Redis channel with a tiered fan-out: a small number of relay servers subscribe to Redis and maintain WebSocket pools to downstream edge servers, which hold the actual viewer connections.
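The fan-out path can be sketched with an in-memory pub/sub object standing in for Redis; each `ChatServer` plays the role of one node in the WebSocket tier, and a plain list stands in for a client connection.

```python
# Sketch of chat fan-out: broker -> subscribed servers -> local clients.
from collections import defaultdict

class Broker:                             # stand-in for Redis pub/sub
    def __init__(self):
        self.subs = defaultdict(list)     # channel -> subscriber callbacks
    def subscribe(self, channel, cb):
        self.subs[channel].append(cb)
    def publish(self, channel, msg):
        for cb in self.subs[channel]:
            cb(msg)

class ChatServer:
    def __init__(self, broker, stream_id):
        self.clients = []                 # local "WebSocket" connections
        broker.subscribe(f"chat:stream:{stream_id}", self.fan_out)
    def fan_out(self, msg):
        for client in self.clients:
            client.append(msg)            # deliver to each local viewer
```

Note that the broker only does one delivery per subscribed server, not per viewer; the expensive last-hop fan-out is spread across the horizontally scaled WebSocket tier.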
Why use HyperLogLog for live viewer count estimation instead of exact counting?
Exact counting with a Redis Set requires O(N) memory proportional to the number of distinct viewer IDs, which becomes gigabytes for popular streams. HyperLogLog (HLL) provides a cardinality estimate with a standard error of 0.81% using at most 12 KB of memory per key, regardless of cardinality. Redis exposes PFADD to record a viewer session and PFCOUNT to read the estimate. Multiple HLL keys can be merged with PFMERGE for aggregated counts across regions. For a live viewer ticker that updates every few seconds, the sub-1% error is imperceptible to users and the fixed memory cost makes it operationally safe to maintain one HLL per active stream.
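The memory claim above is easy to verify with arithmetic. The sketch below compares the floor for an exact set of 64-bit viewer IDs against the fixed 12 KB HLL bound, for a hypothetical 5 million concurrent viewers; real Redis Sets carry additional per-entry overhead, so the true gap is larger.

```python
# Worked memory comparison: exact Set vs HyperLogLog.
viewers = 5_000_000
set_bytes = viewers * 8                   # 8 bytes per raw 64-bit ID, minimum
hll_bytes = 12 * 1024                     # Redis dense HLL upper bound
ratio = set_bytes / hll_bytes             # how many times smaller the HLL is
```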
What is the optimal CDN segment caching strategy for live HLS streams?
Set a short TTL (equal to segment duration, e.g. 6 seconds) on the live playlist manifest (*.m3u8) so players always fetch a fresh segment list, but set a long or infinite TTL on completed media segments (*.ts or *.mp4 fMP4 chunks) because segment content is immutable once written. Use surrogate keys or cache tags to allow instant purge of the manifest without touching segment objects. For LL-HLS, configure the CDN to hold (block) playlist and partial-segment requests until the object is available rather than returning stale data or errors. Place CDN PoPs close to viewer clusters and configure an origin shield to collapse the thundering herd of manifest requests from millions of simultaneous players into a single upstream fetch per PoP.
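The TTL split above might look like this at an nginx-based edge; the directive values are illustrative assumptions to tune per deployment, and a commercial CDN would express the same rules through its cache-policy configuration.

```nginx
# Illustrative edge-cache rules: short TTL for playlists, long for segments.
location ~ \.m3u8$ {
    proxy_pass http://origin;
    proxy_cache live;
    proxy_cache_valid 200 6s;        # manifest: one segment duration
    proxy_cache_lock on;             # collapse concurrent misses upstream
}
location ~ \.(ts|mp4|m4s)$ {
    proxy_pass http://origin;
    proxy_cache live;
    proxy_cache_valid 200 365d;      # completed segments are immutable
}
```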