System Design Interview: WebRTC and Real-Time Video Architecture

WebRTC Architecture Overview

WebRTC (Web Real-Time Communication) is an open standard that enables peer-to-peer audio and video in browsers and native apps without plugins. It is the underlying technology in Google Meet, Discord voice, Zoom (partially), and thousands of other video products. WebRTC handles the hard parts of real-time media: audio/video encoding, network traversal (NAT punch-through), congestion control, jitter buffering, and packet loss concealment — all in real time, typically targeting under 150 ms of end-to-end latency.

WebRTC Core Components

  • RTCPeerConnection: the main API. Manages the media and data channels between two peers, handles encryption (DTLS-SRTP mandatory), and drives the ICE (Interactive Connectivity Establishment) process to find the best network path.
  • MediaStream: represents a stream of audio/video tracks captured from camera, microphone, or screen share.
  • RTCDataChannel: a bidirectional data channel carried over the peer connection (SCTP over DTLS). Configurable reliability and ordering give it TCP-like or UDP-like delivery semantics. Used for: game state, chat messages, file transfer, screen annotations.

NAT Traversal: STUN and TURN

Most users are behind NAT (home routers, corporate firewalls). Peer-to-peer requires both peers to know each other’s public IP and port. This is harder than it sounds when both are behind NAT.


# STUN (Session Traversal Utilities for NAT):
# STUN server tells you your own public IP/port (as seen from the internet)
# Process:
1. Client sends UDP packet to STUN server (e.g., stun.l.google.com:19302)
2. STUN server responds with the client's observed public IP:port
3. Client includes this "server-reflexive candidate" in its ICE candidates
4. If both peers are behind simple NAT (no symmetric NAT), they can connect
   directly using this information

# ICE candidate types:
# host candidate: local LAN IP (192.168.1.5:54321)
# server-reflexive: public IP from STUN (203.0.113.5:54321)
# relay candidate: TURN server IP (for fallback)

# TURN (Traversal Using Relays around NAT):
# When direct peer-to-peer fails (symmetric NAT, corporate firewall),
# traffic is relayed through a TURN server
# TURN is a media relay: both peers send to the TURN server, which forwards
# Cost: TURN server bandwidth = sum of all relayed streams
# ~10-30% of WebRTC sessions require TURN relay
# TURN servers are expensive at scale; relay bandwidth is a major cost for large conferencing providers

# ICE negotiation:
1. Both peers gather candidates (host, STUN, TURN)
2. Exchange candidates via signaling server
3. ICE agent tries all candidate pairs (peer A host ↔ peer B host, etc.)
4. First successful connectivity check wins (lowest latency path)
5. Connection upgrades to DTLS-SRTP encrypted stream
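The candidate pairing in step 3 is not random: ICE ranks candidates and pairs by priority formulas (RFC 8445) so that direct host↔host paths are checked before server-reflexive ones, and relayed paths last. A sketch of those formulas (simplified; real agents also account for multiple components and local preferences per interface):

```javascript
// Candidate priority per RFC 8445 §5.1.2:
// priority = 2^24 * typePref + 2^8 * localPref + (256 - componentId)
const TYPE_PREFERENCE = { host: 126, prflx: 110, srflx: 100, relay: 0 };

function candidatePriority(type, localPref = 65535, componentId = 1) {
  return (
    2 ** 24 * TYPE_PREFERENCE[type] +
    2 ** 8 * localPref +
    (256 - componentId)
  );
}

// Pair priority per RFC 8445 §6.1.2.3: G = controlling agent's candidate
// priority, D = controlled agent's. Higher value = checked first.
// BigInt avoids precision loss, since 2^32 * G overflows 2^53.
function pairPriority(g, d) {
  const G = BigInt(g);
  const D = BigInt(d);
  return (
    (1n << 32n) * (G < D ? G : D) +
    2n * (G > D ? G : D) +
    (g > d ? 1n : 0n)
  );
}

const host = candidatePriority("host");
const srflx = candidatePriority("srflx");
const relay = candidatePriority("relay");

// host↔host outranks srflx↔srflx, which outranks relay↔relay:
console.log(pairPriority(host, host) > pairPriority(srflx, srflx)); // true
console.log(pairPriority(srflx, srflx) > pairPriority(relay, relay)); // true
```

This is why a relayed path is only selected when every direct pair's connectivity check fails, even though all pairs are probed.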

Signaling Server

WebRTC does not specify how peers find each other or exchange SDP (Session Description Protocol) offers/answers — that is signaling. Signaling is application-defined and typically uses WebSocket.


# Signaling flow (WebSocket-based):
# Peer A wants to call Peer B

1. Peer A creates RTCPeerConnection, creates offer (SDP):
   const offer = await pc.createOffer();
   await pc.setLocalDescription(offer);
   ws.send(JSON.stringify({ type: "offer", to: "peer-b-id", sdp: offer }))

2. Signaling server (WebSocket) forwards offer to Peer B

3. Peer B receives offer, creates RTCPeerConnection, sets remote description,
   creates answer:
   await pc.setRemoteDescription(offer);
   const answer = await pc.createAnswer();
   await pc.setLocalDescription(answer);
   ws.send(JSON.stringify({ type: "answer", to: "peer-a-id", sdp: answer }))

4. Peer A receives answer, sets remote description:
   await pc.setRemoteDescription(answer);

5. ICE candidates are trickled: as each peer discovers candidates,
   they send them via WebSocket; the other peer adds them:
   ws.send(JSON.stringify({ type: "ice-candidate", candidate: event.candidate }))

6. ICE negotiation completes → DTLS handshake → media flows peer-to-peer
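The server's role in steps 2 and 5 is deliberately thin: it routes opaque JSON between peer IDs and never inspects SDP or candidates. A minimal sketch of that routing logic, assuming a `peers` map from peer ID to a socket-like object with a `send` method (the WebSocket wiring itself is omitted; `routeSignal` is a hypothetical helper name):

```javascript
// Route a signaling message to its target peer. The server only reads the
// envelope fields (type, to); SDP and ICE payloads pass through untouched.
function routeSignal(peers, fromId, rawMessage) {
  const msg = JSON.parse(rawMessage);
  const target = peers.get(msg.to);
  if (!target) {
    return { ok: false, error: `unknown peer: ${msg.to}` };
  }
  // Stamp the sender so the receiver knows where to address its reply.
  target.send(JSON.stringify({ ...msg, from: fromId }));
  return { ok: true };
}

// Usage with a stub socket standing in for Peer B's WebSocket:
const peers = new Map();
const inboxB = [];
peers.set("peer-b-id", { send: (m) => inboxB.push(m) });

routeSignal(
  peers,
  "peer-a-id",
  JSON.stringify({ type: "offer", to: "peer-b-id", sdp: "<sdp>" })
);
console.log(JSON.parse(inboxB[0]).from); // "peer-a-id"
```

Because the payloads are opaque, the same routing code carries offers, answers, and trickled ICE candidates without modification.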

Topology: Mesh vs SFU vs MCU

For group calls (more than 2 participants), peer-to-peer mesh does not scale. Three topologies:

  • Mesh: every participant connects to every other participant. For N participants: N(N-1)/2 peer connections; each participant uploads N-1 streams. At 6 participants, each uploads 5 streams. At 10 participants, each uploads 9 streams → too much bandwidth for most clients. Used for small group calls (2-4 participants) in apps that want to avoid media-server costs.
  • SFU (Selective Forwarding Unit): each participant sends one upstream to the SFU server; SFU forwards the appropriate streams to each receiver without decoding/re-encoding. Participants still receive N-1 streams, but only upload one. SFU can do simulcast: client sends 3 quality levels (high/medium/low); SFU forwards the appropriate quality to each receiver based on their bandwidth. Used by: Discord, Twilio, Daily.co, Agora.
  • MCU (Multipoint Control Unit): receives all streams, decodes them, composites them into a single mixed stream, and sends one stream to each participant. Participant downloads one stream (the composite). Very expensive CPU (full decode + encode at the server). Used by: Zoom (historically), legacy conferencing systems.
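The per-participant stream counts above reduce to a small calculator. A sketch, assuming one media stream per participant at a uniform bitrate (simulcast layers ignored for simplicity):

```javascript
// Streams each participant uploads/downloads for N participants, per
// topology. MCU composites everything server-side into one stream each way.
function streamCounts(topology, n) {
  switch (topology) {
    case "mesh":
      return { upload: n - 1, download: n - 1 };
    case "sfu":
      return { upload: 1, download: n - 1 };
    case "mcu":
      return { upload: 1, download: 1 };
    default:
      throw new Error(`unknown topology: ${topology}`);
  }
}

// At 1 Mbps per stream, a 10-person mesh needs 9 Mbps of upload per client,
// while an SFU needs only 1 Mbps up (and 9 Mbps down):
console.log(streamCounts("mesh", 10).upload); // 9
console.log(streamCounts("sfu", 10).upload); // 1
console.log(streamCounts("mcu", 10).download); // 1
```

Running the numbers for a few values of N makes the crossover obvious: mesh upload grows linearly with N, which is what rules it out beyond small groups.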

# SFU architecture (modern standard):
Client A → [media] → SFU
Client B → [media] → SFU
Client C → [media] → SFU

SFU → [A's stream] → Client B, Client C
SFU → [B's stream] → Client A, Client C
SFU → [C's stream] → Client A, Client B

# Simulcast: Client A sends 3 quality layers
Client A → [1080p @ 4Mbps] → SFU → Client B (high bandwidth connection)
Client A → [720p  @ 2Mbps] → SFU → Client C (moderate bandwidth)
Client A → [360p  @ 500Kbps] → SFU → Client D (mobile data)

# SFU can also do spatial/temporal scalability with SVC-capable codecs (VP9, AV1)
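On the SFU side, the simulcast decision reduces to picking the highest layer that fits each receiver's estimated bandwidth. A sketch using the three layers from the diagram above (layer names and bitrates as in the diagram; the selection policy is illustrative):

```javascript
// Simulcast layers, highest quality first (bitrates in Kbps).
const LAYERS = [
  { name: "1080p", kbps: 4000 },
  { name: "720p", kbps: 2000 },
  { name: "360p", kbps: 500 },
];

// Pick the best layer whose bitrate fits the receiver's estimate; fall back
// to the lowest layer so a slow receiver still gets video.
function selectLayer(estimatedKbps, layers = LAYERS) {
  return (
    layers.find((l) => l.kbps <= estimatedKbps) ?? layers[layers.length - 1]
  );
}

console.log(selectLayer(2500).name); // "720p"
console.log(selectLayer(100).name); // "360p"
```

The SFU re-runs this per receiver whenever a new bandwidth estimate arrives, which is what lets one upstream serve clients on very different networks.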

Codec Selection

  • Audio: Opus — the dominant audio codec for WebRTC. Variable bitrate (6 Kbps for voice, 64 Kbps for music). Built-in packet loss concealment, forward error correction (FEC), DTX (discontinuous transmission — zero bitrate in silence). Designed specifically for real-time communication.
  • Video: VP8 (Google, universal support), VP9 (better compression, roughly 30-50% bandwidth reduction vs VP8, but weaker hardware-encoder support), H.264 (hardware encoding on iOS/Android, patent-encumbered), AV1 (best compression, slow encode, growing support). Discord uses VP8/VP9. Chrome defaults to VP8. Safari prefers H.264.
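Which codec a given call uses is a negotiation: the SDP answer effectively selects the first codec in the offerer's preference list that both sides support. A simplified sketch of that intersection (real negotiation also matches payload types and codec parameters like H.264 profiles):

```javascript
// Pick the first locally preferred codec the remote side also supports.
function negotiateCodec(localPrefs, remoteSupported) {
  const remote = new Set(remoteSupported.map((c) => c.toLowerCase()));
  return localPrefs.find((c) => remote.has(c.toLowerCase())) ?? null;
}

// A Chrome-like offerer negotiating with a Safari-like answerer:
console.log(negotiateCodec(["VP8", "VP9", "H264"], ["H264", "VP8"])); // "VP8"
// No common codec means the media section is rejected:
console.log(negotiateCodec(["AV1"], ["H264"])); // null
```

This is why cross-browser products pin their offer preference order carefully: the offerer's list, not the answerer's, usually decides the outcome.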

Adaptive Bitrate and Congestion Control


# WebRTC congestion control: GCC (Google Congestion Control)
# Based on REMB (Receiver Estimated Maximum Bitrate) and transport-cc
# The receiver estimates available bandwidth based on inter-packet arrival times
# If packets arrive increasingly late → bandwidth is constrained → send REMB
# Sender reduces bitrate; if arrival is steady → sender increases bitrate

# Bandwidth estimation cycle:
1. Receiver measures inter-packet arrival delay
2. Kalman filter estimates one-way delay gradient
3. If delay gradient > threshold → overuse detected → multiplicative decrease
   (Chrome's GCC cuts the target to ~85% of the measured receive rate)
4. Steady state → increase bitrate 5% every 1.5 seconds
5. Sender adjusts video encoder bitrate target
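The cycle above is a delay-based AIMD loop: sharp multiplicative decrease on overuse, gentle probing increase otherwise. A toy simulation in that spirit (the constants are illustrative, not Chrome's exact values, and real GCC steps on receive-rate measurements rather than the current target):

```javascript
// One controller tick: cut the target on overuse, otherwise probe upward
// toward a configured ceiling.
function stepBitrate(
  currentKbps,
  overuse,
  { decrease = 0.85, increase = 1.05, maxKbps = 4000 } = {}
) {
  if (overuse) return Math.round(currentKbps * decrease);
  return Math.min(maxKbps, Math.round(currentKbps * increase));
}

// Simulate congestion for 3 ticks, then a clear network for 2:
let rate = 2000;
for (const overuse of [true, true, true, false, false]) {
  rate = stepBitrate(rate, overuse);
}
console.log(rate); // below the starting 2000 Kbps, climbing back up
```

The asymmetry (fast down, slow up) is the point: backing off quickly protects latency, while slow probing avoids re-triggering overuse immediately.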

# Packet loss handling:
# > 10% loss: reduce bitrate aggressively
# FEC (Forward Error Correction): send redundant data to recover from loss
#   Adds overhead but reduces retransmission latency
# NACK (Negative Acknowledgment): receiver requests specific lost packets
#   Adds retransmission latency but more bandwidth-efficient than FEC
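NACK generation is driven by RTP sequence-number gaps: when packet N arrives after N-k, everything in between becomes a retransmission candidate. A sketch of gap detection (ignoring 16-bit sequence wraparound for brevity):

```javascript
// Given the sequence numbers received so far, list the missing ones that
// a receiver would bundle into a NACK feedback message.
function findMissing(receivedSeqs) {
  const sorted = [...receivedSeqs].sort((a, b) => a - b);
  const missing = [];
  for (let i = 1; i < sorted.length; i++) {
    for (let s = sorted[i - 1] + 1; s < sorted[i]; s++) {
      missing.push(s);
    }
  }
  return missing;
}

console.log(findMissing([100, 101, 104, 105])); // [102, 103]
console.log(findMissing([200, 201, 202])); // []
```

A real receiver also ages these entries out: once a missing packet is older than the jitter buffer can wait, retransmitting it is pointless and the NACK is dropped.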

Scaling SFU Infrastructure


# Meeting server selection:
# User A (London) + User B (London) + User C (New York)
# Option 1: Single SFU in London — C has higher latency
# Option 2: Cascaded SFUs — London SFU ↔ New York SFU
#   A,B → London SFU → [cascade] → New York SFU → C
# Cascade: SFU-to-SFU connection streams the media.
# Client latency is minimized by connecting to nearest SFU.
# Zoom uses this: regional media servers cascade globally.
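Client-to-SFU assignment in the cascade above can be sketched as picking the region with the lowest measured RTT, with SFU-to-SFU links created between whichever regions end up in use. The region names and RTT numbers below are illustrative:

```javascript
// Pick the SFU region with the lowest measured round-trip time.
function nearestRegion(rttByRegion) {
  return Object.entries(rttByRegion).sort(([, a], [, b]) => a - b)[0][0];
}

// Cascade links needed = one link per pair of distinct regions in use.
function cascadeLinks(assignments) {
  const regions = [...new Set(Object.values(assignments))].sort();
  const links = [];
  for (let i = 0; i < regions.length; i++) {
    for (let j = i + 1; j < regions.length; j++) {
      links.push([regions[i], regions[j]]);
    }
  }
  return links;
}

// A and B in London, C in New York (as in the example above):
const assignments = {
  A: nearestRegion({ london: 8, "new-york": 75 }),
  B: nearestRegion({ london: 12, "new-york": 80 }),
  C: nearestRegion({ london: 70, "new-york": 9 }),
};
console.log(cascadeLinks(assignments)); // [["london", "new-york"]]
```

The payoff is that each client's first hop is short; only the single inter-region link crosses the ocean, instead of every participant's stream.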

# TURN server fleet:
# TURN servers are bandwidth-limited: a 10 Gbps NIC at ~8 Mbps per relayed
#   session (e.g., 8 streams @ 1 Mbps) supports ~1250 concurrent sessions
# Distribute globally; route users to nearest TURN server
# Monitor relay usage: >30% TURN usage indicates firewall-heavy user base
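The capacity arithmetic above generalizes to a small estimator. A sketch, assuming NIC bandwidth is the binding constraint and the link is full-duplex (a relayed session consumes its bitrate on both ingress and egress, so each direction is counted against its own 10 Gbps):

```javascript
// Concurrent relayed sessions one TURN server can carry, given NIC capacity
// and per-session bandwidth in one direction.
function turnCapacity({ nicGbps, streamsPerSession, streamMbps }) {
  const perSessionMbps = streamsPerSession * streamMbps;
  return Math.floor((nicGbps * 1000) / perSessionMbps);
}

// The fleet sizing from the comment above:
console.log(
  turnCapacity({ nicGbps: 10, streamsPerSession: 8, streamMbps: 1 })
); // 1250
```

In practice operators provision well below this ceiling to leave headroom for bursts, and capacity planning multiplies the result by the expected TURN-relay fraction of sessions (the ~10-30% noted earlier).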

# Recording:
# Server-side recording: SFU receives all streams, writes to disk (Opus/VP8 → WebM/MP4)
# Composited recording: decode + composite on the server (expensive — like MCU)
# Selective recording: record only active speaker to reduce storage

Interview Questions

  • Design a video conferencing system like Zoom for 1000-person webinars
  • How do you handle the “last mile” problem — a participant with a bad connection degrades the experience for everyone?
  • Design the architecture for a 1:1 video call that works without a server (peer-to-peer)
  • How do you implement screen sharing with low latency and smooth frame rate?
  • How does Discord handle voice channels where users join and leave constantly?

Frequently Asked Questions

How does WebRTC establish a peer-to-peer connection through NAT?

Most devices are behind NAT (Network Address Translation) — they have a private IP (192.168.x.x) on their local network and share a public IP with many other devices. Two NAT'd devices cannot connect directly without help because neither knows the other's public IP and port. WebRTC uses ICE (Interactive Connectivity Establishment) with STUN and TURN servers to solve this.

STUN (Session Traversal Utilities for NAT): each peer sends a UDP packet to a STUN server. The STUN server responds with the peer's public IP and port as observed from the internet — this is the server-reflexive candidate. If both peers are behind simple (full-cone or port-restricted) NAT, they can connect peer-to-peer by exchanging these public addresses via the signaling server.

Signaling: the exchange of ICE candidates and SDP (Session Description Protocol) offers/answers happens via a WebSocket-based signaling server. The signaling server is only needed for setup — once the peer connection is established, all media flows peer-to-peer.

TURN (Traversal Using Relays around NAT): when direct peer-to-peer fails (symmetric NAT, corporate firewall that blocks UDP), both peers connect to a TURN server which relays all media between them. TURN adds latency and costs bandwidth but works in all network conditions.

ICE negotiation tries all possible candidate pairs (host-to-host, STUN-to-STUN, TURN-to-TURN) and selects the lowest-latency successful path. Roughly 10-30% of WebRTC sessions require TURN relay.

What is the difference between SFU, MCU, and mesh topology for group video calls?

These three topologies differ in where media is mixed and how much bandwidth/compute each endpoint and server uses.

Mesh: every participant establishes a direct WebRTC peer connection with every other participant. N participants → N(N-1)/2 connections; each participant uploads N-1 streams simultaneously. For 6 people, each uploads 5 streams (video at 1 Mbps each = 5 Mbps upload required). At 10+ participants, bandwidth requirements exceed typical home connections. No server required for media (only signaling). Used for small groups (2-4) where server cost matters.

SFU (Selective Forwarding Unit): each participant sends one upstream to the SFU server. The SFU forwards each participant's stream to all other participants without decoding or re-encoding it — a pure forwarding operation. Each participant still downloads N-1 streams, but only uploads one. SFU can selectively forward: only send the active speaker's video at full quality; downsample others. SFU also enables simulcast: the sender transmits 3 quality layers (720p, 360p, 180p); the SFU forwards the appropriate layer to each receiver based on their bandwidth. Modern standard: Discord, Twilio, Daily.co, Zoom (partly).

MCU (Multipoint Control Unit): the server decodes all incoming streams, composites them into a single mixed video (like a video wall), re-encodes, and sends one stream to each participant. Participant downloads one stream regardless of N. Extremely CPU-intensive at the server (decode + encode × N). Scales to many viewers but high server cost. Legacy technology; most systems have migrated to SFU. Zoom used MCU historically; has been migrating toward SFU with active speaker detection.

How does WebRTC handle network congestion and packet loss?

WebRTC uses several mechanisms to handle unreliable networks in real time.

Congestion control (GCC — Google Congestion Control): the receiver measures inter-packet arrival times. If packets arrive increasingly delayed (delay gradient increases), it signals network overload. The receiver sends REMB (Receiver Estimated Maximum Bitrate) messages or transport-cc feedback to the sender. The sender responds by reducing the video encoder bitrate target. In steady state, the sender gradually increases bitrate (5% every 1.5 seconds) until it detects overload again. This probes the available bandwidth continuously.

FEC (Forward Error Correction): the sender transmits redundant data packets alongside the original. If a packet is lost, the receiver can reconstruct it from the FEC packets without a retransmission round trip. FEC adds overhead (5-10% extra bandwidth) but eliminates the 150-300ms retransmission delay for recovered packets. Best for networks with moderate random loss (not bursty).

NACK (Negative Acknowledgment): the receiver detects a missing packet (sequence number gap) and sends a NACK to the sender, which retransmits the specific packet. Adds one round-trip of delay but is bandwidth-efficient. Best for bursty loss where the network recovers quickly.

Jitter buffer: receivers buffer incoming packets to smooth out network jitter. A 150-200ms jitter buffer absorbs timing variations without audible/visible glitches. Larger buffers reduce glitches but increase latency. The jitter buffer adapts its size based on observed jitter.

The combination: GCC prevents long-term overload, FEC handles occasional loss, NACK handles burst loss, jitter buffer handles timing variation.

