System Design Interview: WebRTC and Real-Time Video Architecture

WebRTC Architecture Overview

WebRTC (Web Real-Time Communication) is an open standard that enables peer-to-peer audio and video in browsers and native apps without plugins. It is the underlying technology in Google Meet, Discord voice, Zoom (partially), and thousands of other video products. WebRTC handles the hard parts of real-time media: audio/video encoding, network traversal (NAT punch-through), congestion control, jitter buffering, and packet loss concealment — all in real time, typically targeting under 150 ms of end-to-end latency.

WebRTC Core Components

  • RTCPeerConnection: the main API. Manages the media and data channels between two peers, handles encryption (DTLS-SRTP mandatory), and drives the ICE (Interactive Connectivity Establishment) process to find the best network path.
  • MediaStream: represents a stream of audio/video tracks captured from camera, microphone, or screen share.
  • RTCDataChannel: a bidirectional data channel carried over the peer connection (SCTP over DTLS). Configurable reliability and ordering give it TCP-like or UDP-like delivery semantics. Used for: game state, chat messages, file transfer, screen annotations.

NAT Traversal: STUN and TURN

Most users are behind NAT (home routers, corporate firewalls). Peer-to-peer requires both peers to know each other’s public IP and port. This is harder than it sounds when both are behind NAT.


# STUN (Session Traversal Utilities for NAT):
# STUN server tells you your own public IP/port (as seen from the internet)
# Process:
1. Client sends UDP packet to STUN server (e.g., stun.l.google.com:19302)
2. STUN server responds with the client's observed public IP:port
3. Client includes this "server-reflexive candidate" in its ICE candidates
4. If both peers are behind simple NAT (no symmetric NAT), they can connect
   directly using this information

# ICE candidate types:
# host candidate: local LAN IP (192.168.1.5:54321)
# server-reflexive: public IP from STUN (203.0.113.5:54321)
# relay candidate: TURN server IP (for fallback)

# TURN (Traversal Using Relays around NAT):
# When direct peer-to-peer fails (symmetric NAT, corporate firewall),
# traffic is relayed through a TURN server
# TURN is a media relay: both peers send to the TURN server, which forwards
# Cost: TURN server bandwidth = sum of all relayed streams
# ~10-30% of WebRTC sessions require TURN relay
# TURN servers are expensive at scale; relay bandwidth is a major cost for large conferencing providers

# ICE negotiation:
1. Both peers gather candidates (host, STUN, TURN)
2. Exchange candidates via signaling server
3. ICE agent tries all candidate pairs (peer A host ↔ peer B host, etc.)
4. First successful connectivity check wins (lowest latency path)
5. Connection upgrades to DTLS-SRTP encrypted stream
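The candidate pairing in step 3 is not random: ICE ranks candidates and pairs by priority formulas (RFC 8445) so that direct host↔host paths are checked before server-reflexive ones, and relayed paths last. A sketch of those formulas (simplified; real agents also account for multiple components and local preferences per interface):

```javascript
// Candidate priority per RFC 8445 §5.1.2:
// priority = 2^24 * typePref + 2^8 * localPref + (256 - componentId)
const TYPE_PREFERENCE = { host: 126, prflx: 110, srflx: 100, relay: 0 };

function candidatePriority(type, localPref = 65535, componentId = 1) {
  return (
    2 ** 24 * TYPE_PREFERENCE[type] +
    2 ** 8 * localPref +
    (256 - componentId)
  );
}

// Pair priority per RFC 8445 §6.1.2.3: G = controlling agent's candidate
// priority, D = controlled agent's. Higher value = checked first.
// BigInt avoids precision loss, since 2^32 * G overflows 2^53.
function pairPriority(g, d) {
  const G = BigInt(g);
  const D = BigInt(d);
  return (
    (1n << 32n) * (G < D ? G : D) +
    2n * (G > D ? G : D) +
    (g > d ? 1n : 0n)
  );
}

const host = candidatePriority("host");
const srflx = candidatePriority("srflx");
const relay = candidatePriority("relay");

// host↔host outranks srflx↔srflx, which outranks relay↔relay:
console.log(pairPriority(host, host) > pairPriority(srflx, srflx)); // true
console.log(pairPriority(srflx, srflx) > pairPriority(relay, relay)); // true
```

This is why a relayed path is only selected when every direct pair's connectivity check fails, even though all pairs are probed.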

Signaling Server

WebRTC does not specify how peers find each other or exchange SDP (Session Description Protocol) offers/answers — that is signaling. Signaling is application-defined and typically uses WebSocket.


# Signaling flow (WebSocket-based):
# Peer A wants to call Peer B

1. Peer A creates RTCPeerConnection, creates offer (SDP):
   const offer = await pc.createOffer();
   await pc.setLocalDescription(offer);
   ws.send(JSON.stringify({ type: "offer", to: "peer-b-id", sdp: offer }))

2. Signaling server (WebSocket) forwards offer to Peer B

3. Peer B receives offer, creates RTCPeerConnection, sets remote description,
   creates answer:
   await pc.setRemoteDescription(offer);
   const answer = await pc.createAnswer();
   await pc.setLocalDescription(answer);
   ws.send(JSON.stringify({ type: "answer", to: "peer-a-id", sdp: answer }))

4. Peer A receives answer, sets remote description:
   await pc.setRemoteDescription(answer);

5. ICE candidates are trickled: as each peer discovers candidates,
   they send them via WebSocket; the other peer adds them:
   ws.send(JSON.stringify({ type: "ice-candidate", candidate: event.candidate }))

6. ICE negotiation completes → DTLS handshake → media flows peer-to-peer
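The server's role in steps 2 and 5 is deliberately thin: it routes opaque JSON between peer IDs and never inspects SDP or candidates. A minimal sketch of that routing logic, assuming a `peers` map from peer ID to a socket-like object with a `send` method (the WebSocket wiring itself is omitted; `routeSignal` is a hypothetical helper name):

```javascript
// Route a signaling message to its target peer. The server only reads the
// envelope fields (type, to); SDP and ICE payloads pass through untouched.
function routeSignal(peers, fromId, rawMessage) {
  const msg = JSON.parse(rawMessage);
  const target = peers.get(msg.to);
  if (!target) {
    return { ok: false, error: `unknown peer: ${msg.to}` };
  }
  // Stamp the sender so the receiver knows where to address its reply.
  target.send(JSON.stringify({ ...msg, from: fromId }));
  return { ok: true };
}

// Usage with a stub socket standing in for Peer B's WebSocket:
const peers = new Map();
const inboxB = [];
peers.set("peer-b-id", { send: (m) => inboxB.push(m) });

routeSignal(
  peers,
  "peer-a-id",
  JSON.stringify({ type: "offer", to: "peer-b-id", sdp: "<sdp>" })
);
console.log(JSON.parse(inboxB[0]).from); // "peer-a-id"
```

Because the payloads are opaque, the same routing code carries offers, answers, and trickled ICE candidates without modification.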

Topology: Mesh vs SFU vs MCU

For group calls (more than 2 participants), peer-to-peer mesh does not scale. Three topologies:

  • Mesh: every participant connects to every other participant. For N participants: N(N-1)/2 peer connections; each participant uploads N-1 streams. At 6 participants, each uploads 5 streams. At 10 participants, each uploads 9 streams → too much bandwidth for most clients. Used for small group calls (2-4 participants) in apps that want to avoid media-server costs.
  • SFU (Selective Forwarding Unit): each participant sends one upstream to the SFU server; SFU forwards the appropriate streams to each receiver without decoding/re-encoding. Participants still receive N-1 streams, but only upload one. SFU can do simulcast: client sends 3 quality levels (high/medium/low); SFU forwards the appropriate quality to each receiver based on their bandwidth. Used by: Discord, Twilio, Daily.co, Agora.
  • MCU (Multipoint Control Unit): receives all streams, decodes them, composites them into a single mixed stream, and sends one stream to each participant. Participant downloads one stream (the composite). Very expensive CPU (full decode + encode at the server). Used by: Zoom (historically), legacy conferencing systems.
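The per-participant stream counts above reduce to a small calculator. A sketch, assuming one media stream per participant at a uniform bitrate (simulcast layers ignored for simplicity):

```javascript
// Streams each participant uploads/downloads for N participants, per
// topology. MCU composites everything server-side into one stream each way.
function streamCounts(topology, n) {
  switch (topology) {
    case "mesh":
      return { upload: n - 1, download: n - 1 };
    case "sfu":
      return { upload: 1, download: n - 1 };
    case "mcu":
      return { upload: 1, download: 1 };
    default:
      throw new Error(`unknown topology: ${topology}`);
  }
}

// At 1 Mbps per stream, a 10-person mesh needs 9 Mbps of upload per client,
// while an SFU needs only 1 Mbps up (and 9 Mbps down):
console.log(streamCounts("mesh", 10).upload); // 9
console.log(streamCounts("sfu", 10).upload); // 1
console.log(streamCounts("mcu", 10).download); // 1
```

Running the numbers for a few values of N makes the crossover obvious: mesh upload grows linearly with N, which is what rules it out beyond small groups.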

# SFU architecture (modern standard):
Client A → [media] → SFU
Client B → [media] → SFU
Client C → [media] → SFU

SFU → [A's stream] → Client B, Client C
SFU → [B's stream] → Client A, Client C
SFU → [C's stream] → Client A, Client B

# Simulcast: Client A sends 3 quality layers
Client A → [1080p @ 4Mbps] → SFU → Client B (high bandwidth connection)
Client A → [720p  @ 2Mbps] → SFU → Client C (moderate bandwidth)
Client A → [360p  @ 500Kbps] → SFU → Client D (mobile data)

# SFU can also do spatial/temporal scalability with SVC-capable codecs (VP9, AV1)
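On the SFU side, the simulcast decision reduces to picking the highest layer that fits each receiver's estimated bandwidth. A sketch using the three layers from the diagram above (layer names and bitrates as in the diagram; the selection policy is illustrative):

```javascript
// Simulcast layers, highest quality first (bitrates in Kbps).
const LAYERS = [
  { name: "1080p", kbps: 4000 },
  { name: "720p", kbps: 2000 },
  { name: "360p", kbps: 500 },
];

// Pick the best layer whose bitrate fits the receiver's estimate; fall back
// to the lowest layer so a slow receiver still gets video.
function selectLayer(estimatedKbps, layers = LAYERS) {
  return (
    layers.find((l) => l.kbps <= estimatedKbps) ?? layers[layers.length - 1]
  );
}

console.log(selectLayer(2500).name); // "720p"
console.log(selectLayer(100).name); // "360p"
```

The SFU re-runs this per receiver whenever a new bandwidth estimate arrives, which is what lets one upstream serve clients on very different networks.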

Codec Selection

  • Audio: Opus — the dominant audio codec for WebRTC. Variable bitrate (6 Kbps for voice, 64 Kbps for music). Built-in packet loss concealment, forward error correction (FEC), DTX (discontinuous transmission — zero bitrate in silence). Designed specifically for real-time communication.
  • Video: VP8 (Google, universal support), VP9 (better compression, roughly 30-50% bandwidth reduction vs VP8, but weaker hardware-encoder support), H.264 (hardware encoding on iOS/Android, patent-encumbered), AV1 (best compression, slow encode, growing support). Discord uses VP8/VP9. Chrome defaults to VP8. Safari prefers H.264.
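Which codec a given call uses is a negotiation: the SDP answer effectively selects the first codec in the offerer's preference list that both sides support. A simplified sketch of that intersection (real negotiation also matches payload types and codec parameters like H.264 profiles):

```javascript
// Pick the first locally preferred codec the remote side also supports.
function negotiateCodec(localPrefs, remoteSupported) {
  const remote = new Set(remoteSupported.map((c) => c.toLowerCase()));
  return localPrefs.find((c) => remote.has(c.toLowerCase())) ?? null;
}

// A Chrome-like offerer negotiating with a Safari-like answerer:
console.log(negotiateCodec(["VP8", "VP9", "H264"], ["H264", "VP8"])); // "VP8"
// No common codec means the media section is rejected:
console.log(negotiateCodec(["AV1"], ["H264"])); // null
```

This is why cross-browser products pin their offer preference order carefully: the offerer's list, not the answerer's, usually decides the outcome.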

Adaptive Bitrate and Congestion Control


# WebRTC congestion control: GCC (Google Congestion Control)
# Based on REMB (Receiver Estimated Maximum Bitrate) and transport-cc
# The receiver estimates available bandwidth based on inter-packet arrival times
# If packets arrive increasingly late → bandwidth is constrained → send REMB
# Sender reduces bitrate; if arrival is steady → sender increases bitrate

# Bandwidth estimation cycle:
1. Receiver measures inter-packet arrival delay
2. Kalman filter estimates one-way delay gradient
3. If delay gradient > threshold → overuse detected → multiplicative decrease
   (Chrome's GCC cuts the target to ~85% of the measured receive rate)
4. Steady state → increase bitrate 5% every 1.5 seconds
5. Sender adjusts video encoder bitrate target
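The cycle above is a delay-based AIMD loop: sharp multiplicative decrease on overuse, gentle probing increase otherwise. A toy simulation in that spirit (the constants are illustrative, not Chrome's exact values, and real GCC steps on receive-rate measurements rather than the current target):

```javascript
// One controller tick: cut the target on overuse, otherwise probe upward
// toward a configured ceiling.
function stepBitrate(
  currentKbps,
  overuse,
  { decrease = 0.85, increase = 1.05, maxKbps = 4000 } = {}
) {
  if (overuse) return Math.round(currentKbps * decrease);
  return Math.min(maxKbps, Math.round(currentKbps * increase));
}

// Simulate congestion for 3 ticks, then a clear network for 2:
let rate = 2000;
for (const overuse of [true, true, true, false, false]) {
  rate = stepBitrate(rate, overuse);
}
console.log(rate); // below the starting 2000 Kbps, climbing back up
```

The asymmetry (fast down, slow up) is the point: backing off quickly protects latency, while slow probing avoids re-triggering overuse immediately.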

# Packet loss handling:
# > 10% loss: reduce bitrate aggressively
# FEC (Forward Error Correction): send redundant data to recover from loss
#   Adds overhead but reduces retransmission latency
# NACK (Negative Acknowledgment): receiver requests specific lost packets
#   Adds retransmission latency but more bandwidth-efficient than FEC
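NACK generation is driven by RTP sequence-number gaps: when packet N arrives after N-k, everything in between becomes a retransmission candidate. A sketch of gap detection (ignoring 16-bit sequence wraparound for brevity):

```javascript
// Given the sequence numbers received so far, list the missing ones that
// a receiver would bundle into a NACK feedback message.
function findMissing(receivedSeqs) {
  const sorted = [...receivedSeqs].sort((a, b) => a - b);
  const missing = [];
  for (let i = 1; i < sorted.length; i++) {
    for (let s = sorted[i - 1] + 1; s < sorted[i]; s++) {
      missing.push(s);
    }
  }
  return missing;
}

console.log(findMissing([100, 101, 104, 105])); // [102, 103]
console.log(findMissing([200, 201, 202])); // []
```

A real receiver also ages these entries out: once a missing packet is older than the jitter buffer can wait, retransmitting it is pointless and the NACK is dropped.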

Scaling SFU Infrastructure


# Meeting server selection:
# User A (London) + User B (London) + User C (New York)
# Option 1: Single SFU in London — C has higher latency
# Option 2: Cascaded SFUs — London SFU ↔ New York SFU
#   A,B → London SFU → [cascade] → New York SFU → C
# Cascade: SFU-to-SFU connection streams the media.
# Client latency is minimized by connecting to nearest SFU.
# Zoom uses this: regional media servers cascade globally.
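Client-to-SFU assignment in the cascade above can be sketched as picking the region with the lowest measured RTT, with SFU-to-SFU links created between whichever regions end up in use. The region names and RTT numbers below are illustrative:

```javascript
// Pick the SFU region with the lowest measured round-trip time.
function nearestRegion(rttByRegion) {
  return Object.entries(rttByRegion).sort(([, a], [, b]) => a - b)[0][0];
}

// Cascade links needed = one link per pair of distinct regions in use.
function cascadeLinks(assignments) {
  const regions = [...new Set(Object.values(assignments))].sort();
  const links = [];
  for (let i = 0; i < regions.length; i++) {
    for (let j = i + 1; j < regions.length; j++) {
      links.push([regions[i], regions[j]]);
    }
  }
  return links;
}

// A and B in London, C in New York (as in the example above):
const assignments = {
  A: nearestRegion({ london: 8, "new-york": 75 }),
  B: nearestRegion({ london: 12, "new-york": 80 }),
  C: nearestRegion({ london: 70, "new-york": 9 }),
};
console.log(cascadeLinks(assignments)); // [["london", "new-york"]]
```

The payoff is that each client's first hop is short; only the single inter-region link crosses the ocean, instead of every participant's stream.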

# TURN server fleet:
# TURN servers are bandwidth-limited: a 10 Gbps NIC at ~8 Mbps per relayed
#   session (e.g., 8 streams @ 1 Mbps) supports ~1250 concurrent sessions
# Distribute globally; route users to nearest TURN server
# Monitor relay usage: >30% TURN usage indicates firewall-heavy user base
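The capacity arithmetic above generalizes to a small estimator. A sketch, assuming NIC bandwidth is the binding constraint and the link is full-duplex (a relayed session consumes its bitrate on both ingress and egress, so each direction is counted against its own 10 Gbps):

```javascript
// Concurrent relayed sessions one TURN server can carry, given NIC capacity
// and per-session bandwidth in one direction.
function turnCapacity({ nicGbps, streamsPerSession, streamMbps }) {
  const perSessionMbps = streamsPerSession * streamMbps;
  return Math.floor((nicGbps * 1000) / perSessionMbps);
}

// The fleet sizing from the comment above:
console.log(
  turnCapacity({ nicGbps: 10, streamsPerSession: 8, streamMbps: 1 })
); // 1250
```

In practice operators provision well below this ceiling to leave headroom for bursts, and capacity planning multiplies the result by the expected TURN-relay fraction of sessions (the ~10-30% noted earlier).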

# Recording:
# Server-side recording: SFU receives all streams, writes to disk (Opus/VP8 → WebM/MP4)
# Composited recording: decode + composite on the server (expensive — like MCU)
# Selective recording: record only active speaker to reduce storage

Interview Questions

  • Design a video conferencing system like Zoom for 1000-person webinars
  • How do you handle the “last mile” problem — a participant with a bad connection degrades the experience for everyone?
  • Design the architecture for a 1:1 video call that works without a server (peer-to-peer)
  • How do you implement screen sharing with low latency and smooth frame rate?
  • How does Discord handle voice channels where users join and leave constantly?

Frequently Asked Questions

How does WebRTC establish a peer-to-peer connection through NAT?

Most devices are behind NAT (Network Address Translation) — they have a private IP (192.168.x.x) on their local network and share a public IP with many other devices. Two NAT'd devices cannot connect directly without help because neither knows the other's public IP and port. WebRTC uses ICE (Interactive Connectivity Establishment) with STUN and TURN servers to solve this.

STUN (Session Traversal Utilities for NAT): each peer sends a UDP packet to a STUN server. The STUN server responds with the peer's public IP and port as observed from the internet — this is the server-reflexive candidate. If both peers are behind simple (full-cone or port-restricted) NAT, they can connect peer-to-peer by exchanging these public addresses via the signaling server.

Signaling: the exchange of ICE candidates and SDP (Session Description Protocol) offers/answers happens via a WebSocket-based signaling server. The signaling server is only needed for setup — once the peer connection is established, all media flows peer-to-peer.

TURN (Traversal Using Relays around NAT): when direct peer-to-peer fails (symmetric NAT, corporate firewall that blocks UDP), both peers connect to a TURN server which relays all media between them. TURN adds latency and costs bandwidth but works in all network conditions.

ICE negotiation tries all possible candidate pairs (host-to-host, STUN-to-STUN, TURN-to-TURN) and selects the lowest-latency successful path. Roughly 10-30% of WebRTC sessions require TURN relay.

What is the difference between SFU, MCU, and mesh topology for group video calls?

These three topologies differ in where media is mixed and how much bandwidth/compute each endpoint and server uses.

Mesh: every participant establishes a direct WebRTC peer connection with every other participant. N participants → N(N-1)/2 connections; each participant uploads N-1 streams simultaneously. For 6 people, each uploads 5 streams (video at 1 Mbps each = 5 Mbps upload required). At 10+ participants, bandwidth requirements exceed typical home connections. No server required for media (only signaling). Used for small groups (2-4) where server cost matters.

SFU (Selective Forwarding Unit): each participant sends one upstream to the SFU server. The SFU forwards each participant's stream to all other participants without decoding or re-encoding it — a pure forwarding operation. Each participant still downloads N-1 streams, but only uploads one. SFU can selectively forward: only send the active speaker's video at full quality; downsample others. SFU also enables simulcast: the sender transmits 3 quality layers (720p, 360p, 180p); the SFU forwards the appropriate layer to each receiver based on their bandwidth. Modern standard: Discord, Twilio, Daily.co, Zoom (partly).

MCU (Multipoint Control Unit): the server decodes all incoming streams, composites them into a single mixed video (like a video wall), re-encodes, and sends one stream to each participant. Participant downloads one stream regardless of N. Extremely CPU-intensive at the server (decode + encode × N). Scales to many viewers but high server cost. Legacy technology; most systems have migrated to SFU. Zoom used MCU historically; has been migrating toward SFU with active speaker detection.

How does WebRTC handle network congestion and packet loss?

WebRTC uses several mechanisms to handle unreliable networks in real time.

Congestion control (GCC — Google Congestion Control): the receiver measures inter-packet arrival times. If packets arrive increasingly delayed (delay gradient increases), it signals network overload. The receiver sends REMB (Receiver Estimated Maximum Bitrate) messages or transport-cc feedback to the sender. The sender responds by reducing the video encoder bitrate target. In steady state, the sender gradually increases bitrate (5% every 1.5 seconds) until it detects overload again. This probes the available bandwidth continuously.

FEC (Forward Error Correction): the sender transmits redundant data packets alongside the original. If a packet is lost, the receiver can reconstruct it from the FEC packets without a retransmission round trip. FEC adds overhead (5-10% extra bandwidth) but eliminates the 150-300ms retransmission delay for recovered packets. Best for networks with moderate random loss (not bursty).

NACK (Negative Acknowledgment): the receiver detects a missing packet (sequence number gap) and sends a NACK to the sender, which retransmits the specific packet. Adds one round-trip of delay but is bandwidth-efficient. Best for bursty loss where the network recovers quickly.

Jitter buffer: receivers buffer incoming packets to smooth out network jitter. A 150-200ms jitter buffer absorbs timing variations without audible/visible glitches. Larger buffers reduce glitches but increase latency. The jitter buffer adapts its size based on observed jitter.

The combination: GCC prevents long-term overload, FEC handles occasional loss, NACK handles burst loss, jitter buffer handles timing variation.

