WebRTC Video Calling System: Low-Level Design

A video calling system enables real-time audio and video communication between users in a browser or mobile app. WebRTC (Web Real-Time Communication) is the browser-native protocol that enables peer-to-peer media streaming — Zoom, Google Meet, and Discord all use WebRTC or derivatives. The design must address: peer connection establishment (signaling, NAT traversal), media encoding and transmission, scalability for group calls, and handling network quality degradation.

Signaling and Connection Establishment

WebRTC peers cannot connect directly without first exchanging connection information through a signaling channel. The signaling flow:

1. The caller generates an SDP Offer (Session Description Protocol): a text description of its media capabilities (supported codecs, resolutions, bitrates) and ICE candidates (network addresses).
2. The caller sends the offer to the signaling server (via WebSocket).
3. The signaling server forwards the offer to the callee.
4. The callee generates an SDP Answer with its own capabilities and ICE candidates.
5. The callee sends the answer back through the signaling server.
6. Both sides use their ICE candidates to establish a direct peer-to-peer connection.

The signaling server is a simple message relay: it doesn't process media, only carries SDP and ICE candidate messages during connection setup. Once the connection is established, media flows peer-to-peer, bypassing the signaling server. Signaling server technology: a WebSocket server for low-latency bidirectional messaging. It is stateful (it must maintain active user connections); scale it horizontally with Redis Pub/Sub for cross-server signaling.
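The relay role of the signaling server can be sketched as follows. This is a minimal in-memory model, not a production WebSocket server; the `SignalingRelay` class and message shape are illustrative assumptions, not a real WebRTC API.

```typescript
// Minimal signaling relay: routes SDP offers/answers and ICE candidates
// between peers by user ID. The server never touches media.

type SignalMessage = {
  type: "offer" | "answer" | "ice-candidate";
  from: string;
  to: string;
  payload: string; // SDP text or a serialized ICE candidate
};

class SignalingRelay {
  // userId -> delivery callback (in production: a WebSocket send)
  private clients = new Map<string, (msg: SignalMessage) => void>();

  connect(userId: string, deliver: (msg: SignalMessage) => void): void {
    this.clients.set(userId, deliver);
  }

  disconnect(userId: string): void {
    this.clients.delete(userId);
  }

  // Forward a setup message to its destination; false if the callee is offline.
  relay(msg: SignalMessage): boolean {
    const deliver = this.clients.get(msg.to);
    if (!deliver) return false;
    deliver(msg);
    return true;
  }
}
```

To scale this beyond one server, the `clients` map would be replaced by a lookup that can route via Redis Pub/Sub when the callee is connected to a different signaling node, as described above.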

NAT Traversal: STUN and TURN

Most devices are behind NAT (network address translation): they have a private IP that isn't reachable from the public internet, so WebRTC must discover routable addresses and punch through the NAT.

STUN (Session Traversal Utilities for NAT): a lightweight protocol that tells a device its public IP and port as seen from the internet. The device sends a STUN request to a STUN server (e.g., stun.l.google.com:19302); the server replies with the device's public IP:port, which is then included as an ICE candidate. STUN works for roughly 80% of NAT types (symmetric NAT blocks it).

TURN (Traversal Using Relays around NAT): when STUN fails (e.g., both peers are behind symmetric NAT), all media is relayed through a TURN server, which receives media from peer A and forwards it to peer B, and vice versa. TURN is expensive: the server must handle the call's full media bandwidth. Infrastructure: host TURN servers in geographically distributed locations (at least one per region), so a call between a Tokyo user and a Seoul user routes through a Tokyo or Seoul TURN server.

ICE (Interactive Connectivity Establishment): the framework that tries all candidate pairs (direct, STUN, TURN) and selects the best working path. ICE candidate gathering and connectivity checking typically completes in 1-3 seconds.
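The candidate-selection preference (direct > STUN-discovered > TURN-relayed) can be sketched as below. Real ICE (RFC 8445) computes a full 32-bit priority per candidate; this simplified sketch keeps only the type-preference component, and the numeric values mirror the conventional type preferences rather than a complete implementation.

```typescript
// Simplified ICE-style path selection: among candidate pairs whose
// connectivity checks succeeded, prefer host > srflx (STUN) > relay (TURN).

type CandidateType = "host" | "srflx" | "relay";

interface Candidate {
  type: CandidateType;
  address: string; // "ip:port"
}

const TYPE_PREFERENCE: Record<CandidateType, number> = {
  host: 126,  // direct local address
  srflx: 100, // server-reflexive, learned via STUN
  relay: 0,   // TURN relay, last resort
};

function selectPair(
  working: Array<[Candidate, Candidate]>
): [Candidate, Candidate] | null {
  let best: [Candidate, Candidate] | null = null;
  let bestScore = -1;
  for (const pair of working) {
    const score = TYPE_PREFERENCE[pair[0].type] + TYPE_PREFERENCE[pair[1].type];
    if (score > bestScore) {
      bestScore = score;
      best = pair;
    }
  }
  return best;
}
```

If only relay pairs survive the connectivity checks (both peers behind symmetric NAT), the relay pair is selected and all media flows through TURN, matching the fallback behavior described above.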

Media Encoding and Adaptive Bitrate

Video is encoded and transmitted over UDP rather than TCP: UDP lets the browser drop late packets instead of waiting for retransmission, which would cause visible freezing. Codecs: VP8, VP9, H.264, and AV1 for video; Opus for audio (designed for real-time communication, excellent at low bitrates).

Adaptive bitrate: WebRTC monitors network conditions (packet loss, round-trip time, available bandwidth) and adjusts encoding parameters in real time. If packet loss rises, it reduces resolution (1080p → 720p → 480p → 360p) or frame rate; if bandwidth improves, it increases quality. WebRTC's congestion control (Google's REMB, Receiver Estimated Maximum Bitrate, or the newer Transport-CC) sends feedback to the sender about available bandwidth, and the sender adapts its encoding bitrate to match.

Simulcast: the sender encodes multiple streams at different resolutions simultaneously (e.g., 720p, 360p, 180p). The SFU selects which stream to forward to each recipient based on their network capacity: a mobile user on a weak connection receives the 180p stream, while a desktop user receives 720p.
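The SFU's per-recipient simulcast decision can be sketched as a simple layer lookup: pick the highest encoded layer whose bitrate fits the receiver's estimated bandwidth. The layer bitrates below are illustrative assumptions, not values specified by WebRTC.

```typescript
// Simulcast layer selection as an SFU might perform it: forward the highest
// layer whose bitrate fits within the receiver's bandwidth estimate.

interface SimulcastLayer {
  resolution: string;
  bitrateKbps: number; // assumed target bitrate for this layer
}

// Ordered highest quality first.
const LAYERS: SimulcastLayer[] = [
  { resolution: "720p", bitrateKbps: 1500 },
  { resolution: "360p", bitrateKbps: 500 },
  { resolution: "180p", bitrateKbps: 150 },
];

function selectLayer(estimatedKbps: number): SimulcastLayer {
  for (const layer of LAYERS) {
    if (layer.bitrateKbps <= estimatedKbps) return layer;
  }
  // Even a very weak receiver still gets the lowest layer.
  return LAYERS[LAYERS.length - 1];
}
```

The sender pays the cost of encoding all three layers once; the SFU can then serve each receiver independently without any transcoding.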

Selective Forwarding Unit (SFU) for Group Calls

Peer-to-peer WebRTC works well for 1:1 calls. For group calls with N participants, a mesh topology (everyone connects to everyone) requires N*(N-1)/2 connections and N-1 upload streams per participant, which is impractical beyond 4-5 people.

SFU (Selective Forwarding Unit): a media server that receives one stream per participant and forwards (without decoding) the appropriate stream to each other participant. Each participant uploads one stream to the SFU; the SFU forwards it to all others, so each of the N participants uploads 1 stream and downloads N-1 streams. Because the SFU doesn't decode or mix, only selectively forwards packets, it adds very low latency (< 5ms).

MCU (Multipoint Control Unit): an alternative where the server decodes all streams, composites them into one (e.g., a grid layout), and re-encodes a single stream for each participant. Simpler for receivers (one stream) but with high server CPU cost and encoding latency. Zoom uses an SFU architecture with a proprietary protocol; Google Meet uses an SFU.

SFU scaling: one SFU server handles on the order of 100 concurrent calls. Scale horizontally: each call is assigned to one SFU server (or a cluster for very large calls). A cascade of SFU instances connects multiple servers for thousands of participants (webinars).
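The mesh-versus-SFU stream counts can be made concrete with the formulas above; the function names are illustrative.

```typescript
// Stream-count comparison for mesh vs. SFU topologies with n participants.

function meshConnections(n: number): number {
  return (n * (n - 1)) / 2; // one peer connection per pair
}

function meshUploadsPerPeer(n: number): number {
  return n - 1; // each peer encodes and uploads to everyone else
}

function sfuStreams(n: number): { uploadsPerPeer: number; downloadsPerPeer: number } {
  // One encoded upload to the SFU; the SFU fans it out to the other n-1 peers.
  return { uploadsPerPeer: 1, downloadsPerPeer: n - 1 };
}
```

For an 8-person call, mesh needs 28 connections and 7 uploads per participant; with an SFU, each participant uploads once and downloads 7 streams, which is why the upload side (usually the scarce resource on consumer links) stays constant as the call grows.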

Recording and Media Storage

Recording a call requires capturing media streams on a server; the SFU can forward media to a recording service alongside routing it to participants.

Recording pipeline: the SFU sends RTP packets to a recorder process, which writes raw RTP to disk as a temporary format. After the call, a transcoder (FFmpeg) converts the raw RTP streams to MP4 (H.264 + AAC), mixed into a single track, and the MP4 is uploaded to object storage (S3). Participants receive a download link. The drawback is latency: the call must finish before the recording is available (transcoding takes roughly 0.5x the call duration).

Real-time composite recording: alternatively, process media in real time during the call, rendering a composite view (a grid of participant videos) as a single MP4 stream. This is more complex and has higher CPU cost, but the recording is available immediately after the call.

Storage: a 1-hour 720p group call generates roughly 500MB-1GB of video. At 100,000 recorded calls/day, that is roughly 50-100TB of new storage per day. Use tiered storage: recent recordings on standard S3, older recordings moved to Glacier (cold storage) after 30 days.
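The storage math and the 30-day tiering rule above can be sketched as follows; the function names and the tier labels are illustrative.

```typescript
// Back-of-envelope recording storage growth and tiering, using the figures
// above: ~0.5-1 GB per 1-hour 720p call, 100,000 recorded calls/day,
// cold storage after 30 days.

function dailyStorageTB(callsPerDay: number, gbPerCall: number): number {
  return (callsPerDay * gbPerCall) / 1000; // using 1 TB = 1000 GB
}

type StorageTier = "standard" | "glacier";

function tierFor(recordingAgeDays: number): StorageTier {
  return recordingAgeDays > 30 ? "glacier" : "standard";
}
```

At 100,000 calls/day this yields 50 TB/day at 0.5 GB per call and 100 TB/day at 1 GB per call, which is what makes the cold-storage tier necessary rather than optional.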
