Zoom handles 300+ million daily meeting participants with real-time audio and video. Designing a video conferencing system tests your understanding of real-time media transport (WebRTC), server architectures (SFU vs MCU), screen sharing, recording, and scaling to thousands of simultaneous meetings. This guide covers the architecture for a system design interview.
WebRTC and Real-Time Media
WebRTC (Web Real-Time Communication) is the standard for browser-based real-time audio and video. Three components: (1) Media capture — access the camera and microphone via browser APIs. (2) Peer connection — establish a direct or relayed media stream between participants. ICE (Interactive Connectivity Establishment) finds the best network path: direct peer-to-peer if possible, STUN (Session Traversal Utilities for NAT) to discover the public IP behind NAT, or TURN (Traversal Using Relays around NAT) to relay through a server when a direct connection is impossible (e.g., corporate firewalls). (3) Signaling — exchange connection metadata (SDP offers/answers carrying codec preferences and IP candidates) via a signaling server, typically over WebSocket. WebRTC itself does not define signaling; the application implements it.

Media transport uses RTP (Real-time Transport Protocol) over UDP. UDP is preferred over TCP because, for real-time media, dropped packets are better than delayed packets: a lost video frame causes a brief glitch, while a retransmitted frame arrives too late to be useful. Codecs: VP8/VP9/AV1 for video, Opus for audio. Opus adapts its bitrate dynamically (6-510 kbps) based on network conditions.
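Since WebRTC leaves signaling to the application, a minimal sketch helps make the flow concrete. The relay below is a hypothetical in-memory stand-in for a WebSocket signaling server (all names are illustrative, not a real API): it routes SDP offers/answers and ICE candidates between two peers.

```python
class SignalingRelay:
    """Routes SDP offers/answers and ICE candidates between peers.
    In production this would be a WebSocket server; any transport works."""

    def __init__(self):
        self.inboxes = {}  # peer_id -> list of pending messages

    def register(self, peer_id):
        self.inboxes[peer_id] = []

    def send(self, from_peer, to_peer, msg_type, payload):
        # msg_type is one of "offer", "answer", "ice-candidate"
        self.inboxes[to_peer].append(
            {"from": from_peer, "type": msg_type, "payload": payload}
        )

    def poll(self, peer_id):
        # Drain and return this peer's pending messages
        msgs, self.inboxes[peer_id] = self.inboxes[peer_id], []
        return msgs


# Typical exchange: Alice sends an SDP offer, Bob replies with an answer,
# then both trickle ICE candidates until a usable path is found.
relay = SignalingRelay()
relay.register("alice")
relay.register("bob")
relay.send("alice", "bob", "offer", "v=0 ... (SDP with codec preferences)")
offer = relay.poll("bob")[0]
relay.send("bob", "alice", "answer", "v=0 ... (SDP answer)")
```

Once both peers hold each other's SDP and ICE candidates, the media path is negotiated directly (or via TURN) and the signaling channel goes quiet.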
SFU vs MCU Architecture
For multi-party calls (3+ participants), pure peer-to-peer is impractical: each participant sends their stream to every other participant. With 10 participants, each sends 9 streams and receives 9 streams = 90 total streams. Bandwidth and CPU explode. Two server architectures solve this.

SFU (Selective Forwarding Unit): each participant sends one stream to the SFU server, which forwards each stream to all other participants. With 10 participants, each sends 1 stream and receives 9. The SFU does no transcoding — it selects which streams and quality levels to forward based on each receiver's bandwidth and layout. Server CPU stays low (just forwarding). Zoom, Google Meet, and most modern platforms use an SFU.

MCU (Multipoint Control Unit): each participant sends one stream to the MCU, which decodes all streams, composites them into a single mixed video (grid layout), and sends one stream back to each participant. Each participant receives 1 stream regardless of participant count, so client bandwidth is low, but server CPU is high (decode + composite + encode per participant). Used for low-bandwidth participants and telephone-bridged calls. Most platforms use an SFU for the primary experience and an MCU as a fallback for constrained clients.
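The stream counts above can be expressed directly. A small sketch of the per-topology math (counting unidirectional streams on the network):

```python
def mesh_streams(n):
    # Pure peer-to-peer: every participant sends to every other one.
    return n * (n - 1)

def sfu_streams(n):
    # Each participant: 1 uplink to the SFU, n-1 downlinks from it.
    return n + n * (n - 1)

def mcu_streams(n):
    # Each participant: 1 uplink, 1 composited downlink.
    return 2 * n

print(mesh_streams(10))  # 90  (matches the 10-participant example)
print(sfu_streams(10))   # 100 total, but only 1 uplink per client
print(mcu_streams(10))   # 20
```

The key point is not the total but the per-client uplink: mesh requires n-1 encodes and uplinks per client, while SFU and MCU require exactly one.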
Simulcast and Adaptive Quality
Participants have different network conditions. A user on fiber can receive 1080p; a user on cellular should receive 360p. Simulcast: each sender encodes their video at multiple quality levels simultaneously (e.g., 1080p, 720p, 360p) and sends all of them to the SFU. The SFU selects the appropriate quality for each receiver based on their bandwidth, screen size, and layout position. A user in the speaker spotlight view receives the speaker in 1080p and other participants as 360p thumbnails. The SFU switches streams dynamically without any transcoding; it just selects which simulcast layer to forward.

SVC (Scalable Video Coding): the video is encoded in layers. The base layer is low quality; enhancement layers add quality on top. The SFU drops enhancement layers for bandwidth-constrained receivers. More efficient than simulcast (one encoding instead of three) but more complex. VP9 SVC and AV1 SVC support this.

Bandwidth estimation: each receiver's available bandwidth is continuously estimated (using RTCP feedback and the Google Congestion Control algorithm). The SFU uses this estimate to select the appropriate quality level. If bandwidth drops mid-call, the SFU immediately switches to a lower layer.
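The SFU's per-receiver decision can be sketched as: pick the highest simulcast layer whose bitrate fits under the receiver's bandwidth estimate. The layer bitrates below are illustrative assumptions, not Zoom's actual encoding ladder:

```python
# (layer name, approximate bitrate in kbps), highest quality first.
# Illustrative values only.
LAYERS = [
    ("1080p", 3000),
    ("720p", 1500),
    ("360p", 500),
]

def select_layer(estimated_kbps, headroom=0.9):
    """Choose the best layer fitting within ~90% of the bandwidth estimate.
    The headroom margin leaves room for audio and estimate error."""
    budget = estimated_kbps * headroom
    for name, bitrate in LAYERS:
        if bitrate <= budget:
            return name
    return LAYERS[-1][0]  # always forward at least the base layer

print(select_layer(5000))  # 1080p
print(select_layer(2000))  # 720p
print(select_layer(1200))  # 360p
```

This is pure selection, no transcoding: when the estimate changes, the SFU just starts forwarding packets from a different layer.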
Screen Sharing and Recording
Screen sharing: the browser captures the screen (or a specific window/tab) via the Screen Capture API. The captured frames are encoded as a video stream and sent to the SFU like a camera stream. Screen content differs from camera content: it has sharp text, low motion, and infrequent changes. Encoding is optimized accordingly: higher resolution (1080p+) at a lower frame rate (5-15 fps instead of 30). Content-adaptive encoding detects when the screen is static (no changes) and reduces bitrate to near zero.

Recording: a recording bot joins the meeting as a participant. It receives all media streams from the SFU and: (1) Mixes audio from all participants into a single audio track. (2) Composites video into a grid layout (or active speaker view). (3) Encodes the mixed output as MP4 (H.264 video + AAC audio). (4) Uploads it to S3 for storage. Recording can also be done server-side: the SFU sends all streams to a recording service that performs the mixing. Cloud recordings are stored per meeting and available for download/playback after the meeting ends. In compliance-regulated industries (healthcare, finance), recording may be mandatory and recordings must be encrypted at rest.
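The audio-mix step of the recording bot is the simplest piece to make concrete: sum each participant's PCM samples and clamp to the 16-bit range. A minimal sketch (real mixers also apply gain normalization to avoid constant clipping):

```python
def mix_audio(tracks):
    """Mix equal-length lists of signed 16-bit PCM samples into one track."""
    mixed = []
    for samples in zip(*tracks):
        s = sum(samples)
        # Clamp to the int16 range so loud overlapping speech clips
        # instead of wrapping around.
        mixed.append(max(-32768, min(32767, s)))
    return mixed

alice = [1000, -2000, 30000]
bob = [500, 500, 10000]
print(mix_audio([alice, bob]))  # [1500, -1500, 32767]
```

Video compositing works analogously but per frame: decode each participant's frame, scale it into its grid cell, and encode the composed canvas as one output stream.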
Scaling to Millions of Concurrent Meetings
Each meeting runs on a dedicated SFU instance (or a set of instances for large meetings). Scaling: (1) Meeting routing: when a user creates a meeting, the control plane assigns it to an SFU in the nearest region. A distributed meeting directory maps meeting_id -> SFU address; participants connect to the assigned SFU. (2) Horizontal scaling: each SFU handles 50-200 simultaneous meetings (depending on participant count per meeting). Add more SFUs as demand grows, with Kubernetes auto-scaling on CPU utilization. (3) Large meetings (100+ participants): cascade multiple SFUs. The primary SFU receives media from speakers; secondary SFUs subscribe to the primary and distribute to their group of viewers, forming a tree of SFUs. (4) Webinar mode (1000+ viewers): speakers send media to the SFU; viewers receive via a CDN-like fan-out (HLS/DASH streaming for non-interactive viewers). Only panelists use WebRTC. (5) Global distribution: deploy SFUs in multiple regions. For cross-region meetings, participants connect to the nearest SFU, and SFUs peer with each other to exchange streams across regions. This adds inter-region latency (~80ms US-Europe) but keeps each participant's local network hop minimal. (6) Breakout rooms: create sub-meetings on separate SFU instances. Participants are moved between the main meeting and breakout SFUs; the host can broadcast messages to all breakout rooms.
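The meeting-routing step (1) can be sketched as a directory that assigns a new meeting to the least-loaded SFU in the caller's region and resolves meeting_id -> SFU for joining participants. All names here are illustrative, not an actual Zoom API:

```python
class MeetingDirectory:
    """Control-plane sketch: assign meetings to SFUs, resolve for joins."""

    def __init__(self, sfus_by_region):
        # e.g. {"us-east": {"sfu-1": 3, "sfu-2": 1}}, value = hosted meetings
        self.sfus = sfus_by_region
        self.meetings = {}  # meeting_id -> (region, sfu_id)

    def create_meeting(self, meeting_id, region):
        load = self.sfus[region]
        sfu_id = min(load, key=load.get)  # least-loaded SFU in the region
        load[sfu_id] += 1
        self.meetings[meeting_id] = (region, sfu_id)
        return sfu_id

    def resolve(self, meeting_id):
        # Joining participants look up which SFU hosts their meeting.
        return self.meetings[meeting_id]


directory = MeetingDirectory({"us-east": {"sfu-1": 3, "sfu-2": 1}})
print(directory.create_meeting("m-42", "us-east"))  # sfu-2 (least loaded)
print(directory.resolve("m-42"))                    # ('us-east', 'sfu-2')
```

In practice the directory itself is distributed (e.g., a replicated key-value store) so that any region can resolve any meeting, and load is measured by participant count and CPU rather than a simple meeting counter.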