Zoom handles 300+ million daily meeting participants with real-time audio and video. Designing a video conferencing system tests your understanding of real-time media transport (WebRTC), server architectures (SFU vs MCU), screen sharing, recording, and scaling to thousands of simultaneous meetings. This guide covers the architecture for a system design interview.
WebRTC and Real-Time Media
WebRTC (Web Real-Time Communication) is the standard for browser-based real-time audio and video. Three components: (1) Media capture — access the camera and microphone via browser APIs. (2) Peer connection — establish a direct or relayed media stream between participants. ICE (Interactive Connectivity Establishment) finds the best network path: direct peer-to-peer if possible, STUN (Session Traversal Utilities for NAT) to discover the public IP behind NAT, or TURN (Traversal Using Relays around NAT) to relay through a server when a direct connection is impossible (e.g., corporate firewalls). (3) Signaling — exchange connection metadata (SDP offers/answers carrying codec preferences and IP candidates) via a signaling server, typically over WebSocket. WebRTC itself does not define signaling; the application implements it.

Media transport uses RTP (Real-time Transport Protocol) over UDP. UDP is preferred over TCP because, for real-time media, dropped packets are better than delayed packets: a lost video frame causes a brief glitch, while a retransmitted frame arrives too late to be useful. Codecs: VP8/VP9/AV1 for video, Opus for audio. Opus adapts its bitrate dynamically (6-510 kbps) based on network conditions.
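Since WebRTC leaves signaling to the application, a minimal sketch helps make the flow concrete. The relay below is a hypothetical in-memory stand-in for a WebSocket signaling server (all names are illustrative, not a real API): it routes SDP offers/answers and ICE candidates between two peers.

```python
class SignalingRelay:
    """Routes SDP offers/answers and ICE candidates between peers.
    In production this would be a WebSocket server; any transport works."""

    def __init__(self):
        self.inboxes = {}  # peer_id -> list of pending messages

    def register(self, peer_id):
        self.inboxes[peer_id] = []

    def send(self, from_peer, to_peer, msg_type, payload):
        # msg_type is one of "offer", "answer", "ice-candidate"
        self.inboxes[to_peer].append(
            {"from": from_peer, "type": msg_type, "payload": payload}
        )

    def poll(self, peer_id):
        # Drain and return this peer's pending messages
        msgs, self.inboxes[peer_id] = self.inboxes[peer_id], []
        return msgs


# Typical exchange: Alice sends an SDP offer, Bob replies with an answer,
# then both trickle ICE candidates until a usable path is found.
relay = SignalingRelay()
relay.register("alice")
relay.register("bob")
relay.send("alice", "bob", "offer", "v=0 ... (SDP with codec preferences)")
offer = relay.poll("bob")[0]
relay.send("bob", "alice", "answer", "v=0 ... (SDP answer)")
```

Once both peers hold each other's SDP and ICE candidates, the media path is negotiated directly (or via TURN) and the signaling channel goes quiet.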
SFU vs MCU Architecture
For multi-party calls (3+ participants), pure peer-to-peer is impractical: each participant sends their stream to every other participant. With 10 participants, each sends 9 streams and receives 9 streams = 90 total streams. Bandwidth and CPU explode. Two server architectures solve this.

SFU (Selective Forwarding Unit): each participant sends one stream to the SFU server, which forwards each stream to all other participants. With 10 participants, each sends 1 stream and receives 9. The SFU does no transcoding — it selects which streams and quality levels to forward based on each receiver's bandwidth and layout. Server CPU stays low (just forwarding). Zoom, Google Meet, and most modern platforms use an SFU.

MCU (Multipoint Control Unit): each participant sends one stream to the MCU, which decodes all streams, composites them into a single mixed video (grid layout), and sends one stream back to each participant. Each participant receives 1 stream regardless of participant count, so client bandwidth is low, but server CPU is high (decode + composite + encode per participant). Used for low-bandwidth participants and telephone-bridged calls. Most platforms use an SFU for the primary experience and an MCU as a fallback for constrained clients.
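The stream counts above can be expressed directly. A small sketch of the per-topology math (counting unidirectional streams on the network):

```python
def mesh_streams(n):
    # Pure peer-to-peer: every participant sends to every other one.
    return n * (n - 1)

def sfu_streams(n):
    # Each participant: 1 uplink to the SFU, n-1 downlinks from it.
    return n + n * (n - 1)

def mcu_streams(n):
    # Each participant: 1 uplink, 1 composited downlink.
    return 2 * n

print(mesh_streams(10))  # 90  (matches the 10-participant example)
print(sfu_streams(10))   # 100 total, but only 1 uplink per client
print(mcu_streams(10))   # 20
```

The key point is not the total but the per-client uplink: mesh requires n-1 encodes and uplinks per client, while SFU and MCU require exactly one.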
Simulcast and Adaptive Quality
Participants have different network conditions. A user on fiber can receive 1080p; a user on cellular should receive 360p. Simulcast: each sender encodes their video at multiple quality levels simultaneously (e.g., 1080p, 720p, 360p) and sends all of them to the SFU. The SFU selects the appropriate quality for each receiver based on their bandwidth, screen size, and layout position. A user in the speaker spotlight view receives the speaker in 1080p and other participants as 360p thumbnails. The SFU switches streams dynamically without any transcoding; it just selects which simulcast layer to forward.

SVC (Scalable Video Coding): the video is encoded in layers. The base layer is low quality; enhancement layers add quality on top. The SFU drops enhancement layers for bandwidth-constrained receivers. More efficient than simulcast (one encoding instead of three) but more complex. VP9 SVC and AV1 SVC support this.

Bandwidth estimation: each receiver's available bandwidth is continuously estimated (using RTCP feedback and the Google Congestion Control algorithm). The SFU uses this estimate to select the appropriate quality level. If bandwidth drops mid-call, the SFU immediately switches to a lower layer.
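The SFU's per-receiver decision can be sketched as: pick the highest simulcast layer whose bitrate fits under the receiver's bandwidth estimate. The layer bitrates below are illustrative assumptions, not Zoom's actual encoding ladder:

```python
# (layer name, approximate bitrate in kbps), highest quality first.
# Illustrative values only.
LAYERS = [
    ("1080p", 3000),
    ("720p", 1500),
    ("360p", 500),
]

def select_layer(estimated_kbps, headroom=0.9):
    """Choose the best layer fitting within ~90% of the bandwidth estimate.
    The headroom margin leaves room for audio and estimate error."""
    budget = estimated_kbps * headroom
    for name, bitrate in LAYERS:
        if bitrate <= budget:
            return name
    return LAYERS[-1][0]  # always forward at least the base layer

print(select_layer(5000))  # 1080p
print(select_layer(2000))  # 720p
print(select_layer(1200))  # 360p
```

This is pure selection, no transcoding: when the estimate changes, the SFU just starts forwarding packets from a different layer.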
Screen Sharing and Recording
Screen sharing: the browser captures the screen (or a specific window/tab) via the Screen Capture API. The captured frames are encoded as a video stream and sent to the SFU like a camera stream. Screen content differs from camera content: it has sharp text, low motion, and infrequent changes. Encoding is optimized accordingly: higher resolution (1080p+) at a lower frame rate (5-15 fps instead of 30). Content-adaptive encoding detects when the screen is static (no changes) and reduces bitrate to near zero.

Recording: a recording bot joins the meeting as a participant. It receives all media streams from the SFU and: (1) Mixes audio from all participants into a single audio track. (2) Composites video into a grid layout (or active speaker view). (3) Encodes the mixed output as MP4 (H.264 video + AAC audio). (4) Uploads it to S3 for storage. Recording can also be done server-side: the SFU sends all streams to a recording service that performs the mixing. Cloud recordings are stored per meeting and available for download/playback after the meeting ends. In compliance-regulated industries (healthcare, finance), recording may be mandatory and recordings must be encrypted at rest.
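The audio-mix step of the recording bot is the simplest piece to make concrete: sum each participant's PCM samples and clamp to the 16-bit range. A minimal sketch (real mixers also apply gain normalization to avoid constant clipping):

```python
def mix_audio(tracks):
    """Mix equal-length lists of signed 16-bit PCM samples into one track."""
    mixed = []
    for samples in zip(*tracks):
        s = sum(samples)
        # Clamp to the int16 range so loud overlapping speech clips
        # instead of wrapping around.
        mixed.append(max(-32768, min(32767, s)))
    return mixed

alice = [1000, -2000, 30000]
bob = [500, 500, 10000]
print(mix_audio([alice, bob]))  # [1500, -1500, 32767]
```

Video compositing works analogously but per frame: decode each participant's frame, scale it into its grid cell, and encode the composed canvas as one output stream.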
Scaling to Millions of Concurrent Meetings
Each meeting runs on a dedicated SFU instance (or a set of instances for large meetings). Scaling: (1) Meeting routing: when a user creates a meeting, the control plane assigns it to an SFU in the nearest region. A distributed meeting directory maps meeting_id -> SFU address; participants connect to the assigned SFU. (2) Horizontal scaling: each SFU handles 50-200 simultaneous meetings (depending on participant count per meeting). Add more SFUs as demand grows, with Kubernetes auto-scaling on CPU utilization. (3) Large meetings (100+ participants): cascade multiple SFUs. The primary SFU receives media from speakers; secondary SFUs subscribe to the primary and distribute to their group of viewers, forming a tree of SFUs. (4) Webinar mode (1000+ viewers): speakers send media to the SFU; viewers receive via a CDN-like fan-out (HLS/DASH streaming for non-interactive viewers). Only panelists use WebRTC. (5) Global distribution: deploy SFUs in multiple regions. For cross-region meetings, participants connect to the nearest SFU, and SFUs peer with each other to exchange streams across regions. This adds inter-region latency (~80ms US-Europe) but keeps each participant's local network hop minimal. (6) Breakout rooms: create sub-meetings on separate SFU instances. Participants are moved between the main meeting and breakout SFUs; the host can broadcast messages to all breakout rooms.
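The meeting-routing step (1) can be sketched as a directory that assigns a new meeting to the least-loaded SFU in the caller's region and resolves meeting_id -> SFU for joining participants. All names here are illustrative, not an actual Zoom API:

```python
class MeetingDirectory:
    """Control-plane sketch: assign meetings to SFUs, resolve for joins."""

    def __init__(self, sfus_by_region):
        # e.g. {"us-east": {"sfu-1": 3, "sfu-2": 1}}, value = hosted meetings
        self.sfus = sfus_by_region
        self.meetings = {}  # meeting_id -> (region, sfu_id)

    def create_meeting(self, meeting_id, region):
        load = self.sfus[region]
        sfu_id = min(load, key=load.get)  # least-loaded SFU in the region
        load[sfu_id] += 1
        self.meetings[meeting_id] = (region, sfu_id)
        return sfu_id

    def resolve(self, meeting_id):
        # Joining participants look up which SFU hosts their meeting.
        return self.meetings[meeting_id]


directory = MeetingDirectory({"us-east": {"sfu-1": 3, "sfu-2": 1}})
print(directory.create_meeting("m-42", "us-east"))  # sfu-2 (least loaded)
print(directory.resolve("m-42"))                    # ('us-east', 'sfu-2')
```

In practice the directory itself is distributed (e.g., a replicated key-value store) so that any region can resolve any meeting, and load is measured by participant count and CPU rather than a simple meeting counter.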