Designing a video conferencing system like Zoom or Google Meet is a complex system design problem that tests your understanding of real-time media protocols, WebRTC architecture, signaling, and large-scale media routing. This question appears in interviews at Zoom, Google, Microsoft (Teams), and Twitch.
Requirements Clarification
- Scale: 300M daily meeting participants, 1M concurrent calls, up to 1000 participants per call
- Latency: End-to-end audio/video latency <150ms for real-time feeling; <300ms tolerable
- Quality: Adaptive bitrate based on network conditions; prioritize audio over video on poor connections
- Features: HD video, screen sharing, chat, recording, live captions, breakout rooms
- Reliability: Reconnect within 2s on transient failures; no data loss for chat/recording
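Before choosing a media architecture, it helps to sanity-check what these numbers imply for server capacity. A back-of-envelope sketch; the per-publisher bitrate, average call size, and NIC figures below are illustrative assumptions, not Zoom's real numbers:

```python
# Back-of-envelope SFU capacity estimate (all numbers are assumptions
# chosen for illustration).
CONCURRENT_CALLS = 1_000_000
AVG_PARTICIPANTS = 5
AVG_UPLINK_KBPS = 1_500   # one simulcast publisher: ~1.5 Mbps total
SFU_NIC_GBPS = 10         # usable NIC capacity per SFU node
NIC_UTILIZATION = 0.6     # headroom for bursts and retransmits

streams_in = CONCURRENT_CALLS * AVG_PARTICIPANTS       # uplinks into SFUs
ingress_gbps = streams_in * AVG_UPLINK_KBPS / 1e6      # kbps -> Gbps
# Each uplink is forwarded to roughly (participants - 1) subscribers
egress_gbps = ingress_gbps * (AVG_PARTICIPANTS - 1)
total_gbps = ingress_gbps + egress_gbps
sfu_nodes = total_gbps / (SFU_NIC_GBPS * NIC_UTILIZATION)
print(f"ingress {ingress_gbps:.0f} Gbps, egress {egress_gbps:.0f} Gbps, "
      f"~{sfu_nodes:.0f} SFU nodes")
```

The point is not the exact node count but the shape: egress dominates, and it grows with average call size, which is exactly why the SFU-vs-MCU-vs-P2P choice below matters.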
Media Transport — P2P vs SFU vs MCU
"""
Three architectures for multi-party video:
P2P (Peer-to-Peer):
Each participant sends to every other participant directly.
Upload bandwidth: O(n-1) streams per participant.
Pros: lowest latency, no server in the media path, no media-server cost
Cons: upload overwhelms connection at n > 4; each client decodes n-1 streams
Use: 1:1 calls, small group (2-3 people)
MCU (Multipoint Control Unit):
Server receives all streams, DECODES, mixes into single composite video,
then RE-ENCODES and sends one stream to each participant.
Each participant uploads 1 stream, downloads 1 stream.
Pros: constant bandwidth regardless of participant count
Cons: server does HUGE CPU work (decode + mix + encode); adds 200-400ms latency
Use: PSTN/SIP gateway interop, legacy systems
SFU (Selective Forwarding Unit):
Server receives all streams, FORWARDS (no decode/re-encode) selected streams.
Each participant uploads 1 stream, downloads up to k streams (active speakers).
Pros: server-side is lightweight (just routing); low latency; scalable
Cons: each client decodes multiple streams; bandwidth = k * avg_bitrate
Use: Zoom, Google Meet, Discord, and essentially every modern conferencing product
Zoom actually uses a hybrid:
- Small meetings: pure SFU
- Large meetings (500+): hierarchical SFU tree (edge SFUs → root SFU)
- Webinars: one-to-many SFU + CDN for recording delivery
"""
WebRTC and Signaling
"""
WebRTC: browser/native API for real-time media.
Key components:
- RTCPeerConnection: manages ICE, DTLS, RTP/RTCP
- getUserMedia: capture camera/microphone
- RTCDataChannel: arbitrary data (chat, file transfer)
ICE (Interactive Connectivity Establishment):
Finds the best network path between peers:
1. Host candidates: local IP addresses
2. Server-reflexive: public IP discovered via STUN server
3. Relay candidates: traffic via TURN server (fallback)
STUN (Session Traversal Utilities for NAT):
Client asks a STUN server "what is my public IP?"
STUN is cheap: a single request/response, no media traffic.
TURN (Traversal Using Relays around NAT):
When direct connection fails (symmetric NAT, firewall),
TURN server relays ALL media traffic.
Expensive: CPU + bandwidth cost on TURN server.
Only ~10-15% of calls actually need TURN.
Signaling (WebRTC itself has no signaling protocol):
Exchange of SDP (Session Description Protocol) offers/answers
and ICE candidates via your signaling server.
Transport: WebSocket (low latency) or HTTP long-polling.
"""
```python
import asyncio
import json
import websockets


class SignalingServer:
    def __init__(self):
        self.rooms = {}  # room_id -> {peer_id -> websocket}

    async def handle_client(self, websocket, path):
        # 'path' is unused; some websockets versions omit this argument.
        peer_id = None
        room_id = None
        try:
            async for raw in websocket:
                msg = json.loads(raw)
                msg_type = msg.get("type")
                if msg_type == "join":
                    peer_id = msg["peer_id"]
                    room_id = msg["room_id"]
                    self.rooms.setdefault(room_id, {})[peer_id] = websocket
                    # Notify existing peers
                    for pid, ws in self.rooms[room_id].items():
                        if pid != peer_id:
                            await ws.send(json.dumps({
                                "type": "peer_joined", "peer_id": peer_id
                            }))
                elif msg_type in ("offer", "answer", "ice_candidate"):
                    # Forward the signaling message to the target peer
                    target_id = msg.get("target_peer_id")
                    target_ws = self.rooms.get(room_id, {}).get(target_id)
                    if target_ws:
                        await target_ws.send(json.dumps({
                            **msg, "from_peer_id": peer_id
                        }))
        finally:
            if room_id and peer_id:
                room = self.rooms.get(room_id, {})
                room.pop(peer_id, None)
                if not room:
                    self.rooms.pop(room_id, None)  # drop empty rooms
                # Notify the room of the departure
                for ws in room.values():
                    await ws.send(json.dumps({
                        "type": "peer_left", "peer_id": peer_id
                    }))
```
SFU Media Server Architecture
```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class MediaTrack:
    track_id: str
    kind: str  # "audio" | "video" | "screen"
    peer_id: str
    room_id: str
    simulcast_layers: List[str] = field(
        default_factory=lambda: ["high", "medium", "low"])


@dataclass
class Participant:
    peer_id: str
    room_id: str
    published_tracks: Dict[str, MediaTrack] = field(default_factory=dict)
    subscribed_tracks: Set[str] = field(default_factory=set)
    network_score: float = 1.0  # 0-1: network quality estimate
    is_speaking: bool = False


class SFURouter:
    """
    SFU: selectively forwards tracks from publishers to subscribers.

    Key optimizations:
    1. Simulcast: publishers send 3 quality layers; the SFU forwards the appropriate one
    2. Active speaker detection: only forward video for the current speakers
    3. Bandwidth estimation: adjust the subscribed layer per receiver's quality
    """

    MAX_VISIBLE_VIDEO_STREAMS = 25  # Cap to avoid client-side decode overload

    def __init__(self):
        self.rooms: Dict[str, Dict[str, Participant]] = {}

    def get_subscription_layer(self, subscriber: Participant,
                               track: MediaTrack) -> str:
        """Choose which simulcast layer to forward, based on subscriber quality."""
        if subscriber.network_score > 0.8:
            return "high"
        if subscriber.network_score > 0.5:
            return "medium"
        return "low"

    def get_active_speakers(self, room_id: str, n: int = 5) -> List[str]:
        """Return up to n peers currently speaking (a production SFU would
        rank these by recent audio level rather than dict order)."""
        room = self.rooms.get(room_id, {})
        return [pid for pid, p in room.items() if p.is_speaking][:n]

    def compute_subscription_plan(self, room_id: str, subscriber_id: str) -> dict:
        """
        Determine which tracks the subscriber should receive.
        Audio: subscribe to all (audio is cheap).
        Video: active speakers + screen shares, capped at MAX_VISIBLE_VIDEO_STREAMS.
        """
        room = self.rooms.get(room_id, {})
        subscriber = room.get(subscriber_id)
        if not subscriber:
            return {}
        plan = {}
        active_speakers = set(self.get_active_speakers(room_id))
        video_count = 0
        for peer_id, participant in room.items():
            if peer_id == subscriber_id:
                continue
            for track_id, track in participant.published_tracks.items():
                if track.kind == "audio":
                    plan[track_id] = "high"  # Always subscribe to audio
                elif track.kind in ("video", "screen"):
                    if video_count >= self.MAX_VISIBLE_VIDEO_STREAMS:
                        continue
                    if peer_id in active_speakers or track.kind == "screen":
                        plan[track_id] = self.get_subscription_layer(subscriber, track)
                        video_count += 1
        return plan
```
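For experimentation, the layer-selection and speaker-capping rules can be condensed into a few self-contained functions (same logic as the router above, minus the dataclasses):

```python
# Condensed subscription-plan logic: pick a simulcast layer from a 0-1
# network score and cap the number of forwarded video streams.
def pick_layer(network_score: float) -> str:
    if network_score > 0.8:
        return "high"
    if network_score > 0.5:
        return "medium"
    return "low"

def plan(tracks, network_score, active, max_videos=25):
    """tracks: list of (track_id, kind, peer_id). Returns {track_id: layer}."""
    out, videos = {}, 0
    for track_id, kind, peer_id in tracks:
        if kind == "audio":
            out[track_id] = "high"  # audio is always forwarded
        elif videos < max_videos and (peer_id in active or kind == "screen"):
            out[track_id] = pick_layer(network_score)
            videos += 1
    return out

tracks = [("a1", "audio", "p1"), ("v1", "video", "p1"),
          ("v2", "video", "p2"), ("s1", "screen", "p3")]
# p2 is silent and not screen-sharing, so v2 is never forwarded.
print(plan(tracks, network_score=0.6, active={"p1"}))
# → {'a1': 'high', 'v1': 'medium', 's1': 'medium'}
```

This is the essence of an SFU: the expensive decision is *which* packets to forward at *which* layer, not any media processing.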
Recording Architecture
"""
Recording options:
Client-side recording:
MediaRecorder API in browser → upload chunks to S3.
Pros: no server in media path.
Cons: fails if browser crashes; limited format support.
Server-side bot recording:
Headless Chrome/Electron bot joins as a fake participant.
Bot records WebRTC output, muxes to MP4, uploads to S3.
Pros: reliable; captures composited view.
Cons: expensive (full browser instance per recording).
SFU-side recording:
SFU saves incoming RTP streams to disk.
Post-process: mux separate audio/video tracks into MP4.
Pros: most efficient; no additional participants.
Cons: complex post-processing; requires per-track sync.
Pipeline:
SFU → save RTP dumps → Kafka → transcoding workers (FFmpeg)
→ MP4 output → S3 → CDN delivery link sent to participants
"""
Handling Large-Scale Calls (1000+ Participants)
"""
Strategies for very large calls (webinars, all-hands):
1. Hierarchical SFU tree
Region SFU (receives from local participants)
→ Root SFU (aggregates for cross-region delivery)
Reduces transcontinental hops for each stream.
2. One-to-many delivery via CDN
Panelists: WebRTC (low latency, bidirectional)
Audience: HLS/DASH via CDN (5-30s latency, scales to millions)
This is how YouTube Live and Twitch work.
3. Active speaker video switching
At 1000 participants, show only 1-4 active speakers.
Everyone else: audio only or thumbnail.
Server detects speaking via the RTP audio-level header extension (RFC 6464), so it never decodes the audio.
4. Selective subscription
Each viewer subscribes to at most k video streams.
Server decides which k to send based on screen layout.
5. Rate limiting and QoS
Prioritize audio packets over video (audio < 50kbps).
Video: use FEC (Forward Error Correction) for packet loss.
At congestion, drop video frames before dropping audio.
"""
Key Design Decisions
| Decision | Choice | Rationale |
|---|---|---|
| Media architecture | SFU | Scales to hundreds of participants; the server only routes packets |
| Protocol | WebRTC (SRTP/DTLS) | Browser-native, encrypted, ICE NAT traversal |
| Signaling | WebSocket | Low-latency bidirectional; needed for ICE candidates |
| Video quality | Simulcast + bandwidth estimation | Adaptive to each receiver network quality |
| Large calls | HLS for audience + WebRTC for panelists | CDN scales to millions; WebRTC for interactive |
| Recording | SFU-side RTP dump + transcoding | Most efficient; no extra client/server overhead |
| Region routing | GeoDNS to nearest media gateway | Minimize latency; ICE finds optimal path within region |
Frequently Asked Questions
What is the difference between SFU, MCU, and P2P in video conferencing?
P2P connects participants directly (best for 2-3 people, no server needed). MCU decodes and re-encodes all streams on the server into one composite (high server cost, constant bandwidth). SFU forwards streams selectively without re-encoding (most scalable, used by Zoom and Google Meet).
What protocol does Zoom use for video transmission?
Zoom uses WebRTC with RTP/SRTP for media transport and DTLS for encryption. For signaling (SDP exchange and ICE candidate negotiation), they use their own WebSocket-based protocol. STUN/TURN servers handle NAT traversal.
How does Zoom handle 1000-person calls?
Large calls use a hierarchical SFU tree: regional SFUs aggregate local participants and feed into a root SFU. For webinars, panelists use WebRTC while the audience receives HLS/DASH streams via CDN, dramatically reducing server load.
What is the ICE protocol in WebRTC?
ICE (Interactive Connectivity Establishment) finds the best network path between peers. It tries direct connection first (host candidates), then discovers public IP via STUN, and falls back to relaying all traffic through a TURN server if direct connection is blocked by NAT or firewalls.