System Design Interview: Video Conferencing (Zoom / Google Meet)

Designing a video conferencing system like Zoom or Google Meet is a complex system design problem that tests real-time media protocols, WebRTC architecture, signaling, and large-scale media routing. This question appears in interviews at Zoom, Google, Microsoft (Teams), and Twitch.

Requirements Clarification

  • Scale: 300M daily meeting participants, 1M concurrent calls, up to 1000 participants per call
  • Latency: End-to-end audio/video latency under 150ms for a real-time feel; up to 300ms tolerable
  • Quality: Adaptive bitrate based on network conditions; prioritize audio over video on poor connections
  • Features: HD video, screen sharing, chat, recording, live captions, breakout rooms
  • Reliability: Reconnect within 2s on transient failures; no data loss for chat/recording
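
To ground these numbers, a rough back-of-envelope estimate of SFU egress bandwidth helps. All per-call figures below (average call size, bitrates) are assumptions for illustration, not Zoom's published numbers:

```python
# Back-of-envelope SFU egress estimate. All per-call figures are assumptions.
AVG_PARTICIPANTS_PER_CALL = 5
CONCURRENT_CALLS = 1_000_000
VIDEO_BITRATE_KBPS = 600        # medium simulcast layer
AUDIO_BITRATE_KBPS = 40         # Opus voice

# Each participant uploads one stream; the SFU forwards (n - 1) to each viewer.
uplink_streams = CONCURRENT_CALLS * AVG_PARTICIPANTS_PER_CALL
downlink_streams = uplink_streams * (AVG_PARTICIPANTS_PER_CALL - 1)

per_stream_kbps = VIDEO_BITRATE_KBPS + AUDIO_BITRATE_KBPS
egress_gbps = downlink_streams * per_stream_kbps / 1e6
print(f"SFU egress: ~{egress_gbps:,.0f} Gbps")   # ~12,800 Gbps aggregate
```

Aggregate egress on the order of terabits per second is why the media fleet must be spread across many regional servers rather than a central cluster.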

Media Transport — P2P vs SFU vs MCU

"""
Three architectures for multi-party video:

P2P (Peer-to-Peer):
  Each participant sends to every other participant directly.
  Upload bandwidth: O(n-1) streams per participant.
  Pros: lowest latency, no server in media path, zero cost
  Cons: upload overwhelms connection at n > 4; each client decodes n-1 streams
  Use: 1:1 calls, small group (2-3 people)

MCU (Multipoint Control Unit):
  Server receives all streams, DECODES, mixes into single composite video,
  then RE-ENCODES and sends one stream to each participant.
  Each participant uploads 1 stream, downloads 1 stream.
  Pros: constant bandwidth regardless of participant count
  Cons: server does HUGE CPU work (decode + mix + encode); adds 200-400ms latency
  Use: PSTN/SIP gateway interop, legacy systems

SFU (Selective Forwarding Unit):
  Server receives all streams, FORWARDS (no decode/re-encode) selected streams.
  Each participant uploads 1 stream, downloads up to k streams (active speakers).
  Pros: server-side is lightweight (just routing); low latency; scalable
  Cons: each client decodes multiple streams; bandwidth = k * avg_bitrate
  Use: Zoom, Google Meet, Discord — effectively every modern conferencing product

Zoom actually uses a hybrid:
  - Small meetings: pure SFU
  - Large meetings (500+): hierarchical SFU tree (edge SFUs → root SFU)
  - Webinars: one-to-many SFU + CDN for recording delivery
"""

WebRTC and Signaling

"""
WebRTC: browser/native API for real-time media.
  Key components:
    - RTCPeerConnection: manages ICE, DTLS, RTP/RTCP
    - getUserMedia: capture camera/microphone
    - RTCDataChannel: arbitrary data (chat, file transfer)

ICE (Interactive Connectivity Establishment):
  Finds the best network path between peers:
  1. Host candidates:    local IP addresses
  2. Server-reflexive:  public IP discovered via STUN server
  3. Relay candidates:  traffic via TURN server (fallback)

STUN (Session Traversal Utilities for NAT):
  Client asks STUN server "what is my public IP?"
  STUN is cheap — just lookup, no media traffic.

TURN (Traversal Using Relays around NAT):
  When direct connection fails (symmetric NAT, firewall),
  TURN server relays ALL media traffic.
  Expensive: CPU + bandwidth cost on TURN server.
  Only ~10-15% of calls actually need TURN.

Signaling (WebRTC itself has no signaling protocol):
  Exchange of SDP (Session Description Protocol) offers/answers
  and ICE candidates via your signaling server.
  Transport: WebSocket (low latency), or HTTP long-poll.
"""

import asyncio
import json
import websockets

class SignalingServer:
    def __init__(self):
        self.rooms = {}     # room_id -> {peer_id -> websocket}

    async def handle_client(self, websocket, path=None):
        # `path` is kept for older websockets versions; newer ones (>= 11)
        # pass only the connection object.
        peer_id = None
        room_id = None
        try:
            async for raw in websocket:
                try:
                    msg = json.loads(raw)
                except json.JSONDecodeError:
                    continue    # Ignore malformed frames
                msg_type = msg.get("type")

                if msg_type == "join":
                    peer_id = msg["peer_id"]
                    room_id = msg["room_id"]
                    room = self.rooms.setdefault(room_id, {})
                    # Notify existing peers before adding the newcomer
                    for ws in room.values():
                        await ws.send(json.dumps({
                            "type": "peer_joined", "peer_id": peer_id
                        }))
                    room[peer_id] = websocket

                elif msg_type in ("offer", "answer", "ice_candidate"):
                    # Forward signaling message to the target peer only
                    target_id = msg.get("target_peer_id")
                    target_ws = self.rooms.get(room_id, {}).get(target_id)
                    if target_ws:
                        await target_ws.send(json.dumps({
                            **msg, "from_peer_id": peer_id
                        }))

        finally:
            if room_id and peer_id:
                room = self.rooms.get(room_id, {})
                room.pop(peer_id, None)
                if not room:
                    self.rooms.pop(room_id, None)   # Drop empty rooms
                # Notify remaining peers of the departure
                for ws in room.values():
                    await ws.send(json.dumps({
                        "type": "peer_left", "peer_id": peer_id
                    }))

SFU Media Server Architecture

from dataclasses import dataclass, field
from typing import Dict, Set, Optional, List
import asyncio

@dataclass
class MediaTrack:
    track_id:    str
    kind:        str   # "audio" | "video" | "screen"
    peer_id:     str
    room_id:     str
    simulcast_layers: List[str] = field(default_factory=lambda: ["high", "medium", "low"])

@dataclass
class Participant:
    peer_id:         str
    room_id:         str
    published_tracks: Dict[str, MediaTrack] = field(default_factory=dict)
    subscribed_tracks: Set[str] = field(default_factory=set)
    network_score:   float = 1.0   # 0-1: network quality estimate
    is_speaking:     bool = False

class SFURouter:
    """
    SFU: selectively forwards tracks from publishers to subscribers.
    Key optimizations:
      1. Simulcast: publishers send 3 quality layers; SFU sends appropriate layer
      2. Active speaker detection: only forward N most recent speakers
      3. Bandwidth estimation: adjust subscribed layer per receiver quality
    """
    MAX_VISIBLE_VIDEO_STREAMS = 25  # Cap to avoid client CPU overload

    def __init__(self):
        self.rooms: Dict[str, Dict[str, Participant]] = {}

    def get_subscription_layers(self, subscriber: Participant, track: MediaTrack) -> str:
        """Choose which simulcast layer to forward based on subscriber quality."""
        if subscriber.network_score > 0.8:
            return "high"
        elif subscriber.network_score > 0.5:
            return "medium"
        return "low"

    def get_active_speakers(self, room_id: str, n: int = 5) -> List[str]:
        """Return peer IDs of up to n currently-speaking participants."""
        room = self.rooms.get(room_id, {})
        return [pid for pid, p in room.items() if p.is_speaking][:n]

    def compute_subscription_plan(self, room_id: str, subscriber_id: str) -> dict:
        """
        Determine which tracks subscriber should receive.
        Audio: subscribe to all (audio is cheap)
        Video: subscribe to active speakers + pinned + up to MAX_VISIBLE_STREAMS
        """
        room = self.rooms.get(room_id, {})
        subscriber = room.get(subscriber_id)
        if not subscriber:
            return {}

        plan = {}
        active_speakers = set(self.get_active_speakers(room_id))
        video_count = 0

        for peer_id, participant in room.items():
            if peer_id == subscriber_id:
                continue
            for track_id, track in participant.published_tracks.items():
                if track.kind == "audio":
                    plan[track_id] = "high"  # Always subscribe to audio
                elif track.kind in ("video", "screen"):
                    if video_count >= self.MAX_VISIBLE_VIDEO_STREAMS:
                        continue
                    if peer_id in active_speakers or track.kind == "screen":
                        layer = self.get_subscription_layers(subscriber, track)
                        plan[track_id] = layer
                        video_count += 1

        return plan
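
The `network_score` that drives layer selection comes from a bandwidth estimator. Below is a minimal loss-based rate controller loosely following the Google Congestion Control draft; the thresholds and factors are the draft's recommended values, but the function is a deliberate simplification that omits the delay-based half of the algorithm:

```python
def update_bitrate(current_bps: float, loss_fraction: float) -> float:
    """Loss-based send-rate control, loosely after the GCC draft:
    back off multiplicatively on heavy loss, probe upward when loss is low."""
    if loss_fraction > 0.10:
        return current_bps * (1 - 0.5 * loss_fraction)   # decrease on >10% loss
    if loss_fraction < 0.02:
        return current_bps * 1.05                         # probe up on <2% loss
    return current_bps                                    # hold in the 2-10% band
```

Each receiver reports loss via RTCP receiver reports; the SFU runs an estimator like this per subscriber and maps the resulting rate onto a simulcast layer (`high`/`medium`/`low`).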

Recording Architecture

"""
Recording options:

Client-side recording:
  MediaRecorder API in browser → upload chunks to S3.
  Pros: no server in media path.
  Cons: fails if browser crashes; limited format support.

Server-side bot recording:
  Headless Chrome/Electron bot joins as a fake participant.
  Bot records WebRTC output, muxes to MP4, uploads to S3.
  Pros: reliable; captures composited view.
  Cons: expensive (full browser instance per recording).

SFU-side recording:
  SFU saves incoming RTP streams to disk.
  Post-process: mux separate audio/video tracks into MP4.
  Pros: most efficient; no additional participants.
  Cons: complex post-processing; requires per-track sync.

Pipeline:
  SFU → save RTP dumps → Kafka → transcoding workers (FFmpeg)
       → MP4 output → S3 → CDN delivery link sent to participants
"""

Handling Large-Scale Calls (1000+ Participants)

"""
Strategies for very large calls (webinars, all-hands):

1. Hierarchical SFU tree
   Region SFU (receives from local participants)
     → Root SFU (aggregates for cross-region delivery)
   Reduces transcontinental hops for each stream.

2. One-to-many delivery via CDN
   Panelists: WebRTC (low latency, bidirectional)
   Audience:  HLS/DASH via CDN (5-30s latency, scales to millions)
   This is how YouTube Live and Twitch work.

3. Active speaker video switching
   At 1000 participants, show only 1-4 active speakers.
   Everyone else: audio only or thumbnail.
   Server detects speaking via audio level from RTP header.

4. Selective subscription
   Each viewer subscribes to at most k video streams.
   Server decides which k to send based on screen layout.

5. Rate limiting and QoS
   Prioritize audio packets over video (audio < 50kbps).
   Video: use FEC (Forward Error Correction) for packet loss.
   At congestion, drop video frames before dropping audio.
"""

Key Design Decisions

Decision            | Choice                                  | Rationale
--------------------|-----------------------------------------|------------------------------------------------------------
Media architecture  | SFU                                     | Scales to hundreds of participants; server only routes
Protocol            | WebRTC (SRTP/DTLS)                      | Browser-native, encrypted, ICE NAT traversal
Signaling           | WebSocket                               | Low-latency bidirectional; needed for ICE candidates
Video quality       | Simulcast + bandwidth estimation        | Adapts to each receiver's network quality
Large calls         | HLS for audience + WebRTC for panelists | CDN scales to millions; WebRTC stays interactive
Recording           | SFU-side RTP dump + transcoding         | Most efficient; no extra client or server participant
Region routing      | GeoDNS to nearest media gateway         | Minimizes latency; ICE finds the best path within a region


Frequently Asked Questions

What is the difference between SFU, MCU, and P2P in video conferencing?

P2P connects participants directly (best for 2-3 people, no server needed). MCU decodes and re-encodes all streams on the server into one composite (high server cost, constant bandwidth). SFU forwards streams selectively without re-encoding (most scalable, used by Zoom and Google Meet).

What protocol does Zoom use for video transmission?

Zoom's web client is built on WebRTC, using RTP/SRTP for media transport with DTLS key exchange; its native clients use a proprietary UDP-based media protocol with comparable encryption. Signaling (SDP exchange and ICE candidate negotiation) runs over a WebSocket-based protocol, and STUN/TURN servers handle NAT traversal.

How does Zoom handle 1000-person calls?

Large calls use a hierarchical SFU tree: regional SFUs aggregate local participants and feed into a root SFU. For webinars, panelists use WebRTC while the audience receives HLS/DASH streams via CDN, dramatically reducing server load.

What is the ICE protocol in WebRTC?

ICE (Interactive Connectivity Establishment) finds the best network path between peers. It tries direct connection first (host candidates), then discovers public IP via STUN, and falls back to relaying all traffic through a TURN server if direct connection is blocked by NAT or firewalls.

