Design Mobile Video Conferencing: Zoom, FaceTime, and WebRTC

Mobile video conferencing is one of the most demanding mobile system design problems. The interview tests whether you can design a system that handles real-time audio/video over unreliable networks, balances bandwidth across participants, and keeps the call going through cellular handoffs and screen locks.

Functional requirements

  • 1:1 and group video calls (50+ participants)
  • Audio and video streams
  • Screen sharing
  • Chat sidebar
  • Mute, camera toggle, virtual backgrounds
  • Background mode (audio-only)

Non-functional

  • Sub-200ms p99 audio latency
  • Smooth video at variable bitrates
  • Low battery drain — 1 hour call should not drain more than 15-20% of battery
  • Survives network changes (WiFi to LTE handoff)

Architecture

Two main models:

  • P2P (peer-to-peer): direct connection between two devices. WebRTC handles media transport and NAT traversal; signaling is left to the application. Used for 1:1 (FaceTime).
  • SFU (Selective Forwarding Unit): all clients send/receive through a central server which forwards streams. Used for groups (Zoom).
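The scaling difference between the two models comes down to per-client stream counts. A minimal sketch (function names are illustrative):

```python
def mesh_streams(n: int) -> tuple[int, int]:
    """Per-client (uplink, downlink) stream counts in a full P2P mesh:
    every client encodes and uploads a copy for each of the other n-1 peers."""
    return (n - 1, n - 1)

def sfu_streams(n: int) -> tuple[int, int]:
    """Per-client (uplink, downlink) stream counts via an SFU:
    one upload; the server fans it out to the other participants."""
    return (1, n - 1)
```

At n = 10, mesh already demands 9 simultaneous uploads per phone — well beyond a typical cellular uplink — while the SFU keeps every client at a single encode.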

Codecs

Audio: Opus (industry standard, 6–510 kbps adaptive).

Video: H.264 (universal compatibility), VP9, or AV1 (better compression, more CPU-expensive).

Modern apps negotiate the best codec available on both ends. Simulcast (sending multiple resolutions) is common for groups so the SFU can pick what to forward to each viewer based on their downlink.
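The SFU's per-viewer choice can be sketched as picking the highest simulcast layer that fits each viewer's estimated downlink. The layer ladder and the 80% headroom factor below are assumptions for illustration, not real Zoom/WebRTC constants:

```python
# Illustrative simulcast ladder: (label, bitrate in kbps).
LAYERS = [
    ("180p", 150),
    ("360p", 500),
    ("720p", 1500),
]

def pick_layer(downlink_kbps: float, headroom: float = 0.8) -> tuple[str, int]:
    """Forward the highest layer that fits within a fraction of the viewer's
    downlink estimate; fall back to the lowest layer otherwise."""
    budget = downlink_kbps * headroom
    best = LAYERS[0]
    for layer in LAYERS:
        if layer[1] <= budget:
            best = layer
    return best
```

Because the sender encodes all layers up front, the SFU can switch a struggling viewer down without asking the sender to re-encode.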

Adaptive bitrate

WebRTC continuously measures network conditions:

  • Packet loss
  • RTT (round-trip time)
  • Available bandwidth

The sender adjusts its encoder bitrate accordingly. If bandwidth drops, video resolution drops first; audio is preserved at all costs.
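The loss-based side of this control loop can be sketched as multiplicative backoff under heavy loss and slow upward probing on a clean link. The thresholds and factors below are in the spirit of WebRTC's congestion control but are illustrative, not the real constants:

```python
def adapt_bitrate(current_kbps: float, loss_fraction: float,
                  min_kbps: float = 100.0, max_kbps: float = 2500.0) -> float:
    """One step of a loss-based rate controller (illustrative constants)."""
    if loss_fraction > 0.10:
        # Heavy loss: back off in proportion to how bad it is.
        new = current_kbps * (1 - 0.5 * loss_fraction)
    elif loss_fraction < 0.02:
        # Clean link: probe upward a few percent per interval.
        new = current_kbps * 1.05
    else:
        # Moderate loss: hold steady.
        new = current_kbps
    return max(min_kbps, min(max_kbps, new))
```

Running this every feedback interval converges toward the available bandwidth without oscillating wildly, which is exactly why audio (tiny, constant bitrate) survives long after video has been squeezed down.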

Network handoff

If the user moves from WiFi to LTE (or vice versa), the IP address changes and the connection breaks unless ICE (Interactive Connectivity Establishment) renegotiates. WebRTC handles this with ICE restarts — a brief glitch but the call continues.
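The handoff behavior is essentially a small state machine. A toy model (class and method names are hypothetical; a real app would call RTCPeerConnection.restartIce() and renegotiate over its signaling channel):

```python
class CallConnection:
    """Toy model of connection-state handling across a WiFi -> LTE handoff."""

    def __init__(self) -> None:
        self.state = "connected"
        self.ice_restarts = 0

    def on_network_change(self) -> None:
        # The IP changed, so existing ICE candidate pairs are dead.
        self.state = "disconnected"
        self.restart_ice()

    def restart_ice(self) -> None:
        # Gather fresh candidates, renegotiate, and resume media.
        self.ice_restarts += 1
        self.state = "connected"
```

The user experiences the "disconnected" window as the brief glitch mentioned above; the call object itself never tears down.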

Echo cancellation and noise suppression

Native APIs do most of the work — AVAudioEngine on iOS and AudioEffect on Android. For higher quality, ML-based noise suppression (Krisp, RTX Voice, Apple's Voice Isolation) runs on-device.

Background mode

iOS: declare the VoIP background mode. Audio continues; video pauses. PushKit can wake the app for incoming calls.

Android: a foreground service is required for ongoing calls. Its persistent notification keeps the service — and the call — alive.

Battery optimization

  • Use hardware-accelerated encoders (VideoToolbox on iOS, MediaCodec on Android)
  • Lower video resolution when device is hot or battery is low
  • Audio-only fallback when bandwidth is very poor
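The second bullet amounts to a capture-resolution cap driven by device condition. A minimal sketch — the tiers and thresholds are assumptions (platforms expose the inputs via ProcessInfo.thermalState on iOS and PowerManager on Android):

```python
def max_capture_height(battery_pct: float, thermal_state: str) -> int:
    """Cap outgoing video resolution by device condition (illustrative tiers)."""
    if thermal_state in ("serious", "critical"):
        return 360   # device is hot: drop to 360p regardless of battery
    if battery_pct < 20:
        return 540   # low battery: mid resolution
    return 720       # normal conditions
```

Thermal state takes priority over battery level because sustained encoding on a throttling SoC degrades the whole call, not just battery life.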

Frequently Asked Questions

Why does Zoom use a server (SFU) and not pure P2P for groups?

Pure mesh P2P does not scale — each client would need to upload N-1 streams. An SFU centralizes the fan-out and enables simulcast. End-to-end encryption is harder with an SFU in the path, but still possible.

How does FaceTime achieve such low latency?

Heavy codec optimization, P2P when possible, Apple-controlled silicon for hardware acceleration, and integration with the OS for scheduling.

What is the right approach for E2EE in group calls?

SFU forwards encrypted media without decrypting. Each participant has a pairwise key with every other participant (or a group key with rotation on join/leave). Adds complexity but increasingly standard.
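The group-key-with-rotation approach can be sketched as a manager that re-keys on every membership change, so a leaver cannot decrypt future media and a joiner cannot decrypt past media. Key distribution (over pairwise-encrypted channels) is omitted; the class and its names are illustrative:

```python
import os

class GroupKeyManager:
    """Sketch of E2EE group-key rotation on membership change."""

    def __init__(self) -> None:
        self.members: set[str] = set()
        self.epoch = 0
        self.key = os.urandom(32)  # 256-bit media key for the current epoch

    def _rotate(self) -> None:
        self.epoch += 1
        self.key = os.urandom(32)

    def join(self, member: str) -> None:
        self.members.add(member)
        self._rotate()  # joiner must not be able to read earlier media

    def leave(self, member: str) -> None:
        self.members.discard(member)
        self._rotate()  # leaver must not be able to read later media
```

Tagging each encrypted frame with its epoch lets receivers detect when they need the new key, and the SFU forwards those frames without ever holding any key.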
