Low-Level Design: Video Conferencing Service

Overview

A video conferencing service enables real-time audio and video communication between multiple participants. At scale, this involves signaling servers, media servers, WebRTC session negotiation, adaptive streaming, and session management. This low-level design walks through each major subsystem.

Requirements

Functional Requirements

  • Create and join meetings with a unique room ID
  • Real-time audio and video streams between participants
  • Mute/unmute audio and video
  • Screen sharing within a meeting
  • Recording of sessions
  • Chat messages within a meeting
  • Participant management: admit, remove, waiting room

Non-Functional Requirements

  • End-to-end latency under 150ms for media
  • Support 1,000+ concurrent meetings, each with up to 100 participants
  • 99.9% uptime
  • Graceful degradation under poor network conditions

High-Level Architecture

The system has four primary planes:

  1. Signaling Plane — WebSocket-based server that exchanges SDP offers/answers and ICE candidates to set up peer connections.
  2. Media Plane — Selective Forwarding Unit (SFU) that receives encoded media from each participant and routes it to others without decoding.
  3. Control Plane — REST/gRPC services handling room state, participant lists, permissions, and recording triggers.
  4. Storage Plane — Object storage for recordings, a relational DB for room metadata, and Redis for ephemeral session state.

Signaling Server

Signaling bootstraps WebRTC connections. It does not carry media — it only exchanges session descriptions and network candidates.

Protocol

Clients connect via WebSocket. Messages are JSON envelopes:

{
  "type": "join" | "offer" | "answer" | "candidate" | "leave" | "chat" | "participant-update",
  "roomId": string,
  "fromPeerId": string,
  "toPeerId": string | "broadcast",
  "payload": { ... }
}
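A signaling server would normally validate each envelope before routing it. The following is a minimal sketch of that check, assuming every field in the envelope is required and that payload is always a JSON object; the function names parse_envelope and is_broadcast are illustrative and not part of the original design.

```python
import json

# Message types listed in the envelope definition above.
ALLOWED_TYPES = {"join", "offer", "answer", "candidate", "leave", "chat", "participant-update"}
# Assumption: all five envelope fields are mandatory.
REQUIRED_FIELDS = ("type", "roomId", "fromPeerId", "toPeerId", "payload")


def parse_envelope(raw: str) -> dict:
    """Parse and validate one signaling message; raise ValueError on bad input."""
    try:
        msg = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(msg, dict):
        raise ValueError("envelope must be a JSON object")
    missing = [name for name in REQUIRED_FIELDS if name not in msg]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if not isinstance(msg["type"], str) or msg["type"] not in ALLOWED_TYPES:
        raise ValueError(f"unknown type: {msg['type']!r}")
    if not isinstance(msg["payload"], dict):
        raise ValueError("payload must be a JSON object")
    return msg


def is_broadcast(msg: dict) -> bool:
    """True when the message should be fanned out to the whole room."""
    return msg["toPeerId"] == "broadcast"
```

Rejecting malformed messages at the edge keeps the routing code free of defensive checks.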

SDP Negotiation Flow

  1. Caller creates an RTCPeerConnection and calls createOffer().
  2. Caller sets local description and sends the SDP offer to the signaling server.
  3. Signaling server routes the offer to the callee (or SFU).
  4. Callee calls setRemoteDescription(offer), then createAnswer() and setLocalDescription(answer).
  5. Callee sends the SDP answer back through the signaling server, and the caller applies it with setRemoteDescription(answer).
  6. Both sides exchange ICE candidates as they are discovered.
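The offer/answer exchange moves each peer through a small set of signaling states. The sketch below models only the happy-path subset of the WebRTC signalingState values (stable, have-local-offer, have-remote-offer); provisional answers and rollback are deliberately left out, and the class name PeerSignalingState is hypothetical.

```python
# (current state, which description is set, SDP type) -> next state.
# Simplified subset of the WebRTC signaling state machine.
TRANSITIONS = {
    ("stable", "local", "offer"): "have-local-offer",
    ("stable", "remote", "offer"): "have-remote-offer",
    ("have-local-offer", "remote", "answer"): "stable",
    ("have-remote-offer", "local", "answer"): "stable",
}


class PeerSignalingState:
    """Tracks one peer's signaling state during offer/answer."""

    def __init__(self) -> None:
        self.state = "stable"

    def apply(self, side: str, sdp_type: str) -> str:
        """side is 'local' or 'remote'; sdp_type is 'offer' or 'answer'."""
        key = (self.state, side, sdp_type)
        if key not in TRANSITIONS:
            raise ValueError(f"cannot set {side} {sdp_type} in state {self.state}")
        self.state = TRANSITIONS[key]
        return self.state


caller, callee = PeerSignalingState(), PeerSignalingState()
caller.apply("local", "offer")    # steps 1-2: createOffer + setLocalDescription
callee.apply("remote", "offer")   # step 4: setRemoteDescription(offer)
callee.apply("local", "answer")   # step 4: createAnswer + setLocalDescription
caller.apply("remote", "answer")  # step 5: caller applies the answer
```

Both peers end in the stable state, which is when media can start flowing once ICE completes.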

ICE Candidate Trickle

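ICE candidates are sent incrementally as they are gathered (trickle ICE), which shortens connection setup because neither side waits for gathering to finish. A candidate can arrive before the remote description has been applied, and addIceCandidate() rejects it in that state, so early candidates must be buffered (on the receiving client or on the signaling server) and applied once the remote description is set.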
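Candidate buffering can be sketched as a small queue that holds candidates until the remote description is in place. This is an illustrative model only: CandidateBuffer and the apply_candidate callback are hypothetical names, and the callback stands in for whatever actually hands the candidate to the peer connection.

```python
from typing import Callable, List


class CandidateBuffer:
    """Holds trickled ICE candidates until the remote description is set."""

    def __init__(self, apply_candidate: Callable[[dict], None]) -> None:
        self._apply = apply_candidate
        self._pending: List[dict] = []
        self._remote_description_set = False

    def on_candidate(self, candidate: dict) -> None:
        """Apply immediately if possible, otherwise queue in arrival order."""
        if self._remote_description_set:
            self._apply(candidate)
        else:
            self._pending.append(candidate)

    def on_remote_description_set(self) -> None:
        """Flush everything that arrived early, preserving order."""
        self._remote_description_set = True
        pending, self._pending = self._pending, []
        for candidate in pending:
            self._apply(candidate)


applied: List[dict] = []
buf = CandidateBuffer(applied.append)
buf.on_candidate({"candidate": "c1"})   # arrives early, gets queued
buf.on_remote_description_set()         # c1 is applied now
buf.on_candidate({"candidate": "c2"})   # applied immediately
```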

Scalability

Each signaling node holds only the WebSocket connections of its own clients, while room membership is shared state. Use Redis pub/sub so any signaling node can forward a message to the correct WebSocket regardless of which node holds that socket. Sticky sessions at the load balancer are an alternative but complicate horizontal scaling.
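The cross-node routing idea can be illustrated without a real Redis instance. In the sketch below, InMemoryBus is a stand-in for Redis pub/sub, the per-peer channel name "peer:<id>" is an assumed convention, and a plain list plays the role of each WebSocket.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class InMemoryBus:
    """Stand-in for Redis pub/sub: channel name -> subscriber callbacks."""

    def __init__(self) -> None:
        self._subs: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, channel: str, handler: Callable[[dict], None]) -> None:
        self._subs[channel].append(handler)

    def publish(self, channel: str, message: dict) -> int:
        """Deliver to all subscribers; return how many received the message."""
        handlers = list(self._subs.get(channel, []))
        for handler in handlers:
            handler(message)
        return len(handlers)


class SignalingNode:
    """One signaling server; it only knows the sockets attached to itself."""

    def __init__(self, name: str, bus: InMemoryBus) -> None:
        self.name = name
        self.bus = bus

    def attach_peer(self, peer_id: str) -> List[dict]:
        """Register a peer's socket (modeled as an outbox list) on this node."""
        outbox: List[dict] = []
        self.bus.subscribe(f"peer:{peer_id}", outbox.append)
        return outbox

    def send(self, to_peer_id: str, message: dict) -> bool:
        """Route via the bus; works even if the peer is on another node."""
        return self.bus.publish(f"peer:{to_peer_id}", message) > 0


bus = InMemoryBus()
node_a, node_b = SignalingNode("a", bus), SignalingNode("b", bus)
bob_socket = node_b.attach_peer("bob")
delivered = node_a.send("bob", {"type": "offer", "fromPeerId": "alice"})
```

Node A never learns which node holds Bob's socket; it only publishes to Bob's channel.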

WebRTC Peer Connections

STUN and TURN

  • STUN — Allows a client behind NAT to discover its public IP:port (reflexive candidate). Cheap; no media relay needed.
  • TURN — Relays media when NAT traversal fails (symmetric NAT). Expensive in bandwidth. Deploy TURN servers in multiple regions. Use short-lived TURN credentials generated server-side to prevent abuse.
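Short-lived TURN credentials are commonly derived from a secret shared between the application server and the TURN server: the username embeds an expiry timestamp and the password is an HMAC over that username. The sketch below assumes that convention (HMAC-SHA1, base64-encoded, "expiry:userId" usernames); the function names are hypothetical, and the exact format should be checked against the TURN server actually deployed.

```python
import base64
import hashlib
import hmac
import time
from typing import Optional


def _sign(shared_secret: str, username: str) -> str:
    digest = hmac.new(shared_secret.encode(), username.encode(), hashlib.sha1).digest()
    return base64.b64encode(digest).decode()


def make_turn_credentials(shared_secret: str, user_id: str,
                          ttl_seconds: int = 3600, now: Optional[float] = None) -> dict:
    """Issue a username/credential pair that expires after ttl_seconds."""
    issued = time.time() if now is None else now
    username = f"{int(issued) + ttl_seconds}:{user_id}"
    return {"username": username, "credential": _sign(shared_secret, username), "ttl": ttl_seconds}


def verify_turn_credentials(shared_secret: str, username: str, credential: str,
                            now: Optional[float] = None) -> bool:
    """What the TURN side would check: expiry not passed, HMAC matches."""
    expiry, _, _ = username.partition(":")
    if not expiry.isdigit():
        return False
    current = time.time() if now is None else now
    if int(expiry) < current:
        return False
    return hmac.compare_digest(_sign(shared_secret, username), credential)
```

Because the TURN server can recompute the HMAC itself, no per-user credential store is needed on the relay.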

Codec Negotiation

SDP includes a prioritized codec list. Prefer VP9 or AV1 for video (better compression) and Opus for audio, keeping VP8 or H.264 as a fallback because WebRTC browsers are required to support both. The SFU and endpoints negotiate a common codec. Simulcast lets a client send the same video at multiple resolutions; the SFU selects which layer to forward to each subscriber.

Data Channels

RTCDataChannel carries low-latency non-media data: chat messages, cursor positions during screen sharing, reactions. Data channels run SCTP over DTLS, so they are encrypted and, when bundled, share the ICE and DTLS transport used by media.

Selective Forwarding Unit (SFU)

An SFU is the heart of a multi-party call. Each participant sends one set of streams to the SFU; the SFU selectively forwards streams to other participants without decoding or re-encoding.

Why SFU over MCU

  • MCU (Multipoint Control Unit) mixes all streams server-side — lower client CPU but very high server CPU and latency.
  • SFU is much cheaper server-side; clients decode multiple streams independently. Scales better and adds less latency.
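The trade-off above can be made concrete by counting streams and codec operations for a room of n participants. This is a rough model under two stated assumptions: the SFU forwards every publisher to every other participant (no layer pruning), and the MCU produces one mixed stream per participant. The function name topology_load is illustrative.

```python
def topology_load(n: int, topology: str) -> dict:
    """Per-client and server-side work for one room of n participants."""
    if n < 2:
        raise ValueError("a call needs at least two participants")
    if topology == "sfu":
        return {
            "client_uplinks": 1,
            "client_downlinks": n - 1,        # one stream per remote participant
            "server_decodes": 0,              # packets are forwarded as-is
            "server_encodes": 0,
            "server_forwarded_streams": n * (n - 1),
        }
    if topology == "mcu":
        return {
            "client_uplinks": 1,
            "client_downlinks": 1,            # a single mixed stream
            "server_decodes": n,              # every input must be decoded
            "server_encodes": n,              # assumption: one mix per participant
            "server_forwarded_streams": n,
        }
    raise ValueError(f"unknown topology: {topology}")


sfu_10 = topology_load(10, "sfu")
mcu_10 = topology_load(10, "mcu")
```

The SFU trades server CPU for client decode work and downlink bandwidth, which is why simulcast matters so much in the next section.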

Simulcast and SVC

With simulcast, a publisher sends three spatial layers (e.g., 1080p, 540p, 180p). The SFU forwards the appropriate layer to each subscriber based on their bandwidth and viewport. Scalable Video Coding (SVC) encodes layers within a single stream; the SFU drops higher enhancement layers for bandwidth-constrained subscribers.
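A per-subscriber layer choice might look like the sketch below. The resolutions come from the example above; the bitrates and the selection rule (climb the ladder while the next layer fits the bandwidth estimate and the current one is still smaller than the viewport) are assumptions for illustration, not a prescribed algorithm.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Layer:
    name: str
    height: int
    bitrate_kbps: int  # assumed typical bitrates, not measured values


LAYERS = [Layer("high", 1080, 2500), Layer("medium", 540, 800), Layer("low", 180, 150)]


def select_layer(available_kbps: int, viewport_height: int,
                 layers: List[Layer] = LAYERS) -> Layer:
    """Pick the smallest layer that covers the viewport and fits the bandwidth."""
    ordered = sorted(layers, key=lambda layer: layer.bitrate_kbps)
    best = ordered[0]  # the lowest layer is always forwarded as a floor
    for layer in ordered[1:]:
        fits_bandwidth = layer.bitrate_kbps <= available_kbps
        still_too_small = best.height < viewport_height
        if fits_bandwidth and still_too_small:
            best = layer
        else:
            break
    return best
```

For example, a subscriber showing a 180-pixel-high thumbnail gets the low layer even on a fast link, while a constrained subscriber in full-screen view is capped by bandwidth.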

SFU Data Structures

Room {
  roomId: string
  participants: Map<peerId, Participant>
  sfuNode: string
}

Participant {
  peerId: string
  userId: string
  sendTransport: WebRtcTransport
  recvTransports: Map<peerId, WebRtcTransport>
  producers: Map<trackId, Producer>   // outgoing media
  consumers: Map<trackId, Consumer>   // incoming media
  simulcastLayers: [high, medium, low]
}

SFU Cluster

Each SFU node handles a set of rooms. A coordinator (backed by Redis or etcd) assigns rooms to SFU nodes. When a node is overloaded, new rooms are assigned elsewhere. For very large meetings (>50 participants), a cascaded SFU topology is used: participants connect to a regional SFU, and regional SFUs forward to a central SFU that fans streams out to the other regions.
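Room placement by the coordinator can be sketched as a least-loaded assignment with a per-node capacity. This in-memory version is a simplification: a real coordinator would keep this state in Redis or etcd, and the class name SfuCoordinator, the capacity unit (participants), and the tie-break by node name are all assumptions.

```python
from typing import Dict, Tuple


class SfuCoordinator:
    """Assigns each room to the least-loaded SFU node that has room for it."""

    def __init__(self, capacity_per_node: int) -> None:
        self.capacity = capacity_per_node
        self.load: Dict[str, int] = {}               # node -> reserved participants
        self.rooms: Dict[str, Tuple[str, int]] = {}  # roomId -> (node, reserved)

    def register_node(self, node: str) -> None:
        self.load.setdefault(node, 0)

    def assign_room(self, room_id: str, expected_participants: int) -> str:
        if room_id in self.rooms:                    # idempotent for repeat joins
            return self.rooms[room_id][0]
        fits = [node for node, used in self.load.items()
                if used + expected_participants <= self.capacity]
        if not fits:
            raise RuntimeError("no SFU node has capacity for this room")
        node = min(fits, key=lambda name: (self.load[name], name))
        self.load[node] += expected_participants
        self.rooms[room_id] = (node, expected_participants)
        return node

    def release_room(self, room_id: str) -> None:
        node, reserved = self.rooms.pop(room_id)
        self.load[node] -= reserved
```

Reserving capacity at assignment time, rather than counting live participants, avoids overcommitting a node while a large meeting is still filling up.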

Adaptive Bitrate (ABR)

ABR dynamically adjusts video quality based on network conditions.

REMB and TWCC

  • REMB (Receiver Estimated Maximum Bitrate): The receiver estimates the available bandwidth and reports it over RTCP; the sender adjusts its bitrate accordingly.
  • TWCC (Transport-Wide Congestion Control): Every packet carries a transport-wide sequence number and the receiver reports per-packet arrival times, so the sender’s congestion control algorithm (e.g., Google Congestion Control / GCC) can estimate packet loss and delay variation itself.
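To show how feedback turns into a bitrate, here is a sketch of a loss-based update rule in the style described for Google Congestion Control: back off when loss exceeds 10%, probe upward by 5% when loss is under 2%, and hold otherwise. The thresholds follow the published GCC description as best understood here and should be treated as assumptions; the delay-based half of GCC is omitted, and the clamp bounds are illustrative.

```python
def loss_based_target(current_bps: int, loss_fraction: float,
                      min_bps: int = 100_000, max_bps: int = 5_000_000) -> int:
    """Return the next target send bitrate given the reported loss fraction."""
    if not 0.0 <= loss_fraction <= 1.0:
        raise ValueError("loss_fraction must be between 0 and 1")
    if loss_fraction > 0.10:
        target = current_bps * (1.0 - 0.5 * loss_fraction)  # multiplicative decrease
    elif loss_fraction < 0.02:
        target = current_bps * 1.05                         # gentle probe upward
    else:
        target = float(current_bps)                         # hold steady
    return int(round(min(max_bps, max(min_bps, target))))
```

In a full controller the final rate would be the minimum of this loss-based value and the delay-based estimate.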

Layer Switching

The SFU switches simulcast layers based on subscriber feedback. Because each simulcast layer is an independent encoding, a switch must start on a keyframe of the target layer to avoid visual artifacts. Rather than wait for the next periodic keyframe, the SFU can request one from the publisher via RTCP PLI (Picture Loss Indication).
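The keyframe-gated switch can be sketched as a tiny per-subscriber state holder: it keeps forwarding the current layer, asks for a keyframe on the target layer, and flips only when that keyframe arrives. LayerSwitcher and the request_keyframe callback are hypothetical names; the callback stands in for sending a PLI to the publisher.

```python
from typing import Callable, Optional


class LayerSwitcher:
    """Decides, per incoming packet, whether to forward it to one subscriber."""

    def __init__(self, current: str, request_keyframe: Callable[[str], None]) -> None:
        self.current = current
        self.target: Optional[str] = None
        self._request_keyframe = request_keyframe

    def request_switch(self, layer: str) -> None:
        """Start a switch; the actual change happens on the next keyframe."""
        if layer == self.current:
            self.target = None
            return
        if layer != self.target:
            self.target = layer
            self._request_keyframe(layer)  # e.g. PLI for that layer's stream

    def should_forward(self, packet_layer: str, is_keyframe: bool) -> bool:
        """Flip to the target layer exactly when its keyframe shows up."""
        if self.target is not None and packet_layer == self.target and is_keyframe:
            self.current, self.target = self.target, None
        return packet_layer == self.current
```

Until the keyframe arrives the subscriber keeps receiving the old layer, so the picture never freezes or shows decode artifacts during the switch.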

Recording

Approaches

  1. Composite recording: A headless browser joins the meeting, renders the meeting UI, and records it via screen capture. Simple, but CPU-intensive and limited in layout flexibility.
  2. SFU-based recording: The SFU’s consumer tracks are piped to a recording service. Each participant’s stream is stored as a separate track, then composited post-call. More flexible; supports custom layouts.

Pipeline

SFU Consumer -> RTP forwarder -> FFmpeg/GStreamer -> Muxer (WebM/MP4) -> Object Storage (S3)

Use a message queue (Kafka or SQS) to trigger post-processing: transcoding to multiple formats, thumbnail generation, and uploading to CDN for playback.

Participant Management

Room State Machine

SCHEDULED -> WAITING -> ACTIVE -> ENDED
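Enforcing this lifecycle in code keeps a room from, say, returning to ACTIVE after it has ended. The sketch below encodes only the linear chain shown above; any extra transitions (such as cancelling a scheduled meeting) would be additions to the design, and the function name transition is illustrative.

```python
from enum import Enum


class RoomState(Enum):
    SCHEDULED = "scheduled"
    WAITING = "waiting"
    ACTIVE = "active"
    ENDED = "ended"


# Only the forward transitions from the state machine above are allowed.
ALLOWED = {
    RoomState.SCHEDULED: {RoomState.WAITING},
    RoomState.WAITING: {RoomState.ACTIVE},
    RoomState.ACTIVE: {RoomState.ENDED},
    RoomState.ENDED: set(),
}


def transition(current: RoomState, nxt: RoomState) -> RoomState:
    """Return the new state, or raise ValueError for an illegal move."""
    if nxt not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {nxt.name}")
    return nxt
```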

Waiting Room

Participants join a lobby. The host admits them individually or all at once. Lobby state is stored in Redis with a TTL. The signaling server sends a participant-admitted event when the host grants access.
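The lobby behaviour can be modeled in memory to show the TTL and admit flow. Here a dictionary with expiry times stands in for Redis keys with a TTL, the injected clock makes expiry testable, and the peerId field on the participant-admitted event is an assumed payload shape.

```python
import time
from typing import Callable, Dict, List, Optional


class Lobby:
    """Waiting room for one meeting; entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at: Dict[str, float] = {}  # peerId -> expiry time

    def knock(self, peer_id: str) -> None:
        """A participant enters (or refreshes their place in) the lobby."""
        self._expires_at[peer_id] = self._clock() + self._ttl

    def waiting(self) -> List[str]:
        """Peers still waiting, in arrival order, with expired entries dropped."""
        now = self._clock()
        self._expires_at = {p: t for p, t in self._expires_at.items() if t > now}
        return list(self._expires_at)

    def admit(self, peer_id: str) -> Optional[dict]:
        """Host admits one peer; returns the event to send, or None if gone."""
        if peer_id not in self.waiting():
            return None
        del self._expires_at[peer_id]
        return {"type": "participant-admitted", "peerId": peer_id}

    def admit_all(self) -> List[dict]:
        events = (self.admit(peer_id) for peer_id in self.waiting())
        return [event for event in events if event is not None]
```

The TTL means a participant who closes the tab without leaving does not sit in the host's lobby list forever.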

Roles and Permissions

Role: host | co-host | participant | viewer
Permissions per role:
  mute_others, remove_participant, start_recording, share_screen, send_chat

Permissions are enforced both client-side (UI) and server-side (signaling server rejects unauthorized commands).
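Server-side enforcement reduces to a lookup before any privileged command is executed. The role names and permission names below come from the list above, but the assignment of permissions to roles is an assumption made for this sketch; a real deployment would define its own matrix.

```python
# Assumed role -> permission matrix; adjust to the product's actual policy.
PERMISSIONS = {
    "host": {"mute_others", "remove_participant", "start_recording", "share_screen", "send_chat"},
    "co-host": {"mute_others", "remove_participant", "share_screen", "send_chat"},
    "participant": {"share_screen", "send_chat"},
    "viewer": set(),
}


def authorize(role: str, permission: str) -> bool:
    """True if the role may perform the action; unknown roles are rejected."""
    if role not in PERMISSIONS:
        raise ValueError(f"unknown role: {role}")
    return permission in PERMISSIONS[role]


def handle_command(role: str, permission: str) -> str:
    """What the signaling server would do with an incoming privileged command."""
    return "accepted" if authorize(role, permission) else "rejected"
```

Client-side checks only hide buttons; this server-side check is the one that actually protects the room.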

Database Schema

meetings(
  meeting_id UUID PK,
  host_user_id BIGINT,
  room_code VARCHAR(12) UNIQUE,
  title VARCHAR(255),
  status ENUM('scheduled','waiting','active','ended'),
  scheduled_at TIMESTAMP,
  started_at TIMESTAMP,
  ended_at TIMESTAMP,
  max_participants INT,
  recording_enabled BOOL
)

participants(
  id BIGINT PK,
  meeting_id UUID FK,
  user_id BIGINT,
  peer_id VARCHAR(64),
  role ENUM('host','co-host','participant','viewer'),
  joined_at TIMESTAMP,
  left_at TIMESTAMP
)

recordings(
  recording_id UUID PK,
  meeting_id UUID FK,
  storage_url TEXT,
  duration_seconds INT,
  size_bytes BIGINT,
  status ENUM('processing','ready','failed'),
  created_at TIMESTAMP
)

Chat and Reactions

In-meeting chat is sent via the signaling server’s WebSocket. Messages are stored in Redis during the meeting and flushed to a relational table on meeting end. Reactions (emoji bursts) are ephemeral — broadcast via WebSocket but not persisted.
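The buffer-then-flush pattern for chat can be sketched with an in-memory buffer standing in for Redis and SQLite standing in for the relational database. The chat_messages table and its columns are hypothetical, since the schema above does not define a chat table.

```python
import sqlite3
from collections import defaultdict
from typing import Dict, List, Tuple


def create_chat_table(conn: sqlite3.Connection) -> None:
    """Hypothetical table; not part of the schema listed earlier."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chat_messages ("
        "meeting_id TEXT, peer_id TEXT, body TEXT, sent_at REAL)"
    )


class ChatBuffer:
    """Holds messages per meeting until the meeting ends."""

    def __init__(self) -> None:
        self._by_meeting: Dict[str, List[Tuple[str, str, str, float]]] = defaultdict(list)

    def append(self, meeting_id: str, peer_id: str, body: str, sent_at: float) -> None:
        self._by_meeting[meeting_id].append((meeting_id, peer_id, body, sent_at))

    def flush(self, meeting_id: str, conn: sqlite3.Connection) -> int:
        """Persist all buffered messages in one transaction, then drop them."""
        rows = self._by_meeting.get(meeting_id, [])
        with conn:  # commits on success, rolls back if the insert fails
            conn.executemany(
                "INSERT INTO chat_messages (meeting_id, peer_id, body, sent_at) "
                "VALUES (?, ?, ?, ?)",
                rows,
            )
        self._by_meeting.pop(meeting_id, None)  # only cleared after a successful write
        return len(rows)
```

Clearing the buffer only after the transaction commits means a failed flush can be retried without losing messages.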

Security

  • All media is encrypted with DTLS-SRTP — this is mandatory in WebRTC.
  • Signaling channel uses WSS (TLS).
  • Room join requires a server-issued token (JWT) tied to the meeting and user.
  • TURN credentials are short-lived HMAC tokens to prevent abuse.
  • End-to-End Encryption (E2EE): The Insertable Streams API (standardized as WebRTC Encoded Transform) allows encrypting encoded media frames in the browser before sending; the SFU forwards ciphertext it cannot decrypt. Keys are exchanged out-of-band through a key server accessible only to participants.
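The room-join token can be sketched as an HS256-signed JWT built with the standard library only. The claim name meeting_id, the five-minute default lifetime, and the function names are assumptions for illustration; production systems would normally use a maintained JWT library and proper key management.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def issue_join_token(secret: bytes, meeting_id: str, user_id: str,
                     ttl_seconds: int = 300, now: Optional[float] = None) -> str:
    """Sign a token that binds one user to one meeting for a short time."""
    issued = int(time.time() if now is None else now)
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {"sub": user_id, "meeting_id": meeting_id, "iat": issued, "exp": issued + ttl_seconds}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode()) for part in (header, claims)
    )
    signature = hmac.new(secret, signing_input.encode(), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(signature)}"


def verify_join_token(secret: bytes, token: str, meeting_id: str,
                      now: Optional[float] = None) -> dict:
    """Return the claims if the token is valid for this meeting, else raise ValueError."""
    try:
        header_b64, claims_b64, sig_b64 = token.split(".")
    except ValueError:
        raise ValueError("malformed token") from None
    expected = hmac.new(secret, f"{header_b64}.{claims_b64}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
        raise ValueError("bad signature")
    if json.loads(_b64url_decode(header_b64)).get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    claims = json.loads(_b64url_decode(claims_b64))
    current = time.time() if now is None else now
    if claims["exp"] <= current:
        raise ValueError("token expired")
    if claims["meeting_id"] != meeting_id:
        raise ValueError("token is for a different meeting")
    return claims
```

The signaling server would run verify_join_token on the first join message and close the socket on failure.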

Failure Handling

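  • Signaling server crash: Media keeps flowing because it does not pass through the signaling server. The client retries the WebSocket connection with backoff and rejoins the room. If the media path also failed, an ICE restart renegotiates network paths without tearing down the peer connection.
  • SFU node failure: The coordinator detects the unhealthy node via missed heartbeats and reassigns its active rooms to another SFU node. Because the transport state (ICE, DTLS, SRTP) on the failed node is lost, clients set up new transports to the new SFU through signaling; an ICE restart alone is not enough.
  • Packet loss: NACK requests retransmission of individual packets. For audio, Opus in-band FEC and redundant encoding (RED, RFC 2198) recover lost packets. For video, ULPFEC or FlexFEC can add forward error correction, and PLI triggers a keyframe when the decoder cannot recover.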
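Client retries after a signaling crash are usually spaced with exponential backoff and jitter, so that thousands of clients do not reconnect in lockstep when a node comes back. The sketch below uses the "full jitter" variant; the base delay, cap, and function name are illustrative choices rather than values from this design.

```python
import random
from typing import Iterator, Optional


def backoff_delays(max_attempts: int, base_seconds: float = 0.5,
                   cap_seconds: float = 30.0,
                   rng: Optional[random.Random] = None) -> Iterator[float]:
    """Yield one wait time per reconnect attempt.

    The ceiling doubles each attempt up to cap_seconds, and the actual
    delay is drawn uniformly from [0, ceiling] (full jitter).
    """
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        ceiling = min(cap_seconds, base_seconds * (2 ** attempt))
        yield rng.uniform(0.0, ceiling)


# Example: the schedule a client would follow for up to 8 reconnect attempts.
delays = list(backoff_delays(8, rng=random.Random(1)))
```

A client would sleep for each delay in turn and stop iterating as soon as the WebSocket reconnects.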

Summary

A production video conferencing service is built around a WebSocket signaling layer for session setup, a scalable SFU cluster for media routing, adaptive bitrate control for quality resilience, and a compositing pipeline for recording. Participant management, permissions, and end-to-end encryption round out a complete system.
