Low Level Design: Video Conferencing Service

Overview

A video conferencing service enables real-time audio and video communication between multiple participants. At scale, this involves signaling servers, media servers, peer-to-peer WebRTC negotiation, adaptive streaming, and session management. This low-level design walks through each major subsystem.

Requirements

Functional Requirements

  • Create and join meetings with a unique room ID
  • Real-time audio and video streams between participants
  • Mute/unmute audio and video
  • Screen sharing within a meeting
  • Recording of sessions
  • Chat messages within a meeting
  • Participant management: admit, remove, waiting room

Non-Functional Requirements

  • End-to-end latency under 150ms for media
  • Support 1,000+ concurrent meetings, each with up to 100 participants
  • 99.9% uptime
  • Graceful degradation under poor network conditions

High-Level Architecture

The system has four primary planes:

  1. Signaling Plane — WebSocket-based server that exchanges SDP offers/answers and ICE candidates to set up peer connections.
  2. Media Plane — Selective Forwarding Unit (SFU) that receives encoded media from each participant and routes it to others without decoding.
  3. Control Plane — REST/gRPC services handling room state, participant lists, permissions, and recording triggers.
  4. Storage Plane — Object storage for recordings, a relational DB for room metadata, and Redis for ephemeral session state.

Signaling Server

Signaling bootstraps WebRTC connections. It does not carry media — it only exchanges session descriptions and network candidates.

Protocol

Clients connect via WebSocket. Messages are JSON envelopes:

{
  type: join | offer | answer | candidate | leave | chat | participant-update,
  roomId: string,
  fromPeerId: string,
  toPeerId: string | broadcast,
  payload: { ... }
}
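This envelope can be sketched as a TypeScript type with a runtime guard that the signaling server applies before routing anything. The names `SignalMessage` and `isSignalMessage` are illustrative, not part of any standard API:

```typescript
type SignalType =
  | "join" | "offer" | "answer" | "candidate"
  | "leave" | "chat" | "participant-update";

interface SignalMessage {
  type: SignalType;
  roomId: string;
  fromPeerId: string;
  toPeerId: string | "broadcast";
  payload: Record<string, unknown>;
}

const SIGNAL_TYPES: ReadonlySet<string> = new Set([
  "join", "offer", "answer", "candidate", "leave", "chat", "participant-update",
]);

// Runtime guard: reject malformed envelopes before routing them.
function isSignalMessage(raw: unknown): raw is SignalMessage {
  if (typeof raw !== "object" || raw === null) return false;
  const m = raw as Record<string, unknown>;
  return (
    typeof m.type === "string" && SIGNAL_TYPES.has(m.type) &&
    typeof m.roomId === "string" &&
    typeof m.fromPeerId === "string" &&
    typeof m.toPeerId === "string" &&
    typeof m.payload === "object" && m.payload !== null
  );
}
```

Validating at the edge keeps malformed or malicious frames from ever reaching the routing layer.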

SDP Negotiation Flow

  1. Caller creates an RTCPeerConnection and calls createOffer().
  2. Caller sets local description and sends the SDP offer to the signaling server.
  3. Signaling server routes the offer to the callee (or SFU).
  4. Callee calls setRemoteDescription(offer), then createAnswer().
  5. Callee sends the SDP answer back through the signaling server.
  6. Both sides exchange ICE candidates as they are discovered.

ICE Candidate Trickle

ICE candidates are sent incrementally as they are gathered (trickle ICE), reducing connection setup time. The signaling server buffers candidates if the remote description has not yet been applied.
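The buffering rule above can be sketched as follows. In a real client, applying a candidate would be a call to `pc.addIceCandidate()`; here it appends to a list so the ordering is visible:

```typescript
// Buffers remote ICE candidates that arrive before the remote
// description is applied, then flushes them in arrival order.
class CandidateBuffer {
  private pending: string[] = [];
  private remoteDescriptionSet = false;
  private applied: string[] = [];   // stands in for pc.addIceCandidate()

  onRemoteCandidate(candidate: string): void {
    if (this.remoteDescriptionSet) {
      this.applied.push(candidate);
    } else {
      this.pending.push(candidate);   // too early: hold it
    }
  }

  onRemoteDescription(): void {
    this.remoteDescriptionSet = true;
    for (const c of this.pending) this.applied.push(c);
    this.pending = [];
  }

  appliedCandidates(): string[] {
    return [...this.applied];
  }
}
```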

Scalability

Signaling servers are stateless per connection but need room state. Use Redis pub/sub so any signaling node can forward a message to the correct WebSocket regardless of which node holds that socket. Sticky sessions via load balancer are an alternative but complicate horizontal scaling.
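A sketch of the cross-node routing idea, with an in-memory pub/sub standing in for Redis and an array standing in for `ws.send()` (class names are illustrative):

```typescript
// Simulates Redis pub/sub fan-out between signaling nodes: each node
// subscribes to a channel per peer it hosts; publishing to a peer's
// channel delivers the message on whichever node holds that socket.
type Handler = (msg: string) => void;

class PubSub {
  private channels = new Map<string, Handler[]>();
  subscribe(channel: string, handler: Handler): void {
    const list = this.channels.get(channel) ?? [];
    list.push(handler);
    this.channels.set(channel, list);
  }
  publish(channel: string, msg: string): void {
    for (const h of this.channels.get(channel) ?? []) h(msg);
  }
}

class SignalingNode {
  readonly delivered: string[] = [];   // stands in for ws.send()
  constructor(private bus: PubSub) {}
  attachPeer(peerId: string): void {
    this.bus.subscribe(`peer:${peerId}`, (msg) => this.delivered.push(msg));
  }
  forward(toPeerId: string, msg: string): void {
    // Any node can publish; only the node holding the socket delivers.
    this.bus.publish(`peer:${toPeerId}`, msg);
  }
}
```

With this pattern, a node never needs to know which of its peers live elsewhere; Redis handles the fan-out.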

WebRTC Peer Connections

STUN and TURN

  • STUN — Allows a client behind NAT to discover its public IP:port (reflexive candidate). Cheap; no media relay needed.
  • TURN — Relays media when NAT traversal fails (symmetric NAT). Expensive in bandwidth. Deploy TURN servers in multiple regions. Use short-lived TURN credentials generated server-side to prevent abuse.
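Short-lived TURN credentials are commonly generated with the shared-secret scheme coturn supports (often called the TURN "REST API" mechanism): the username embeds an expiry timestamp and the password is an HMAC over the username, so the TURN server can verify credentials without a database lookup. A sketch, with the secret and TTL as deployment choices and `nowSeconds` passed in for determinism:

```typescript
import { createHmac } from "node:crypto";

// Time-limited TURN credentials: username = "<expiry>:<userId>",
// credential = base64(HMAC-SHA1(sharedSecret, username)). The TURN
// server recomputes the HMAC and rejects expired usernames.
function turnCredentials(
  userId: string,
  sharedSecret: string,
  ttlSeconds: number,
  nowSeconds: number,
): { username: string; credential: string; expiry: number } {
  const expiry = nowSeconds + ttlSeconds;
  const username = `${expiry}:${userId}`;
  const credential = createHmac("sha1", sharedSecret)
    .update(username)
    .digest("base64");
  return { username, credential, expiry };
}
```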

Codec Negotiation

SDP includes a prioritized codec list. Prefer VP9 or AV1 for video (better compression), Opus for audio. The SFU and endpoints negotiate a common codec. Simulcast allows a client to send multiple resolutions; the SFU selects which layer to forward per subscriber.

Data Channels

RTCDataChannel carries low-latency non-media data: chat messages, cursor positions during screen sharing, reactions. Data channels use SCTP over DTLS, so they share the same encrypted transport as media.

Selective Forwarding Unit (SFU)

An SFU is the heart of a multi-party call. Each participant sends one set of streams to the SFU; the SFU selectively forwards streams to other participants without decoding or re-encoding.

Why SFU over MCU

  • MCU (Multipoint Control Unit) mixes all streams server-side — lower client CPU but very high server CPU and latency.
  • SFU is much cheaper server-side; clients decode multiple streams independently. Scales better and adds less latency.

Simulcast and SVC

With simulcast, a publisher sends three spatial layers (e.g., 1080p, 540p, 180p). The SFU forwards the appropriate layer to each subscriber based on their bandwidth and viewport. Scalable Video Coding (SVC) encodes layers within a single stream; the SFU drops higher enhancement layers for bandwidth-constrained subscribers.
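The per-subscriber layer choice reduces to a small pure function: pick the highest layer that fits in the subscriber's bandwidth estimate with some headroom, and fall back to the lowest layer so the subscriber always receives something. Layer names and the 0.85 headroom factor are illustrative:

```typescript
interface SimulcastLayer {
  rid: string;            // layer id, e.g. "high" | "medium" | "low"
  bitrateBps: number;     // approximate encoded bitrate of the layer
}

// Highest layer that fits within the bandwidth estimate, with headroom.
function selectLayer(
  layers: SimulcastLayer[],
  estimatedBps: number,
  headroom = 0.85,
): SimulcastLayer {
  const sorted = [...layers].sort((a, b) => b.bitrateBps - a.bitrateBps);
  for (const layer of sorted) {
    if (layer.bitrateBps <= estimatedBps * headroom) return layer;
  }
  return sorted[sorted.length - 1];   // lowest layer as floor
}
```

A real SFU would also factor in the subscriber's viewport size (a thumbnail tile never needs the 1080p layer).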

SFU Data Structures

Room {
  roomId: string
  participants: Map<peerId, Participant>
  sfuNode: string
}

Participant {
  peerId: string
  userId: string
  sendTransport: WebRtcTransport
  recvTransports: Map<peerId, WebRtcTransport>
  producers: Map<trackId, Producer>   // outgoing media
  consumers: Map<trackId, Consumer>   // incoming media
  simulcastLayers: [high, medium, low]
}

SFU Cluster

Each SFU node handles a set of rooms. A coordinator (backed by Redis or etcd) assigns rooms to SFU nodes. When a node is overloaded, new rooms are assigned elsewhere. For very large meetings (>50 participants) a cascade SFU topology is used: regional SFUs forward to a central SFU which fans out globally.

Adaptive Bitrate (ABR)

ABR dynamically adjusts video quality based on network conditions.

REMB and TWCC

  • REMB (Receiver Estimated Maximum Bitrate) — Receiver sends back an estimated available bandwidth. Sender adjusts bitrate accordingly.
  • TWCC (Transport-Wide Congestion Control) — Transport-wide sequence numbers let the sender’s congestion control algorithm (e.g., Google Congestion Control / GCC) estimate packet loss and delay variation.
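A simplified sketch of the loss-based half of GCC: back off proportionally under heavy loss, probe upward gently when loss is negligible, hold otherwise. The 2%/10% thresholds follow the GCC draft; the min/max bounds are illustrative (the delay-based estimator, which GCC combines with this, is omitted):

```typescript
// Loss-based sender rate control in the spirit of Google Congestion
// Control. fractionLost is the RTCP-reported loss fraction (0..1).
function nextBitrate(
  currentBps: number,
  fractionLost: number,
  minBps = 100_000,
  maxBps = 4_000_000,
): number {
  let next = currentBps;
  if (fractionLost > 0.10) {
    next = currentBps * (1 - 0.5 * fractionLost);   // multiplicative decrease
  } else if (fractionLost < 0.02) {
    next = currentBps * 1.05;                        // gentle upward probe
  }
  // Otherwise hold: loss in (2%, 10%] suggests the link is near capacity.
  return Math.min(maxBps, Math.max(minBps, Math.round(next)));
}
```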

Layer Switching

The SFU switches simulcast layers based on subscriber feedback. Switching happens on keyframe boundaries to avoid visual artifacts. The SFU can request a keyframe from the publisher via RTCP PLI (Picture Loss Indication) when switching to a higher layer.

Recording

Approaches

  1. Composite recording — A headless browser renders the meeting UI and records via screen capture. Simple but CPU-intensive and limited layout flexibility.
  2. SFU-based recording — The SFU’s consumer tracks are piped to a recording service. Each participant’s stream is stored as a separate track, then composited post-call. More flexible; supports custom layouts.

Pipeline

SFU Consumer -> RTP forwarder -> FFmpeg/GStreamer -> Muxer (WebM/MP4) -> Object Storage (S3)

Use a message queue (Kafka or SQS) to trigger post-processing: transcoding to multiple formats, thumbnail generation, and uploading to CDN for playback.

Participant Management

Room State Machine

SCHEDULED -> WAITING -> ACTIVE -> ENDED
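The lifecycle can be enforced with a transition table so illegal moves (e.g. reviving an ended room) are rejected server-side. Allowing SCHEDULED to jump straight to ACTIVE when the waiting room is disabled is an illustrative choice:

```typescript
type RoomState = "SCHEDULED" | "WAITING" | "ACTIVE" | "ENDED";

// Allowed transitions for the room lifecycle above.
const TRANSITIONS: Record<RoomState, RoomState[]> = {
  SCHEDULED: ["WAITING", "ACTIVE"],
  WAITING: ["ACTIVE", "ENDED"],
  ACTIVE: ["ENDED"],
  ENDED: [],
};

function transition(from: RoomState, to: RoomState): RoomState {
  if (!TRANSITIONS[from].includes(to)) {
    throw new Error(`illegal transition ${from} -> ${to}`);
  }
  return to;
}
```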

Waiting Room

Participants join a lobby. The host admits them individually or all at once. Lobby state is stored in Redis with a TTL. The signaling server sends a participant-admitted event when the host grants access.

Roles and Permissions

Role: host | co-host | participant | viewer
Permissions per role:
  mute_others, remove_participant, start_recording, share_screen, send_chat

Permissions are enforced both client-side (UI) and server-side (signaling server rejects unauthorized commands).
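The server-side check reduces to a role-to-permission table consulted before executing any command. The exact matrix below is a product decision and is only illustrative:

```typescript
type Role = "host" | "co-host" | "participant" | "viewer";
type Permission =
  | "mute_others" | "remove_participant" | "start_recording"
  | "share_screen" | "send_chat";

// Illustrative permission matrix; the signaling server consults this
// before executing any command and rejects unauthorized ones.
const ROLE_PERMISSIONS: Record<Role, ReadonlySet<Permission>> = {
  host: new Set<Permission>([
    "mute_others", "remove_participant", "start_recording",
    "share_screen", "send_chat",
  ]),
  "co-host": new Set<Permission>([
    "mute_others", "start_recording", "share_screen", "send_chat",
  ]),
  participant: new Set<Permission>(["share_screen", "send_chat"]),
  viewer: new Set<Permission>(["send_chat"]),
};

function can(role: Role, permission: Permission): boolean {
  return ROLE_PERMISSIONS[role].has(permission);
}
```

The client-side check is purely cosmetic (hiding buttons); the server-side check is the actual security boundary.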

Database Schema

meetings(
  meeting_id UUID PK,
  host_user_id BIGINT,
  room_code VARCHAR(12) UNIQUE,
  title VARCHAR(255),
  status ENUM('scheduled','active','ended'),
  scheduled_at TIMESTAMP,
  started_at TIMESTAMP,
  ended_at TIMESTAMP,
  max_participants INT,
  recording_enabled BOOL
)

participants(
  id BIGINT PK,
  meeting_id UUID FK,
  user_id BIGINT,
  peer_id VARCHAR(64),
  role ENUM('host','co-host','participant','viewer'),
  joined_at TIMESTAMP,
  left_at TIMESTAMP
)

recordings(
  recording_id UUID PK,
  meeting_id UUID FK,
  storage_url TEXT,
  duration_seconds INT,
  size_bytes BIGINT,
  status ENUM('processing','ready','failed'),
  created_at TIMESTAMP
)

Chat and Reactions

In-meeting chat is sent via the signaling server’s WebSocket. Messages are stored in Redis during the meeting and flushed to a relational table on meeting end. Reactions (emoji bursts) are ephemeral — broadcast via WebSocket but not persisted.

Security

  • All media is encrypted with DTLS-SRTP — this is mandatory in WebRTC.
  • Signaling channel uses WSS (TLS).
  • Room join requires a server-issued token (JWT) tied to the meeting and user.
  • TURN credentials are short-lived HMAC tokens to prevent abuse.
  • End-to-End Encryption (E2EE): Insertable Streams API allows encrypting media in the browser before sending; the SFU forwards ciphertext. Key exchange happens out-of-band through a key server accessible only to participants.

Failure Handling

  • Signaling server crash — Clients retry the WebSocket connection; the reconnect flow restores room and session state, and an ICE restart renegotiates network paths without tearing down the existing peer connection.
  • SFU node failure — Coordinator detects unhealthy node via heartbeat. Active rooms are migrated to another SFU node; clients perform ICE restart to the new SFU address.
  • Packet loss — NACK requests retransmission. For audio, redundancy (RFC 2198) and FEC (ULPFEC) recover lost packets. For video, PLI triggers a keyframe.
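The receive-side half of the NACK mechanism is gap detection over the 16-bit RTP sequence number space. A sketch that returns the sequence numbers to request, handling wraparound (65535 → 0); late or reordered packets inside the window are simply ignored here for brevity:

```typescript
// Detects gaps in RTP sequence numbers on the receive path and returns
// the sequence numbers to request via NACK. Sequence numbers are 16-bit,
// so "forward" means a jump of less than 2^15 modulo 2^16.
class NackTracker {
  private highest: number | null = null;

  onPacket(seq: number): number[] {
    if (this.highest === null) {
      this.highest = seq;
      return [];
    }
    // Forward distance from the highest seen, modulo 2^16.
    const delta = (seq - this.highest + 65536) % 65536;
    const missing: number[] = [];
    if (delta > 1 && delta < 32768) {            // forward jump: packets lost
      for (let i = 1; i < delta; i++) {
        missing.push((this.highest + i) % 65536);
      }
    }
    if (delta >= 1 && delta < 32768) this.highest = seq;
    return missing;
  }
}
```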

Summary

A production video conferencing service is built around a WebSocket signaling layer for session setup, a scalable SFU cluster for media routing, adaptive bitrate control for quality resilience, and a compositing pipeline for recording. Participant management, permissions, and end-to-end encryption round out a complete system.

