WebRTC Overview
WebRTC (Web Real-Time Communication) is a browser and native SDK standard for peer-to-peer audio and video. It handles media capture, codec negotiation, NAT traversal, encryption, and congestion control — all without requiring a media server for 1:1 calls. The application is responsible only for signaling: exchanging the session descriptions and network candidates that WebRTC needs to establish the peer connection.
WebRTC Peer Connection Internals
A WebRTC peer connection involves several protocol layers:
- ICE (Interactive Connectivity Establishment): discovers network paths between peers, selecting the best available route
- DTLS (Datagram TLS): establishes an encrypted channel over UDP; the DTLS handshake also derives the keys used by SRTP
- SRTP (Secure Real-time Transport Protocol): encrypts media streams using keys negotiated via DTLS
- SDP (Session Description Protocol): describes media capabilities — codecs, bitrates, directions, and network candidates
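SDP is plain text, one `type=value` per line, so its contents are easy to inspect. The sketch below pulls the codec table and ICE candidates out of a description; the sample SDP is a trimmed, illustrative fragment, not a full browser-generated offer.

```javascript
// Minimal SDP inspection: list negotiated codecs (a=rtpmap lines) and
// ICE candidates (a=candidate lines) from a session description.
function summarizeSdp(sdp) {
  const codecs = [];
  const candidates = [];
  for (const line of sdp.split("\n").map((l) => l.trim())) {
    const rtpmap = line.match(/^a=rtpmap:(\d+) ([\w-]+)\/(\d+)/);
    if (rtpmap) {
      codecs.push({ payloadType: +rtpmap[1], name: rtpmap[2], clockRate: +rtpmap[3] });
    }
    if (line.startsWith("a=candidate:")) {
      candidates.push(line.slice("a=candidate:".length));
    }
  }
  return { codecs, candidates };
}

// Trimmed sample offer: one Opus audio section with a single host candidate.
const sampleSdp = [
  "v=0",
  "m=audio 9 UDP/TLS/RTP/SAVPF 111",
  "a=rtpmap:111 opus/48000/2",
  "a=candidate:1 1 udp 2122260223 192.168.1.10 54321 typ host",
].join("\n");

console.log(summarizeSdp(sampleSdp));
```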
Signaling Server
WebRTC does not define a signaling protocol — applications implement their own. A WebSocket-based signaling server relays SDP offers/answers and ICE candidates between peers:
- Caller creates an SDP offer describing its media capabilities
- Offer sent to signaling server → forwarded to callee
- Callee creates SDP answer → sent back via signaling server
- Both peers exchange ICE candidates (network addresses) through signaling
- WebRTC selects the best ICE candidate pair and establishes the peer connection
The signaling server is only in the control plane — it carries no media after connection establishment.
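The relay logic itself is small. A sketch of the routing decision a signaling server makes per message — the message shapes (`type`, `room`, `from`, `to`) are hypothetical, and the WebSocket transport is omitted:

```javascript
// Sketch of signaling-server relay logic: offers/answers/candidates are
// targeted at one peer; join notifications fan out to the rest of the room.
const rooms = new Map(); // roomId -> Set of peerIds

function join(roomId, peerId) {
  if (!rooms.has(roomId)) rooms.set(roomId, new Set());
  rooms.get(roomId).add(peerId);
}

function route(msg) {
  const members = rooms.get(msg.room) ?? new Set();
  switch (msg.type) {
    case "offer":
    case "answer":
    case "ice-candidate":
      // Targeted: deliver only to the addressed peer, if present.
      return members.has(msg.to) ? [msg.to] : [];
    case "peer-joined":
      // Broadcast to everyone in the room except the sender.
      return [...members].filter((id) => id !== msg.from);
    default:
      return [];
  }
}

join("room-1", "alice");
join("room-1", "bob");
console.log(route({ type: "offer", room: "room-1", from: "alice", to: "bob" })); // ["bob"]
console.log(route({ type: "peer-joined", room: "room-1", from: "bob" }));        // ["alice"]
```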
STUN Server
Most clients are behind NAT — their internal IP (192.168.x.x) is not reachable by peers. STUN (RFC 5389) solves this:
- Client sends a request to the STUN server from behind NAT
- STUN server responds with the client's public IP:port as observed at the server
- Client includes this reflexive candidate in its ICE candidate list
STUN is cheap to operate — requests are tiny and stateless. Google's public STUN servers (stun.l.google.com:19302) are commonly used in development.
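The wire format behind this is small: a 20-byte header followed by TLV attributes, with the reflexive address carried in XOR-MAPPED-ADDRESS (XORed with the magic cookie so address-rewriting NATs can't tamper with it). A decoding sketch per RFC 5389 — the response buffer is hand-crafted here to stand in for a real server reply:

```javascript
// Decode the XOR-MAPPED-ADDRESS attribute (type 0x0020) from a STUN
// Binding response, per RFC 5389.
const MAGIC_COOKIE = 0x2112a442;

function parseXorMappedAddress(msg) {
  let off = 20;                          // skip the 20-byte STUN header
  const end = 20 + msg.readUInt16BE(2);  // header length field = attribute bytes
  while (off + 4 <= end) {
    const type = msg.readUInt16BE(off);
    const len = msg.readUInt16BE(off + 2);
    if (type === 0x0020) {
      const port = msg.readUInt16BE(off + 6) ^ (MAGIC_COOKIE >>> 16);
      const ip = (msg.readUInt32BE(off + 8) ^ MAGIC_COOKIE) >>> 0;
      const address = [ip >>> 24, (ip >>> 16) & 0xff, (ip >>> 8) & 0xff, ip & 0xff].join(".");
      return { port, address };
    }
    off += 4 + len + ((4 - (len % 4)) % 4); // attribute values pad to 4 bytes
  }
  return null;
}

// Hand-crafted Binding success response advertising 203.0.113.7:54000.
const resp = Buffer.alloc(32);
resp.writeUInt16BE(0x0101, 0);                             // Binding success response
resp.writeUInt16BE(12, 2);                                 // 12 attribute bytes follow
resp.writeUInt32BE(MAGIC_COOKIE, 4);                       // bytes 8-19: transaction ID (zeroed)
resp.writeUInt16BE(0x0020, 20);                            // attribute: XOR-MAPPED-ADDRESS
resp.writeUInt16BE(8, 22);                                 // value length
resp.writeUInt8(0x01, 25);                                 // family: IPv4
resp.writeUInt16BE(54000 ^ 0x2112, 26);                    // x-port
resp.writeUInt32BE((0xcb007107 ^ MAGIC_COOKIE) >>> 0, 28); // x-address (203.0.113.7)

console.log(parseXorMappedAddress(resp)); // { port: 54000, address: '203.0.113.7' }
```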
TURN Relay
Symmetric NATs defeat STUN-discovered candidates — they map a different external port for each destination, so the address the STUN server observed is useless to the peer. Roughly 8% of real-world connections require a relay. TURN (RFC 5766) allocates a public relay address on the server and forwards all media through it:
- Client authenticates with TURN server and requests an allocation
- TURN server assigns a relay IP:port
- All media packets flow through the TURN server — significant bandwidth cost
- TURN is used only as a fallback after STUN and direct P2P attempts fail
Coturn is the standard open-source TURN server. A relayed stream crosses the server twice (ingress plus egress), so budget roughly 2x the call's media bitrate per relayed stream — but only for the fraction of calls that need it.
P2P vs SFU for Group Calls
1:1 calls work well P2P. Group calls do not scale with pure P2P:
- P2P mesh (N participants): each participant sends N-1 streams and receives N-1 streams. Upload bandwidth scales as O(N). At 5 participants with 1Mbps video each, every participant uploads 4Mbps — unsustainable on typical home connections.
- SFU (Selective Forwarding Unit): each participant sends one stream to the SFU. The SFU forwards each participant's stream to all other participants. Upload per participant stays constant at 1 stream regardless of group size.
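The per-participant arithmetic above, as a back-of-envelope calculator using the 1 Mbps-per-stream figure from the text:

```javascript
// Per-participant bandwidth for an N-person call, mesh vs SFU.
function meshBandwidth(n, streamMbps) {
  // Mesh: send to and receive from every other participant.
  return { uploadMbps: (n - 1) * streamMbps, downloadMbps: (n - 1) * streamMbps };
}

function sfuBandwidth(n, streamMbps) {
  // SFU: one upload to the server; still N-1 downloads (one per remote peer).
  return { uploadMbps: streamMbps, downloadMbps: (n - 1) * streamMbps };
}

console.log(meshBandwidth(5, 1)); // { uploadMbps: 4, downloadMbps: 4 }
console.log(sfuBandwidth(5, 1));  // { uploadMbps: 1, downloadMbps: 4 }
```

Note that downlink still grows with group size under an SFU — which is exactly why the SFU needs simulcast or SVC to send smaller layers to constrained receivers.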
SFU Architecture
The SFU operates at the RTP packet level — it forwards video/audio packets without decoding:
- Simulcast: senders transmit three encodings of the same video at different qualities simultaneously (e.g., 180p/360p/720p). The SFU selects which layer to forward to each recipient based on their available downlink bandwidth.
- Spatial and temporal scalability: SVC (Scalable Video Coding) with VP9/AV1 layers a single encoded stream so the SFU can drop enhancement layers simply by discarding packets — no re-encoding required.
Open-source SFU implementations: mediasoup (Node.js), Janus (C), LiveKit (Go).
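The per-subscriber layer decision can be sketched as a simple policy: forward the highest simulcast layer whose bitrate fits the subscriber's estimated downlink, never dropping below the lowest layer. The layer bitrates and `rid` names here are illustrative, not taken from any particular SFU:

```javascript
// Simulcast layer selection: pick the highest layer that fits the
// subscriber's estimated downlink bandwidth.
const LAYERS = [
  { rid: "q", resolution: "180p", kbps: 150 },
  { rid: "h", resolution: "360p", kbps: 500 },
  { rid: "f", resolution: "720p", kbps: 1500 },
];

function selectLayer(downlinkKbps) {
  let chosen = LAYERS[0]; // floor: never go below the lowest layer
  for (const layer of LAYERS) {
    if (layer.kbps <= downlinkKbps) chosen = layer;
  }
  return chosen;
}

console.log(selectLayer(2000).resolution); // "720p"
console.log(selectLayer(600).resolution);  // "360p"
console.log(selectLayer(100).resolution);  // "180p"
```

A production SFU adds hysteresis (don't flap between layers on small estimate changes) and waits for a keyframe on the new layer before switching.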
Adaptive Bitrate
WebRTC has built-in congestion control: receiver-side measurements of packet loss, jitter, and inter-packet delay drive bandwidth estimation, with feedback flowing back to the sender via RTCP. Two feedback mechanisms exist:
- REMB (Receiver Estimated Maximum Bitrate): receiver estimates available downlink and signals it to the sender
- TWCC (Transport-Wide Congestion Control): sender receives per-packet arrival time reports and runs its own bandwidth estimation (GCC algorithm)
The sender responds by adjusting bitrate — reducing resolution or framerate to fit the available bandwidth. This happens in real time, continuously.
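The loss-based half of GCC can be sketched in a few lines: back off sharply above 10% loss, hold between 2% and 10%, probe upward below 2%. This is a simplified illustration — the real controller also runs a delay-based estimator and takes the minimum of the two, and the exact constants here should be treated as indicative:

```javascript
// Simplified loss-based rate control in the style of WebRTC's GCC.
function updateTargetBitrate(currentKbps, lossRate) {
  if (lossRate > 0.1) {
    return currentKbps * (1 - 0.5 * lossRate); // heavy loss: multiplicative decrease
  }
  if (lossRate < 0.02) {
    return currentKbps * 1.05;                 // clean channel: probe for more
  }
  return currentKbps;                          // moderate loss: hold
}

let rate = 1000; // kbps
rate = updateTargetBitrate(rate, 0.2); // 20% loss -> 900
rate = updateTargetBitrate(rate, 0.0); // clean    -> 945
console.log(rate);
```

The encoder then maps the target bitrate to concrete settings — dropping framerate or resolution — rather than just starving a fixed encode.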
Quality Metrics and Recording
Quality metrics available via the WebRTC Stats API and RTCP reports:
- Packet loss percentage — above 2% is noticeable, above 10% causes freezes
- Jitter — variation in packet arrival times; large jitter degrades audio
- Round-trip time — above 300ms creates perceptible conversational delay
Expose these to the UI as a call quality indicator (good/fair/poor).
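A minimal classifier over those stats, using the thresholds from the list above (2% and 10% loss, 300 ms RTT); the 600 ms "poor" RTT cutoff is an added assumption worth tuning against real call data:

```javascript
// Map raw call stats to a good/fair/poor indicator for the UI.
function callQuality({ lossPct, rttMs }) {
  if (lossPct > 10 || rttMs > 600) return "poor"; // freezes / unusable delay
  if (lossPct > 2 || rttMs > 300) return "fair";  // noticeable degradation
  return "good";
}

console.log(callQuality({ lossPct: 0.5, rttMs: 80 })); // "good"
console.log(callQuality({ lossPct: 4, rttMs: 120 }));  // "fair"
console.log(callQuality({ lossPct: 12, rttMs: 250 })); // "poor"
```

In practice, smooth the inputs over a short window (a few seconds) before classifying, so the indicator doesn't flicker on transient loss spikes.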
Recording: the SFU writes each participant's incoming RTP stream to disk. Post-call, a compositor process combines individual tracks into a mixed recording, transcodes to MP4, and stores in S3. Alternatively, record the mixed stream in real time using GStreamer or FFmpeg connected to the SFU's RTP output.
Screen Sharing and Background Effects
Screen sharing: getDisplayMedia() captures a screen or window as a MediaStream. This is added as a separate video track in the WebRTC peer connection — recipients receive both the camera track and the screen share track simultaneously.
Virtual background / blur: applied client-side before the video track enters WebRTC. Two approaches: WebGL shader for real-time blur of a fixed background region, or MediaPipe Selfie Segmentation for accurate person/background separation. The processed frames replace the raw camera track, so no server-side changes are needed.