Low Level Design: Screen Sharing Service

Overview

Screen sharing transmits a host’s display or application window to remote viewers in real time. The challenge is minimizing latency while managing the large and variable data volume of full-screen video. This design covers frame capture, encoding optimizations, WebRTC transport, and viewer-side rendering.

Requirements

Functional Requirements

  • Share entire screen, a specific application window, or a browser tab
  • Low latency display of the shared content on viewer clients
  • Mouse cursor overlay on the shared stream
  • Annotation tools: draw over the shared screen
  • Remote control (optional): viewers request control of the host’s input
  • Support for sharing within a video conference alongside camera streams

Non-Functional Requirements

  • Glass-to-glass latency under 200ms
  • Handle screen content at up to 4K 60fps on capable connections
  • Graceful quality reduction on constrained bandwidth
  • No perceptible frame tearing on the viewer side

Frame Capture

Browser-Based Capture (getDisplayMedia)

In web clients, navigator.mediaDevices.getDisplayMedia() returns a MediaStream from the OS screen capture API. The browser performs the OS-level capture itself (e.g., Windows Graphics Capture on Windows, ScreenCaptureKit on macOS, PipeWire or XComposite on Linux), and the app receives a video track ready for WebRTC.

Native Client Capture

Native applications use platform APIs directly:

  • Windows — DXGI Desktop Duplication API for GPU-accelerated full-screen capture at 60fps with dirty region tracking.
  • macOS — ScreenCaptureKit (macOS 12.3+) for display and per-window capture; CGWindowListCreateImage on older systems.
  • Linux — PipeWire portal (modern desktops) or X11 XShmGetImage for shared memory transfers.

Dirty Region Tracking

Modern capture APIs report which rectangular regions of the screen changed between frames (dirty rects). This is critical for performance: only changed regions need to be encoded and transmitted. DXGI provides dirty rects and moved rects (scroll/drag operations) natively.

Delta Encoding

Full frames are expensive. Delta encoding transmits only what changed.

Block-Based Delta

Divide the frame into fixed-size blocks (e.g., 16×16 or 64×64 pixels). Compare each block with the previous frame using a hash (xxHash or CRC32 is fast). Only changed blocks are encoded and sent. On the receiver, changed blocks are written into the retained previous-frame buffer to reconstruct the full frame.

Block comparison pseudocode:
for each block B in current_frame:
    hash = xxhash(B.pixels)
    if hash != prev_hash[B.x][B.y]:
        encode_and_send(B)
        prev_hash[B.x][B.y] = hash
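A runnable JavaScript sketch of the same loop, using FNV-1a as a stand-in for xxHash (the function names and RGBA layout are assumptions of this sketch):

```javascript
// FNV-1a: a fast non-cryptographic hash, standing in for xxHash here.
function hashBlock(pixels) {
  let h = 0x811c9dc5;
  for (let i = 0; i < pixels.length; i++) {
    h ^= pixels[i];
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// Compare the current frame against stored per-block hashes; return the
// changed block coordinates and update the hash table in place.
function diffBlocks(frame, width, height, blockSize, prevHashes) {
  const changed = [];
  for (let by = 0; by < height; by += blockSize) {
    for (let bx = 0; bx < width; bx += blockSize) {
      // Gather the block's pixels (RGBA, 4 bytes per pixel).
      const block = [];
      for (let y = by; y < Math.min(by + blockSize, height); y++) {
        const rowStart = (y * width + bx) * 4;
        const rowEnd = (y * width + Math.min(bx + blockSize, width)) * 4;
        for (let i = rowStart; i < rowEnd; i++) block.push(frame[i]);
      }
      const h = hashBlock(block);
      const key = `${bx},${by}`;
      if (prevHashes.get(key) !== h) {
        changed.push({ x: bx, y: by });
        prevHashes.set(key, h);
      }
    }
  }
  return changed;
}
```

In a real pipeline the hash table lives alongside the capture loop, and `changed` feeds the encoder's dirty-region input.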

Video Codec Delta (P-frames)

H.264/VP8/VP9 inter-frame prediction (P-frames and B-frames) is the most efficient delta mechanism for continuous motion. The encoder’s motion estimation and residual coding handle dirty regions automatically. For screen content with text and UI elements, screen-content coding modes (H.264 High 4:4:4 Predictive, VP9 screen content tools) dramatically improve compression of sharp edges and repeated patterns.

Encoding Strategy

Codec Selection

Codec   Strengths                        Screen Content Tool
H.264   Universal hardware support       High 4:4:4 Predictive profile
VP9     Better compression than H.264    screen-content-tools flag
AV1     Best compression, royalty-free   Intra block copy, palette mode

Chroma Subsampling

Standard video uses 4:2:0 chroma subsampling, which blurs color boundaries and makes text look soft. Screen content should use 4:4:4 when possible — it preserves sharp text and UI elements at the cost of roughly 20% higher bitrate.

Quantization for Text

Text and line art need low quantization (high quality) because artifacts are immediately visible. Natural images tolerate higher quantization. Adaptive quantization maps: apply lower QP to regions detected as text (via edge density analysis) and higher QP to background areas.
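A minimal sketch of edge-density-based QP assignment; the thresholds and QP cutoffs below are illustrative assumptions, not tuned values:

```javascript
// Estimate edge density of a grayscale block: the fraction of adjacent
// horizontal pixel pairs whose difference exceeds a threshold.
function edgeDensity(gray, width, height, threshold = 32) {
  let edges = 0, pairs = 0;
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width - 1; x++) {
      const d = Math.abs(gray[y * width + x] - gray[y * width + x + 1]);
      if (d > threshold) edges++;
      pairs++;
    }
  }
  return pairs ? edges / pairs : 0;
}

// Map edge density to a QP: text-like blocks (many sharp edges) get a
// low QP (high quality); smooth background gets a high QP.
function qpForBlock(density) {
  if (density > 0.15) return 18; // likely text or line art
  if (density > 0.05) return 26; // mixed content
  return 34;                     // flat background
}
```

The resulting per-block QP values would feed the encoder's QP map (see Encoder QP Map below in the ROI section).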

WebRTC Transport for Screen Sharing

getDisplayMedia as a Track

The display MediaStream track is added to an existing RTCPeerConnection alongside the camera track. Screen sharing typically uses a separate video transceiver:

const screenStream = await navigator.mediaDevices.getDisplayMedia({video: true});
const screenTrack = screenStream.getVideoTracks()[0];
peerConnection.addTransceiver(screenTrack, {direction: 'sendonly'});

Encoding Parameters for Screen

Screen content benefits from different encoding parameters than camera video:

  • Higher maximum bitrate (screen may have lots of fine detail)
  • A lower frame rate is acceptable for mostly static content (15-30fps, versus a steady 30fps for camera video)
  • Prefer high quality on keyframes since viewers notice compression on static text
  • Set contentHint: 'detail' on the track to signal screen content to the browser encoder
screenTrack.contentHint = 'detail'; // set before frames flow so the encoder picks screen-content tuning
const sender = peerConnection.getSenders().find(s => s.track === screenTrack);
const params = sender.getParameters();
params.encodings[0].maxBitrate = 4000000; // 4 Mbps ceiling for detailed content
params.encodings[0].maxFramerate = 30;
await sender.setParameters(params);

RTCDataChannel for Cursor and Annotations

The cursor position is not part of the video stream — it is sent separately via RTCDataChannel at high frequency (up to 60 times/second). The viewer renders a software cursor overlay, avoiding cursor lag from video encoding delay.

CursorMessage {
  x: float,      // normalized 0.0-1.0
  y: float,
  type: default | pointer | text | grab,
  timestamp: uint64
}
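A sketch of the sending side, throttled to the 60 messages/second ceiling mentioned above; the factory and callback names are assumptions of this sketch:

```javascript
// Throttle cursor updates to a maximum rate (default 60/second) and
// serialize them as JSON matching the CursorMessage shape above.
function makeCursorSender(send, maxHz = 60) {
  const minIntervalMs = 1000 / maxHz;
  let lastSent = -Infinity;
  return function onCursorMove(x, y, type, nowMs) {
    if (nowMs - lastSent < minIntervalMs) return false; // too soon: drop
    lastSent = nowMs;
    send(JSON.stringify({ x, y, type, timestamp: nowMs }));
    return true;
  };
}
```

In the browser this would be wired as `const sendCursor = makeCursorSender(msg => dataChannel.send(msg));` with `nowMs` taken from `performance.now()`.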

Region-of-Interest (ROI) Encoding

Not all areas of the screen are equally important. ROI encoding allocates more bits to active regions.

Activity Detection

Track recent dirty rects. Regions with high recent change rate are marked high-priority. Regions that have been static for N frames receive very low bitrate (or are skipped entirely if content is unchanged).
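One way to sketch this tracker is a per-block activity score with exponential decay; the decay constant and threshold here are illustrative assumptions:

```javascript
// Per-block activity tracker: dirty blocks gain score each frame, all
// scores decay over time. Blocks whose score falls below the threshold
// are treated as static and can receive very low bitrate or be skipped.
class ActivityMap {
  constructor(decay = 0.9) {
    this.decay = decay;
    this.scores = new Map(); // "x,y" -> activity score
  }
  update(dirtyBlocks) {
    for (const [key, s] of this.scores) this.scores.set(key, s * this.decay);
    for (const { x, y } of dirtyBlocks) {
      const key = `${x},${y}`;
      this.scores.set(key, (this.scores.get(key) || 0) + 1);
    }
  }
  isActive(x, y, threshold = 0.5) {
    return (this.scores.get(`${x},${y}`) || 0) >= threshold;
  }
}
```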

Encoder QP Map

H.264 supports per-macroblock QP deltas (via slice group maps or ROI extensions). VP9 supports a per-segment quantizer. AV1 likewise supports per-segment quantizer deltas. Apply lower QP in active regions, higher in static areas.

Latency Optimization

Zero-Copy Capture

GPU-accelerated capture (DXGI on Windows, ScreenCaptureKit with IOSurface-backed frames on macOS) keeps the frame in GPU memory. Hardware video encoders (NVENC, VideoToolbox, Intel QuickSync) consume the GPU texture directly without a CPU round-trip. This is critical for latency and CPU usage.

Low-Latency Encoder Settings

  • Set encoder to low-delay mode (no B-frames, no lookahead)
  • Force intra-refresh instead of full keyframes to avoid large keyframe spikes
  • Use slice-based encoding so partial frames can be sent before the full frame is encoded
  • Reduce encode buffer size to 1 frame (no buffering)

Packetization and Jitter Buffer

RTP packetization must not exceed the path MTU (a payload budget of roughly 1200 bytes leaves room for RTP, SRTP, and transport headers). On the receiver, the jitter buffer absorbs packet reordering and delay variation. For screen sharing, prefer a shorter jitter buffer (50-100ms) than is typical for camera streams, trading a little smoothness for lower latency.
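The chunking step can be sketched as follows; the descriptor field names (`seq`, `total`, `last`) are assumptions of this sketch, not the RTP header format:

```javascript
// Split an encoded frame into chunks that fit within the payload budget
// (~1200 bytes). Each chunk carries a small descriptor so the receiver
// can reassemble the frame in order.
function packetize(payload, maxPayload = 1200) {
  const packets = [];
  const total = Math.ceil(payload.length / maxPayload) || 1;
  for (let i = 0; i < total; i++) {
    packets.push({
      seq: i,
      total,
      last: i === total - 1, // marker bit analogue: end of frame
      data: payload.subarray(i * maxPayload, (i + 1) * maxPayload),
    });
  }
  return packets;
}
```

Real WebRTC stacks do this inside the RTP payloader (e.g., H.264 FU-A fragmentation); this sketch only shows the size arithmetic.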

Viewer-Side Rendering

Decoding

Decoded frames are rendered to a canvas or video element. Hardware decoding (VideoToolbox on macOS, MediaCodec on Android, D3D11VA on Windows) keeps CPU usage low.

Frame Pacing

requestAnimationFrame-based rendering synchronizes decoded frames to the display refresh cycle. Frames arriving faster than display refresh are dropped (taking the most recent). Frames arriving slower result in repeat display of the last frame, not a black flash.
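The drop/repeat policy above reduces to a small piece of pure logic, sketched here with assumed names (the real version would be driven by requestAnimationFrame):

```javascript
// On each display refresh, show the newest decoded frame and drop older
// ones; if nothing new arrived, repeat the previously shown frame.
function makeFramePacer() {
  let pending = [];    // decoded frames waiting to be shown
  let lastShown = null;
  return {
    onDecodedFrame(frame) { pending.push(frame); },
    // Called once per display refresh tick.
    nextFrame() {
      if (pending.length > 0) {
        lastShown = pending[pending.length - 1]; // newest frame wins
        pending = [];                            // older frames are dropped
      }
      return lastShown; // may repeat the last frame, never a black flash
    },
  };
}
```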

Cursor Overlay Rendering

Cursor position from the data channel is composited onto the video canvas using a 2D canvas draw call. The cursor is rendered each animation frame using the most recent cursor message, interpolated if desired.

Annotation Layer

Annotations (draw, highlight, arrow) are vector operations broadcast via data channel or signaling server. Each viewer renders them onto a transparent canvas overlay above the video. No re-encoding of the video stream is needed.

AnnotationEvent {
  type: draw | erase | arrow | highlight | clear,
  points: [{x, y}, ...],
  color: string,
  strokeWidth: number,
  authorPeerId: string,
  timestamp: uint64
}
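A viewer-side sketch of applying these events to an in-memory stroke list before redrawing the overlay canvas; the erase-by-targetTimestamp convention is an assumption of this sketch, not part of the schema above:

```javascript
// Apply one AnnotationEvent to the current stroke list, returning the
// new list (treated immutably so the overlay can diff cheaply).
function applyAnnotation(strokes, event) {
  switch (event.type) {
    case "draw":
    case "arrow":
    case "highlight":
      return strokes.concat([event]); // new stroke appended
    case "erase":
      // Assumed convention: the event identifies the stroke to remove
      // by the timestamp of the original draw event.
      return strokes.filter((s) => s.timestamp !== event.targetTimestamp);
    case "clear":
      return []; // wipe the whole layer
    default:
      return strokes; // unknown events are ignored
  }
}
```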

Remote Control

A viewer requests control. The host approves. The viewer’s mouse/keyboard events are sent via data channel to the host client, which injects them via OS APIs (SendInput on Windows, CGEvent posting on macOS, the XTest extension on Linux). The host’s screen updates are reflected back through the normal screen share stream.

InputEvent {
  type: mousemove | mousedown | mouseup | keydown | keyup,
  x: float,      // normalized coordinates
  y: float,
  button: int,
  keyCode: int,
  modifiers: [shift, ctrl, alt, meta]
}
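The coordinate mapping implied by the normalized x/y fields can be sketched as follows (the helper names are hypothetical):

```javascript
// Viewer side: convert a mouse event on the rendered video element into
// normalized 0.0-1.0 coordinates. rect is the element's bounding box.
function normalizeCoords(clientX, clientY, rect) {
  const x = (clientX - rect.left) / rect.width;
  const y = (clientY - rect.top) / rect.height;
  // Clamp so events just outside the element stay in range.
  return { x: Math.min(1, Math.max(0, x)), y: Math.min(1, Math.max(0, y)) };
}

// Host side: map normalized coordinates back to host screen pixels
// before injecting the input event.
function toHostPixels(norm, hostWidth, hostHeight) {
  return {
    px: Math.round(norm.x * hostWidth),
    py: Math.round(norm.y * hostHeight),
  };
}
```

Normalized coordinates keep the protocol independent of both the viewer’s render size and the host’s resolution, which may differ and may change mid-session.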

Database Schema

screen_share_sessions(
  session_id UUID PK,
  meeting_id UUID FK,
  host_peer_id VARCHAR(64),
  started_at TIMESTAMP,
  ended_at TIMESTAMP,
  resolution_w INT,
  resolution_h INT,
  max_fps INT,
  codec VARCHAR(16)
)

Failure Handling

  • Capture API failure — User revokes screen share permission. The track ends; client fires track.onended. The signaling server notifies viewers that sharing has stopped.
  • Encoder error — Fall back from hardware encoder to software encoder. Log the failure. If software encoder also fails, stop the session and notify the host.
  • Network congestion — Congestion controller reduces bitrate. The encoder reduces resolution or frame rate. Priority is maintaining frame rate over resolution — viewers prefer smooth low-res over choppy high-res.

Summary

Efficient screen sharing requires GPU-accelerated zero-copy capture, dirty-region-aware delta encoding, screen-content-optimized codecs, low-latency encoder settings, and a separate cursor overlay channel. The SFU routes the screen track alongside camera tracks. ROI encoding and adaptive bitrate keep quality high where it matters while respecting bandwidth constraints.


