A session replay service records a user’s browser interactions and reconstructs them for playback — enabling UX debugging, funnel analysis, and support workflows. The design spans client-side capture, event transport, server-side storage, playback, and privacy compliance.
Client-Side Event Capture
The capture SDK runs in the browser and records DOM activity using the following event sources:
- MutationObserver: watches for DOM additions, removals, and attribute changes. Each mutation is serialized as a diff against a virtual DOM snapshot.
- Mouse and touch events: mousemove (throttled to 50ms), click, touchstart, touchend — recorded with coordinates relative to the viewport.
- Scroll events: scroll position captured on the window and scrollable containers, throttled to 100ms.
- Input events: keydown and input on form fields — captured as character-level events but immediately masked for PII (see masking section).
- Network timing: PerformanceObserver on resource and navigation entries for correlating UI events with network activity.
- Console and error events: window.onerror and unhandledrejection captured for error context during replay.
The initial capture also takes a full DOM snapshot using a serialization library (similar to rrweb’s snapshot model), recording the full HTML structure, computed styles, and iframes at session start. Subsequent events are recorded as incremental mutations against this base snapshot.
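The per-event-type throttling described above (50ms for mousemove, 100ms for scroll) can be sketched as a small helper. The class and method names are illustrative, and the clock is injected so the logic can be exercised outside a browser:

```typescript
// Throttle sketch: emit at most one event per `intervalMs` per event type.
// The clock is injected so the logic is testable without a browser.
type Clock = () => number;

class EventThrottler {
  private lastEmit = new Map<string, number>();

  constructor(private intervalMs: number, private now: Clock = Date.now) {}

  // Returns true if the event should be recorded, false if dropped.
  shouldRecord(eventType: string): boolean {
    const t = this.now();
    const last = this.lastEmit.get(eventType);
    if (last !== undefined && t - last < this.intervalMs) return false;
    this.lastEmit.set(eventType, t);
    return true;
  }
}
```

In the SDK this would gate the mousemove listener (50ms interval) and the scroll listener (100ms interval) before events enter the buffer; clicks and mutations bypass it entirely.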
Batched Upload with Compression
Events are buffered in memory and flushed to the server in batches. Flush triggers:
- Buffer reaches 64KB of uncompressed event data
- 5 seconds have elapsed since the last flush
- Page is hidden or unloading (visibilitychange to hidden, with beforeunload as a fallback)
Each batch is serialized to JSON, compressed with pako (zlib/deflate), and uploaded via navigator.sendBeacon (for unload scenarios) or a standard fetch POST. The payload includes a session ID, chunk sequence number, and timestamp range. Compression typically reduces payload size by 70-85% for repetitive DOM event data.
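The flush triggers and compression step can be sketched as follows. This is a Node-side illustration using the standard zlib module, which produces the same deflate stream pako emits in the browser; the `Batch` shape and function names are assumptions, not part of any real SDK:

```typescript
import { deflateSync } from "node:zlib";

interface Batch {
  sessionId: string;
  seq: number;     // chunk sequence number
  tsStart: number; // timestamp range covered by this batch
  tsEnd: number;
  events: unknown[];
}

const MAX_BUFFER_BYTES = 64 * 1024; // flush at 64KB of uncompressed data
const FLUSH_INTERVAL_MS = 5_000;    // or every 5 seconds

// Serialize and deflate a batch. In the browser, pako.deflate over the
// same JSON bytes yields an equivalent zlib stream.
function encodeBatch(batch: Batch): Buffer {
  const json = Buffer.from(JSON.stringify(batch), "utf8");
  return deflateSync(json);
}

// Decide whether the buffer should flush now, mirroring the three triggers.
function shouldFlush(bufferedBytes: number, msSinceFlush: number, hidden: boolean): boolean {
  return bufferedBytes >= MAX_BUFFER_BYTES || msSinceFlush >= FLUSH_INTERVAL_MS || hidden;
}
```

On flush, the SDK would hand the deflated payload to navigator.sendBeacon when the page is unloading and fall back to a fetch POST otherwise.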
Server-Side Event Stream Storage
The ingest API receives event chunks and writes them to object storage (S3-compatible) with the path structure:
sessions/{orgId}/{date}/{sessionId}/chunk_{seq:06d}.bin
Chunks are stored compressed as received. A SessionIndex record in Postgres tracks:
SessionIndex {
  sessionId:   UUID
  orgId:       string
  userId:      string | null
  startedAt:   timestamp
  lastChunkAt: timestamp
  chunkCount:  int
  durationMs:  int
  pageUrl:     string
  userAgent:   string
  tags:        JSON    // e.g., {"plan": "pro", "country": "US"}
  isComplete:  bool
}
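The chunk key layout can be produced with a small helper; the function name is illustrative, but the output matches the `sessions/{orgId}/{date}/{sessionId}/chunk_{seq:06d}.bin` pattern above:

```typescript
// Build the object-storage key for a chunk, matching the
// sessions/{orgId}/{date}/{sessionId}/chunk_{seq:06d}.bin layout.
function chunkKey(orgId: string, date: string, sessionId: string, seq: number): string {
  const padded = String(seq).padStart(6, "0"); // {seq:06d} zero-padding
  return `sessions/${orgId}/${date}/${sessionId}/chunk_${padded}.bin`;
}
```

Zero-padding the sequence number keeps lexicographic listing of a session's prefix in playback order, which lets the playback service fetch chunks with a single ordered LIST call.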
A separate EventMetadata table indexes significant events (rage clicks, errors, URL changes) extracted during ingestion for fast search without replaying the full stream. This is populated asynchronously by an ingestion worker consuming from a Kafka topic that the ingest API publishes to.
Playback Reconstruction Engine
The playback client fetches chunks from a signed URL (pre-signed S3 URLs with a short TTL, generated by the API). Chunks are fetched sequentially and decompressed in-browser. The reconstruction engine:
- Applies the initial DOM snapshot to a sandboxed iframe
- Replays mutation events in sequence, applying them to the iframe’s virtual DOM
- Simulates mouse cursor position, scroll state, and click highlights as overlays
- Advances a timeline scrubber tied to event timestamps
Seeking to a specific timestamp requires replaying all events from the snapshot up to the target time. To avoid full replay on seek, the system periodically saves keyframe snapshots during ingestion (every 30 seconds of session time), stored as separate chunk files. The playback engine picks the nearest preceding keyframe and replays from there.
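Picking the nearest preceding keyframe is a predecessor search over the sorted keyframe timestamps. A minimal sketch, with illustrative names and timestamps expressed as milliseconds from session start:

```typescript
// Given sorted keyframe timestamps (ms from session start), return the
// latest keyframe at or before the seek target. Returning 0 means no
// keyframe precedes the target, so replay starts from the initial snapshot.
function nearestKeyframe(keyframes: number[], targetMs: number): number {
  let lo = 0;
  let hi = keyframes.length - 1;
  let best = 0; // fall back to the initial snapshot at t=0
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (keyframes[mid] <= targetMs) {
      best = keyframes[mid]; // candidate; look for a later one
      lo = mid + 1;
    } else {
      hi = mid - 1;
    }
  }
  return best;
}
```

The playback engine then loads the keyframe chunk for that timestamp and replays only the mutations between the keyframe and the seek target.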
PII Masking
PII masking operates at two levels:
Client-side (before upload): Input events on fields marked with data-replay-mask, type=password, or matching a CSS class blocklist (e.g., .pii-field, .cc-number) have their character content replaced with asterisks before the event is buffered. The field element is still recorded (for layout purposes) but its value is never transmitted.
Server-side (during ingestion): The ingestion worker applies a secondary scrubbing pass using regex patterns for common PII formats (email, SSN, credit card numbers) against text node mutations. Matched content is replaced with a placeholder token before the chunk is written to object storage. This provides defense-in-depth for cases where client-side masking is bypassed or not configured.
Elements with data-replay-block are excluded entirely: they are replaced with a solid rectangle of the same dimensions in both capture and playback.
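The server-side scrubbing pass can be sketched as a sequence of pattern replacements over text-node content. The patterns below are illustrative and intentionally narrow; a real deployment would tune them per region and format:

```typescript
// Secondary PII scrub applied during ingestion, before chunks are written
// to object storage. Patterns are illustrative, not exhaustive.
const PII_PATTERNS: RegExp[] = [
  /[\w.+-]+@[\w-]+\.[\w.-]+/g,                // email addresses
  /\b\d{3}-\d{2}-\d{4}\b/g,                   // US SSNs (dashed form)
  /\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b/g, // 16-digit card numbers
];

function scrubText(text: string, token = "[REDACTED]"): string {
  return PII_PATTERNS.reduce((acc, re) => acc.replace(re, token), text);
}
```

Because the replacement is applied before storage and is not keyed or reversible, replays cannot recover the original values even with full backend access.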
Retention Policy and Data Deletion
Retention is configured per organization (e.g., 30, 90, or 365 days). A nightly RetentionJob queries the SessionIndex for sessions where startedAt < NOW() - retention_days, generates a list of S3 keys from the chunk naming pattern, issues batch deletes to S3, and hard-deletes the SessionIndex and EventMetadata rows. For GDPR right-to-erasure requests, the ErasureService accepts a userId, finds all associated sessions, and runs the same deletion pipeline immediately, logging the erasure event to an audit table.
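The selection step of the RetentionJob reduces to a cutoff comparison over SessionIndex rows. A minimal sketch with illustrative types (the `chunkCount` field is what lets the job derive the full list of chunk keys to batch-delete):

```typescript
interface SessionRow {
  sessionId: string;
  startedAt: number;  // epoch ms
  chunkCount: number; // used to enumerate chunk keys for deletion
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Select sessions past the org's retention window. The same predicate is
// what the nightly job expresses as a SQL WHERE clause against SessionIndex.
function expiredSessions(rows: SessionRow[], retentionDays: number, now: number): SessionRow[] {
  const cutoff = now - retentionDays * DAY_MS;
  return rows.filter((r) => r.startedAt < cutoff);
}
```

The ErasureService reuses the same downstream deletion path, differing only in its input: a userId lookup instead of a time cutoff.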
Session Indexing for Search
The EventMetadata table supports queries like "find all sessions where the user encountered a JS error on the checkout page". Columns indexed:
- orgId + startedAt (composite, for time-range filtering)
- userId (for per-user session lookup)
- pageUrl prefix (using a text_pattern_ops index in Postgres)
- tags (GIN index on JSON column for arbitrary tag filtering)
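The indexes above map to Postgres DDL roughly as follows. Table and column names are assumed from the SessionIndex/EventMetadata sketches, and the tags column is assumed to be jsonb (plain json has no GIN operator class):

```sql
-- Composite index for org-scoped time-range filtering
CREATE INDEX idx_meta_org_time ON event_metadata (org_id, started_at);

-- Per-user session lookup
CREATE INDEX idx_meta_user ON event_metadata (user_id);

-- Left-anchored prefix matching on page URLs,
-- e.g. page_url LIKE 'https://shop.example/checkout%'
CREATE INDEX idx_meta_url ON event_metadata (page_url text_pattern_ops);

-- Arbitrary tag filtering; requires tags to be jsonb, not json
CREATE INDEX idx_meta_tags ON event_metadata USING GIN (tags);
```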
For larger deployments, the session index can be moved to Elasticsearch to support full-text search on page titles, custom event names, and error messages extracted during ingestion.
Key Design Decisions
- Object storage for chunks: cheap, durable, and scales to arbitrary session volume without capacity planning
- Keyframe snapshots every 30 seconds so seek cost is bounded by the keyframe interval rather than growing with session length
- Two-layer PII masking (client + server) to satisfy compliance requirements defensively
- sendBeacon for end-of-session flush to avoid data loss on tab close
- Async EventMetadata extraction via Kafka to keep ingest latency low
Interview tip: the interviewer usually wants to hear about the DOM snapshot + incremental mutation model early — it explains why you don’t just record a video (storage cost, no DOM interactivity on playback) and frames the rest of the design around event delta streams.
Frequently Asked Questions
What is a session replay service and how does it work?
A session replay service records a user’s interactions with a web application — mouse movements, clicks, scroll events, and DOM mutations — and reconstructs a video-like playback for debugging and UX analysis. A lightweight JavaScript snippet captures events using the MutationObserver API and input listeners, serializes them as a structured event stream, and ships batches to an ingestion endpoint. On playback, the recorder’s initial DOM snapshot is restored and events are replayed in timestamp order to reproduce the session.
How do you capture DOM events on the client side without impacting performance?
The recorder uses a single MutationObserver at the document root and event delegation for interaction events rather than per-element listeners, minimizing listener overhead. Captured events are appended to an in-memory ring buffer and flushed to the server in compressed batches on a timer (e.g., every 2–5 seconds) or via requestIdleCallback to avoid blocking the main thread. Throttling is applied to high-frequency events like mousemove and scroll — only positional deltas exceeding a threshold are recorded. The payload is gzip-compressed before transmission to reduce network cost.
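The "positional deltas exceeding a threshold" rule mentioned above can be sketched as a small stateful filter; the function name and the pixel threshold are illustrative:

```typescript
// Record a pointer sample only when it has moved far enough from the
// last recorded position. Threshold is in CSS pixels.
function makeDeltaFilter(minDeltaPx: number) {
  let lastX: number | null = null;
  let lastY: number | null = null;
  return (x: number, y: number): boolean => {
    if (lastX === null || lastY === null ||
        Math.hypot(x - lastX, y - lastY) >= minDeltaPx) {
      lastX = x;
      lastY = y;
      return true; // record this sample
    }
    return false;  // too close to the last recorded point; drop it
  };
}
```

Combined with time-based throttling, this keeps mousemove event volume roughly proportional to actual cursor travel rather than to polling frequency.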
How do you mask PII in session replay recordings?
Masking operates at two layers. Client-side, the recorder respects CSS selectors or data attributes (e.g., data-private) to redact element content before serialization, replacing text nodes with placeholder characters and input values with empty strings. Server-side, an ingestion pipeline applies a secondary scan using pattern matching (regexes for card numbers, emails, SSNs) and ML classifiers to catch any PII that slipped through client-side rules. Masked fields are never stored in plaintext; the masking is irreversible so replays cannot be unmasked even by internal users.
How do you handle data retention and GDPR deletion for session replays?
Recordings are stored with a TTL field set at ingest time (e.g., 90 days for free tier, 1 year for enterprise). A nightly job hard-deletes or overwrites expired records in object storage and removes their index entries. For GDPR right-to-erasure requests, a deletion pipeline accepts a user identifier, looks up all associated session IDs in a user-to-session index, and issues delete operations against the object store and any search indexes. Completion is logged with a timestamp for audit purposes. To support rapid deletion, session metadata is stored separately from the raw event payload so user linkage can be severed without rewriting large blobs.