Clickstream Analytics Pipeline: Low-Level Design
A clickstream analytics pipeline captures every user interaction on a web or mobile product, processes those interactions in real time, and makes the results available for both instant dashboards and deep ad-hoc analysis. The design spans event schema, client-side collection, server-side enrichment, stream processing, and long-term storage.
Event Schema
Every event carries a fixed envelope plus flexible properties:
- event_id — UUID generated client-side, used for deduplication
- user_id — authenticated user identifier (hashed for analytics to protect PII)
- session_id — client-generated UUID reset after 30 minutes of inactivity
- event_type — e.g., page_view, click, add_to_cart, purchase
- page_url and referrer — current page and originating URL
- timestamp — client-side ISO-8601 timestamp in UTC
- properties{} — arbitrary key-value map for event-specific data (product_id, price, etc.)
- device_info{} — browser, OS, screen resolution, language
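To make the envelope concrete, here is a minimal sketch of an event constructor in Python; the field names follow the schema above, while `make_event` and its sample arguments are illustrative, not part of any real SDK:

```python
import uuid
from datetime import datetime, timezone

def make_event(user_id_hash, session_id, event_type, page_url,
               referrer, properties, device_info):
    """Build one event envelope with the fields listed above."""
    return {
        "event_id": str(uuid.uuid4()),   # client-generated, used for dedup
        "user_id": user_id_hash,         # already hashed for analytics
        "session_id": session_id,
        "event_type": event_type,
        "page_url": page_url,
        "referrer": referrer,
        "timestamp": datetime.now(timezone.utc).isoformat(),  # client UTC ISO-8601
        "properties": properties,        # event-specific key-value map
        "device_info": device_info,      # browser, OS, screen, language
    }

event = make_event(
    user_id_hash="a1b2c3d4",            # hypothetical hashed ID
    session_id=str(uuid.uuid4()),
    event_type="add_to_cart",
    page_url="https://shop.example/p/42",
    referrer="https://shop.example/",
    properties={"product_id": 42, "price": 19.99},
    device_info={"browser": "Firefox", "os": "macOS"},
)
```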
Client-Side Collection
A lightweight JavaScript SDK is embedded on every page. It batches events in memory and flushes on a timer (every 5 seconds) or when the batch reaches 50 events. On page unload, it uses the Beacon API (navigator.sendBeacon) to fire the final batch without blocking navigation. The SDK generates event_id locally to enable server-side deduplication.
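The real SDK is JavaScript; the Python sketch below only mirrors the flush policy described above (50-event batches or a 5-second timer), with the network call abstracted behind `flush_fn` so the timer can be tested with an injected clock. All names here are illustrative:

```python
import time

class EventBuffer:
    """Batches events and flushes on size (50) or elapsed time (5 s)."""
    def __init__(self, flush_fn, max_batch=50, flush_interval=5.0,
                 clock=time.monotonic):
        self.flush_fn = flush_fn          # in the browser: fetch or sendBeacon
        self.max_batch = max_batch
        self.flush_interval = flush_interval
        self.clock = clock
        self.buffer = []
        self.last_flush = clock()

    def add(self, event):
        self.buffer.append(event)
        if (len(self.buffer) >= self.max_batch
                or self.clock() - self.last_flush >= self.flush_interval):
            self.flush()

    def flush(self):
        """Also called on page unload (via the Beacon API in the real SDK)."""
        if self.buffer:
            self.flush_fn(self.buffer)
            self.buffer = []
        self.last_flush = self.clock()
```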
Server-Side Enrichment
The collection endpoint receives batches and enriches each event before publishing to Kafka:
- Server timestamp — appended alongside the client timestamp to detect clock skew
- Geo lookup — IP address resolved to country/region/city using MaxMind GeoLite2
- User-agent parsing — browser family, OS, device type extracted server-side
- PII stripping — raw IP is hashed; email or phone in properties is removed before forwarding
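The four enrichment steps can be sketched as a single function; the geo lookup is stubbed here (production would call the GeoLite2 database), the salt value is a placeholder, and `PII_KEYS` assumes PII arrives under known property names:

```python
import hashlib
from datetime import datetime, timezone

PII_KEYS = {"email", "phone"}  # assumed key names for PII in properties

def geo_lookup(ip):
    """Stub; production resolves via MaxMind GeoLite2."""
    return {"country": None, "region": None, "city": None}

def enrich(event, client_ip, salt="server-side-secret"):
    """Append server timestamp, geo, hashed IP; strip PII from properties."""
    enriched = dict(event)
    enriched["server_timestamp"] = datetime.now(timezone.utc).isoformat()
    enriched["geo"] = geo_lookup(client_ip)
    enriched["ip_hash"] = hashlib.sha256((salt + client_ip).encode()).hexdigest()
    enriched["properties"] = {k: v for k, v in event.get("properties", {}).items()
                              if k not in PII_KEYS}
    return enriched
```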
Kafka Ingestion
Enriched events are written to a Kafka topic with partitioning by user_id. Partitioning by user ensures all events for a given user arrive at the same consumer in order, which is essential for session stitching. The topic is retained for 7 days to allow reprocessing.
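Key-based partitioning boils down to a stable hash of user_id modulo the partition count. Kafka's default partitioner uses murmur2 on the record key; this sketch uses MD5 only to show the idea that the same user always maps to the same partition:

```python
import hashlib

def partition_for(user_id: str, num_partitions: int) -> int:
    """Stable user_id -> partition mapping, so one user's events
    stay in order on a single partition (enables session stitching)."""
    digest = hashlib.md5(user_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```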
Stream Processing: Session Stitching
A Flink or Spark Structured Streaming job consumes from Kafka and reconstructs sessions:
- Events are grouped by session_id using a session window with a 30-minute inactivity timeout
- When the session closes, the processor emits a session record: session_id, user_id, start_time, end_time, duration_seconds, pages_per_session, event_count
- Late-arriving events (up to 2 minutes) are accepted and merged into the session via Flink's allowed lateness
Funnel Computation
The stream processor also evaluates ordered conversion funnels defined by a sequence of event types. For example:
visited_product → added_to_cart → checkout → purchased
For each user session, the processor checks whether the required events appear in order within the session window. Conversion rates per funnel step are aggregated and written to a metrics store every minute. Dropped sessions (users who completed step 2 but not step 3) are tracked separately for dropout analysis.
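The in-order step check is a simple scan over the session's event types: advance a cursor whenever the next required step appears. A session's funnel progress is then the cursor's final value, which also identifies the dropout step. A minimal sketch:

```python
FUNNEL = ["visited_product", "added_to_cart", "checkout", "purchased"]

def funnel_progress(session_events, funnel_steps=FUNNEL):
    """Count how many ordered funnel steps a session completed.
    session_events: event_type strings in timestamp order."""
    step = 0
    for ev in session_events:
        if step < len(funnel_steps) and ev == funnel_steps[step]:
            step += 1
    return step  # e.g., 2 means the user dropped before 'checkout'
```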
Real-Time Aggregation
A second stream processor maintains real-time counters in Redis:
- Page views per minute — INCR pv:{page_url}:{minute_bucket} with a 10-minute TTL
- Active users per page — sliding-window HyperLogLog for approximate unique user_id counts
- Event type counts — per event_type per minute for anomaly detection
These Redis counters feed the real-time dashboard with sub-second latency.
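The minute-bucket key in the INCR pattern above is just the UTC timestamp floored to the minute. A sketch of the key derivation (the exact key format is an assumption consistent with the pattern shown):

```python
from datetime import datetime, timezone

def pv_key(page_url: str, ts: float) -> str:
    """Redis key for the per-minute page-view counter."""
    minute = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y%m%d%H%M")
    return f"pv:{page_url}:{minute}"

# With a real redis-py client, the update would be roughly:
#   key = pv_key(url, time.time())
#   r.incr(key)
#   r.expire(key, 600)   # 10-minute TTL
```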
Data Lake Sink
Raw enriched events are written in parallel to S3 or GCS in Parquet format, partitioned by date/hour. Parquet's columnar compression reduces storage cost by 5-10x versus JSON. These files are queryable by Athena (AWS) or BigQuery (GCP) without any ETL, enabling ad-hoc analysis with standard SQL.
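Date/hour partitioning typically uses Hive-style paths so Athena and BigQuery can prune partitions from the path alone. The exact layout below (bucket name, `date=`/`hour=` keys) is an assumed convention, not prescribed above:

```python
from datetime import datetime, timezone

def partition_path(bucket: str, ts: float) -> str:
    """Hive-style date/hour partition prefix for a Parquet file."""
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    return f"s3://{bucket}/events/date={dt:%Y-%m-%d}/hour={dt:%H}/"
```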
Deduplication
Because the SDK retries failed batches and the Beacon API gives no delivery acknowledgment, events can arrive more than once — delivery is effectively at-least-once, so deduplication is critical. The stream processor records recently seen event_id values in Redis with a 1-hour TTL; since Redis sets have no per-member TTL, this is typically one key per event_id (SET dedup:{event_id} 1 NX EX 3600). On arrival, each event is checked: if its event_id is already recorded, the event is dropped. Only new event_ids are processed and recorded.
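The check-and-record step can be modeled without a live Redis instance. This in-memory stand-in mirrors the per-ID TTL semantics (an injectable clock makes expiry testable); in production the same logic is a single atomic Redis command:

```python
import time

class Deduper:
    """In-memory model of TTL-based event_id deduplication.
    Production equivalent: SET dedup:{event_id} 1 NX EX 3600 against Redis."""
    def __init__(self, ttl=3600, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self.seen = {}  # event_id -> expiry time

    def is_new(self, event_id):
        now = self.clock()
        expiry = self.seen.get(event_id)
        if expiry is not None and expiry > now:
            return False              # duplicate within TTL: drop
        self.seen[event_id] = now + self.ttl
        return True                   # first sighting: process
```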
PII Handling
User privacy is enforced at multiple layers:
- user_id is hashed (SHA-256 with a server-side salt) before being written to the analytics pipeline — the raw user_id never enters the data lake
- Raw IP addresses are used only for geo enrichment and then discarded
- Any email or phone number detected in properties{} is stripped by the enrichment layer before Kafka ingestion
- GDPR deletion requests trigger a backfill job that nulls the hashed_user_id in Parquet partitions
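The salted hash and the value-based PII scan might look like the following; the salt literal and the regexes are illustrative (production detection would be more thorough, and the salt would live in a secret store):

```python
import hashlib
import re

SALT = b"rotate-me"  # hypothetical; stored server-side, never shipped to clients

def hash_user_id(user_id: str) -> str:
    """Salted SHA-256, so the raw user_id never enters the data lake."""
    return hashlib.sha256(SALT + user_id.encode()).hexdigest()

# Simple illustrative patterns for detecting PII in property values.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def strip_pii(properties: dict) -> dict:
    """Drop any property whose value looks like an email or phone number."""
    return {k: v for k, v in properties.items()
            if not (isinstance(v, str)
                    and (EMAIL_RE.search(v) or PHONE_RE.search(v)))}
```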
Summary
The full pipeline flows from JS SDK → Beacon API → enrichment service → Kafka → Flink session stitching + funnel computation → Redis real-time counters and S3 Parquet data lake. Deduplication by event_id and PII hashing are enforced at ingestion time. This architecture supports both sub-second operational dashboards and petabyte-scale historical analysis with no schema migration overhead.