Clickstream Analytics Pipeline: Low-Level Design
A clickstream analytics pipeline captures every user interaction on a web or mobile product, processes those interactions in real time, and makes the results available for both instant dashboards and deep ad-hoc analysis. The design spans event schema, client-side collection, server-side enrichment, stream processing, and long-term storage.
Event Schema
Every event carries a fixed envelope plus flexible properties:
- event_id — UUID generated client-side, used for deduplication
- user_id — authenticated user identifier (hashed for analytics to protect PII)
- session_id — client-generated UUID reset after 30 minutes of inactivity
- event_type — e.g.,
page_view,click,add_to_cart,purchase - page_url and referrer — current page and originating URL
- timestamp — client-side ISO-8601 timestamp in UTC
- properties{} — arbitrary key-value map for event-specific data (product_id, price, etc.)
- device_info{} — browser, OS, screen resolution, language
Client-Side Collection
A lightweight JavaScript SDK is embedded on every page. It batches events in memory and flushes on a timer (every 5 seconds) or when the batch reaches 50 events. On page unload, it uses the Beacon API (navigator.sendBeacon) to fire the final batch without blocking navigation. The SDK generates event_id locally to enable server-side deduplication.
Server-Side Enrichment
The collection endpoint receives batches and enriches each event before publishing to Kafka:
- Server timestamp — appended alongside the client timestamp to detect clock skew
- Geo lookup — IP address resolved to country/region/city using MaxMind GeoLite2
- User-agent parsing — browser family, OS, device type extracted server-side
- PII stripping — raw IP is hashed; email or phone in properties is removed before forwarding
Kafka Ingestion
Enriched events are written to a Kafka topic with partitioning by user_id. Partitioning by user ensures all events for a given user arrive at the same consumer in order, which is essential for session stitching. The topic is retained for 7 days to allow reprocessing.
Stream Processing: Session Stitching
A Flink or Spark Structured Streaming job consumes from Kafka and reconstructs sessions:
- Events are grouped by
session_idusing a session window with a 30-minute inactivity timeout - When the session closes, the processor emits a session record:
session_id,user_id,start_time,end_time,duration_seconds,pages_per_session,event_count - Late-arriving events (up to 2 minutes) are accepted and merged into the session via Flink's allowed lateness
Funnel Computation
The stream processor also evaluates ordered conversion funnels defined by a sequence of event types. For example:
visited_product → added_to_cart → checkout → purchased
For each user session, the processor checks whether the required events appear in order within the session window. Conversion rates per funnel step are aggregated and written to a metrics store every minute. Dropped sessions (users who completed step 2 but not step 3) are tracked separately for dropout analysis.
Real-Time Aggregation
A second stream processor maintains real-time counters in Redis:
- Page views per minute —
INCR pv:{page_url}:{minute_bucket}with 10-minute TTL - Active users per page — sliding window HyperLogLog for unique user_id approximation
- Event type counts — per event_type per minute for anomaly detection
These Redis counters feed the real-time dashboard with sub-second latency.
Data Lake Sink
Raw enriched events are written in parallel to S3 or GCS in Parquet format, partitioned by date/hour. Parquet's columnar compression reduces storage cost by 5-10x versus JSON. These files are queryable by Athena (AWS) or BigQuery (GCP) without any ETL, enabling ad-hoc analysis with standard SQL.
Deduplication
Because the Beacon API offers at-least-once delivery and network retries can repeat events, deduplication is critical. The stream processor maintains a Redis set of recently seen event_id values with a 1-hour TTL. On arrival, each event is checked: if the event_id already exists in the set, the event is dropped. Only new event_ids are processed and added to the set.
PII Handling
User privacy is enforced at multiple layers:
user_idis hashed (SHA-256 with a server-side salt) before being written to the analytics pipeline — the raw user_id never enters the data lake- Raw IP addresses are used only for geo enrichment and then discarded
- Any email or phone number detected in
properties{}is stripped by the enrichment layer before Kafka ingestion - GDPR deletion requests trigger a backfill job that nulls the hashed_user_id in Parquet partitions
Summary
The full pipeline flows from JS SDK → Beacon API → enrichment service → Kafka → Flink session stitching + funnel computation → Redis real-time counters and S3 Parquet data lake. Deduplication by event_id and PII hashing are enforced at ingestion time. This architecture supports both sub-second operational dashboards and petabyte-scale historical analysis with no schema migration overhead.
See also: Meta Interview Guide 2026: Facebook, Instagram, WhatsApp Engineering
See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering