Clickstream Analytics Pipeline Low-Level Design: Event Ingestion, Stream Processing, and Session Stitching

A clickstream analytics pipeline captures every user interaction on a web or mobile product, processes those interactions in real time, and makes the results available for both instant dashboards and deep ad-hoc analysis. The design spans event schema, client-side collection, server-side enrichment, stream processing, and long-term storage.

Event Schema

Every event carries a fixed envelope plus flexible properties:

  • event_id — UUID generated client-side, used for deduplication
  • user_id — authenticated user identifier (hashed for analytics to protect PII)
  • session_id — client-generated UUID reset after 30 minutes of inactivity
  • event_type — e.g., page_view, click, add_to_cart, purchase
  • page_url and referrer — current page and originating URL
  • timestamp — client-side ISO-8601 timestamp in UTC
  • properties{} — arbitrary key-value map for event-specific data (product_id, price, etc.)
  • device_info{} — browser, OS, screen resolution, language
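
To make the envelope concrete, here is a minimal Python sketch of the fields listed above. The class name ClickEvent and the validate helper are illustrative assumptions, not part of the original design; only the field names come from the list.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any
import uuid


@dataclass
class ClickEvent:
    """Fixed envelope plus flexible properties, mirroring the field list above."""
    user_id: str
    session_id: str
    event_type: str
    page_url: str
    referrer: str = ""
    # event_id and timestamp are generated when the event is created on the client.
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    properties: dict[str, Any] = field(default_factory=dict)
    device_info: dict[str, str] = field(default_factory=dict)


def validate(event: ClickEvent) -> list[str]:
    """Return a list of problems; an empty list means the envelope is well-formed."""
    problems = []
    for name in ("event_id", "user_id", "session_id", "event_type", "page_url"):
        if not getattr(event, name):
            problems.append(f"missing {name}")
    try:
        parsed = datetime.fromisoformat(event.timestamp)
        if parsed.utcoffset() is None:
            problems.append("timestamp has no UTC offset")
    except ValueError:
        problems.append("timestamp is not ISO-8601")
    return problems
```

A collection endpoint could run a check like validate on each event in a batch and reject malformed envelopes before enrichment.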

Client-Side Collection

A lightweight JavaScript SDK is embedded on every page. It batches events in memory and flushes on a timer (every 5 seconds) or when the batch reaches 50 events, whichever comes first. When the page is hidden or unloaded, it uses the Beacon API (navigator.sendBeacon) to queue the final batch without blocking navigation; listening for visibilitychange is more reliable than unload, which mobile browsers often do not fire. The SDK generates event_id locally to enable server-side deduplication.
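
The size-or-time flush rule can be sketched as follows. The real SDK is JavaScript; Python is used here only to show the logic, and the EventBatcher class, its parameter names, and the injectable clock are assumptions made for illustration.

```python
import time
from typing import Callable


class EventBatcher:
    """Buffers events and flushes on size or elapsed time, whichever comes first."""

    def __init__(self, send: Callable[[list[dict]], None],
                 max_events: int = 50, max_age_seconds: float = 5.0,
                 clock: Callable[[], float] = time.monotonic):
        self._send = send
        self._max_events = max_events
        self._max_age = max_age_seconds
        self._clock = clock
        self._buffer: list[dict] = []
        self._last_flush = clock()

    def add(self, event: dict) -> None:
        self._buffer.append(event)
        if len(self._buffer) >= self._max_events:
            self.flush()

    def tick(self) -> None:
        """Call periodically (the SDK would use a timer); flushes stale batches."""
        if self._buffer and self._clock() - self._last_flush >= self._max_age:
            self.flush()

    def flush(self) -> None:
        """Send whatever is buffered; also the hook for the page-unload path."""
        batch, self._buffer = self._buffer, []
        self._last_flush = self._clock()
        if batch:
            self._send(batch)
```

In the browser, flush would be the method called from the page-hide handler, with send wrapping navigator.sendBeacon.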

Server-Side Enrichment

The collection endpoint receives batches and enriches each event before publishing to Kafka:

  • Server timestamp — appended alongside the client timestamp to detect clock skew
  • Geo lookup — IP address resolved to country/region/city using MaxMind GeoLite2
  • User-agent parsing — browser family, OS, device type extracted server-side
  • PII stripping — raw IP is hashed; email or phone in properties is removed before forwarding
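
A rough sketch of the enrichment step, under several assumptions: the geo lookup is passed in as a callable (standing in for a MaxMind GeoLite2 reader), user-agent parsing is omitted, the output field names server_timestamp, geo, and ip_hash are hypothetical, and the regular expressions are deliberately simple rather than production-grade PII detectors.

```python
import hashlib
import re
from datetime import datetime, timezone
from typing import Callable

# Deliberately simple patterns for illustration; real PII detection needs more care.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def looks_like_pii(value: object) -> bool:
    """True if the value is a string that is entirely an email or phone number."""
    if not isinstance(value, str):
        return False
    text = value.strip()
    return bool(EMAIL_RE.fullmatch(text) or PHONE_RE.fullmatch(text))


def enrich(event: dict, client_ip: str, salt: bytes,
           geo_lookup: Callable[[str], dict]) -> dict:
    """Return an enriched copy of the event; the input dict is not modified."""
    out = dict(event)
    # Server time sits alongside the client timestamp so skew can be measured.
    out["server_timestamp"] = datetime.now(timezone.utc).isoformat()
    out["geo"] = geo_lookup(client_ip)  # country/region/city
    # Only the salted hash of the IP is forwarded, never the raw address.
    out["ip_hash"] = hashlib.sha256(salt + client_ip.encode()).hexdigest()
    out["properties"] = {
        key: value for key, value in event.get("properties", {}).items()
        if not looks_like_pii(value)
    }
    return out
```

Note that a value-based filter like this would also drop a long all-digit product identifier that happens to look like a phone number, so a real deployment would likely combine it with an allowlist of property keys.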

Kafka Ingestion

Enriched events are written to a Kafka topic keyed by user_id. Kafka guarantees ordering only within a partition, so keying by user routes all of a user's events to the same partition, where one consumer in the group reads them in the order they were written. That per-user ordering is essential for session stitching. The topic is retained for 7 days to allow reprocessing.
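
The key-to-partition idea can be illustrated with a stable hash. This is a stand-in only: Kafka's default partitioner hashes the key bytes with murmur2, so real partition numbers will differ from what this MD5-based sketch produces. The function name partition_for is an assumption.

```python
import hashlib


def partition_for(user_id: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash.

    Illustrative only: Kafka's default partitioner uses murmur2 on the key
    bytes, so actual partition numbers will differ from this MD5 stand-in.
    """
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```

Because the mapping depends on the partition count, adding partitions later sends some users to a different partition, which is worth planning for when sizing the topic.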

Stream Processing: Session Stitching

A Flink or Spark Structured Streaming job consumes from Kafka and reconstructs sessions:

  • Events are grouped by session_id using a session window with a 30-minute inactivity timeout
  • When the session closes, the processor emits a session record: session_id, user_id, start_time, end_time, duration_seconds, pages_per_session, event_count
  • Late-arriving events (up to 2 minutes behind the watermark) are accepted and merged into the session via Flink's allowed lateness, or via a watermark delay if Spark Structured Streaming is used
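
The inactivity-gap rule behind a session window can be sketched in plain Python, outside any streaming framework. This batch version sorts one user's events and splits on gaps; it assumes each event is a dict with user_id, event_type, and ts (epoch seconds), and it does not model watermarks or late data. The names SessionRecord and stitch_sessions are illustrative.

```python
from dataclasses import dataclass

SESSION_GAP_SECONDS = 30 * 60


@dataclass
class SessionRecord:
    user_id: str
    start_time: float
    end_time: float
    event_count: int
    page_views: int

    @property
    def duration_seconds(self) -> float:
        return self.end_time - self.start_time


def stitch_sessions(events: list[dict],
                    gap: float = SESSION_GAP_SECONDS) -> list[SessionRecord]:
    """Split one user's events into sessions wherever the idle time exceeds `gap`."""
    sessions: list[SessionRecord] = []
    for ev in sorted(events, key=lambda e: e["ts"]):
        current = sessions[-1] if sessions else None
        if current is None or ev["ts"] - current.end_time > gap:
            # First event, or too long since the previous one: open a new session.
            current = SessionRecord(ev["user_id"], ev["ts"], ev["ts"], 0, 0)
            sessions.append(current)
        current.end_time = ev["ts"]
        current.event_count += 1
        if ev["event_type"] == "page_view":
            current.page_views += 1
    return sessions
```

A streaming engine does the same grouping incrementally and decides when a session is closed from the watermark rather than from having seen all events.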

Funnel Computation

The stream processor also evaluates ordered conversion funnels defined by a sequence of event types. For example:

visited_product → added_to_cart → checkout → purchased

For each user session, the processor checks whether the required events appear in order within the session window. Conversion rates per funnel step are aggregated and written to a metrics store every minute. Sessions that drop out between steps (for example, users who completed step 2 but not step 3) are tracked separately for drop-off analysis.
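
One way to express the ordered check is a single pass that advances through the funnel as matching events appear. The sketch below assumes each session has already been reduced to a time-ordered list of event types; the function names are illustrative, and rates are computed relative to sessions that entered the funnel.

```python
from collections import Counter

FUNNEL = ["visited_product", "added_to_cart", "checkout", "purchased"]


def steps_reached(event_types: list[str], funnel: list[str] = FUNNEL) -> int:
    """Count how many funnel steps occur in order (other events may interleave)."""
    step = 0
    for event_type in event_types:
        if step < len(funnel) and event_type == funnel[step]:
            step += 1
    return step


def conversion_rates(sessions: list[list[str]],
                     funnel: list[str] = FUNNEL) -> list[float]:
    """For each step, the share of funnel-entering sessions that reached it."""
    reached = Counter()
    for event_types in sessions:
        for i in range(steps_reached(event_types, funnel)):
            reached[i] += 1
    entered = reached[0]
    return [reached[i] / entered if entered else 0.0 for i in range(len(funnel))]
```

A session whose steps_reached value is at least 2 but less than 3 is exactly the "completed step 2 but not step 3" case, so the same function can feed the drop-off tracking.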

Real-Time Aggregation

A second stream processor maintains real-time counters in Redis:

  • Page views per minute — INCR pv:{page_url}:{minute_bucket} with a 10-minute TTL
  • Active users per page — sliding window HyperLogLog for unique user_id approximation
  • Event type counts — per event_type per minute for anomaly detection

These Redis counters feed the real-time dashboard with sub-second latency.
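
The counter updates might look like the sketch below. The method names incr, expire, pfadd, and pfcount match the redis-py client, but the au: key prefix, the bucket format, and the FakeCounterRedis stand-in (which uses exact sets instead of HyperLogLog so the example runs without a server) are assumptions.

```python
from datetime import datetime, timezone

COUNTER_TTL_SECONDS = 600  # 10-minute TTL


def minute_bucket(ts: datetime) -> str:
    """Truncate a timezone-aware timestamp to its UTC minute (YYYYMMDDHHMM)."""
    return ts.astimezone(timezone.utc).strftime("%Y%m%d%H%M")


def record_page_view(r, page_url: str, user_id: str, ts: datetime) -> None:
    bucket = minute_bucket(ts)
    pv_key = f"pv:{page_url}:{bucket}"
    r.incr(pv_key)
    r.expire(pv_key, COUNTER_TTL_SECONDS)
    au_key = f"au:{page_url}:{bucket}"  # hypothetical key name for active users
    r.pfadd(au_key, user_id)
    r.expire(au_key, COUNTER_TTL_SECONDS)


class FakeCounterRedis:
    """In-memory stand-in for a Redis client; not a real HyperLogLog."""

    def __init__(self):
        self.counters, self.sets, self.ttls = {}, {}, {}

    def incr(self, key):
        self.counters[key] = self.counters.get(key, 0) + 1
        return self.counters[key]

    def expire(self, key, seconds):
        self.ttls[key] = seconds
        return True

    def pfadd(self, key, *values):
        before = len(self.sets.setdefault(key, set()))
        self.sets[key].update(values)
        return int(len(self.sets[key]) > before)

    def pfcount(self, *keys):
        return len(set().union(*(self.sets.get(k, set()) for k in keys)))
```

With one HyperLogLog per minute bucket, a sliding window of active users can be approximated by calling pfcount with the keys for the last few minutes, since PFCOUNT over several keys returns the cardinality of their union.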

Data Lake Sink

Raw enriched events are written in parallel to S3 or GCS in Parquet format, partitioned by date/hour. Parquet's columnar layout and compression typically reduce storage cost by 5-10x versus raw JSON. These files can be queried in place by Athena (AWS) or through BigQuery external tables (GCP) without a separate ETL load step, enabling ad-hoc analysis with standard SQL.
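
As a small illustration of date/hour partitioning, the function below derives an object-store prefix from an event's server timestamp. The Hive-style date=/hour= layout and the events root are assumptions; the original only states that data is partitioned by date and hour.

```python
from datetime import datetime, timezone


def partition_prefix(server_timestamp: str, root: str = "events") -> str:
    """Build a Hive-style date/hour prefix from an ISO-8601 timestamp with offset."""
    ts = datetime.fromisoformat(server_timestamp).astimezone(timezone.utc)
    return f"{root}/date={ts:%Y-%m-%d}/hour={ts:%H}/"
```

Normalizing to UTC before deriving the prefix keeps an event in one well-defined partition regardless of the offset on its timestamp.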

Deduplication

Deduplication is critical because delivery is not exactly-once: navigator.sendBeacon is fire-and-forget and gives no delivery confirmation, and SDK or network retries can repeat events. The stream processor records each recently seen event_id as its own Redis key with a 1-hour TTL, since members of a single Redis set cannot expire individually. On arrival, each event is checked with an atomic set-if-absent (SET with the NX and EX options): if the event_id already exists, the event is dropped. Only new event_ids are processed.
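
A minimal sketch of the check, assuming a redis-py style client whose set method accepts nx and ex keyword arguments. The dedup: key prefix, the function name, and the FakeDedupRedis stand-in (which ignores expiry) are illustrative.

```python
DEDUP_TTL_SECONDS = 3600  # 1-hour window


def is_first_sighting(r, event_id: str, ttl: int = DEDUP_TTL_SECONDS) -> bool:
    """Atomically claim an event_id; True means it was not seen within the TTL."""
    # With redis-py, SET ... NX EX returns True if the key was created, else None.
    return bool(r.set(f"dedup:{event_id}", "1", nx=True, ex=ttl))


class FakeDedupRedis:
    """In-memory stand-in so the sketch runs without a server; ignores expiry."""

    def __init__(self):
        self.keys = {}

    def set(self, key, value, nx=False, ex=None):
        if nx and key in self.keys:
            return None
        self.keys[key] = value
        return True
```

Doing the check and the write in one command avoids the race where two consumers both see a missing key and both process the same event.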

PII Handling

User privacy is enforced at multiple layers:

  • user_id is hashed (SHA-256 with a server-side salt) before being written to the analytics pipeline — the raw user_id never enters the data lake
  • Raw IP addresses are used only for geo enrichment; afterward only a hash is forwarded and the raw value is discarded
  • Any email or phone number detected in properties{} is stripped by the enrichment layer before Kafka ingestion
  • GDPR deletion requests trigger a backfill job that rewrites the affected Parquet partitions with hashed_user_id set to null (Parquet files are immutable, so rows cannot be updated in place)
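
The hashing and erasure steps can be sketched as below. The function names are assumptions, the salt would come from a server-side secret store, and erase_user works on in-memory rows to show the transformation that a backfill job would apply while rewriting a partition.

```python
import hashlib


def hash_user_id(raw_user_id: str, salt: bytes) -> str:
    """Salted SHA-256, hex-encoded; the same input and salt give the same output."""
    return hashlib.sha256(salt + raw_user_id.encode("utf-8")).hexdigest()


def erase_user(rows: list[dict], hashed_user_id: str) -> list[dict]:
    """Return rewritten rows with hashed_user_id nulled for the erased user."""
    return [
        {**row, "hashed_user_id": None}
        if row.get("hashed_user_id") == hashed_user_id else row
        for row in rows
    ]
```

Because the hash is deterministic for a given salt, a deletion request for a raw user_id can be translated into the hashed value and matched against stored rows without the raw identifier ever being written to the lake.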

Summary

The full pipeline flows from JS SDK → Beacon API → enrichment service → Kafka → Flink session stitching + funnel computation → Redis real-time counters and S3 Parquet data lake. PII hashing and stripping are enforced at ingestion time, before events reach Kafka, and deduplication by event_id happens in the stream processor. This architecture supports both sub-second operational dashboards and petabyte-scale historical analysis, and the free-form properties{} map lets new event attributes be added without a schema migration.
