Real-Time Analytics Platform Low-Level Design

Requirements

  • Ingest billions of events per day (page views, clicks, transactions, errors)
  • Support real-time dashboards with <10 second lag
  • Support ad-hoc historical queries (e.g., weekly active users, funnel analysis)
  • 1M events/second at peak

Lambda Architecture

Events → Kafka → Stream Layer → Real-time serving layer (Druid/Redis)
               → Batch Layer  → Historical DWH (BigQuery/Snowflake)
                              → Serving layer (materialized views)

Lambda architecture: the stream layer serves real-time data (last 24h) and the batch layer serves historical data (all-time). Kappa architecture is the alternative: process everything through the stream layer, backed by long-retention Kafka topics so that history can be replayed. It is simpler to operate but requires streaming SQL for complex aggregations.

Event Ingestion

Event(event_id UUID, event_type, user_id, session_id, properties JSON,
      client_timestamp, server_timestamp, app_version, platform)

Client SDKs batch events locally (10 events or 5 seconds, whichever comes first) and send them to the Collector API. The Collector validates each event, enriches it (adds server_timestamp, geolocation from IP, device type from user-agent), publishes it to Kafka, and returns 200 right away; aggregation and other heavy processing stay out of the request path. The client SDK retries failed sends with exponential backoff. The Kafka topic is partitioned by user_id so that each user's events are processed in order.
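The client-side batching and retry behaviour described above can be sketched roughly as follows. This is an illustrative sketch, not a real SDK: the EventBatcher class, its parameters, and the injected send callable (standing in for the HTTP call to the Collector API) are all hypothetical names, and the code is single-threaded for simplicity.

```python
import random
import time
from typing import Callable, Dict, List

Event = Dict[str, object]


class EventBatcher:
    """Buffers events and flushes on size or age, whichever comes first."""

    def __init__(self, send: Callable[[List[Event]], None],
                 max_events: int = 10, max_age_s: float = 5.0,
                 max_retries: int = 5, base_delay_s: float = 0.5,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self._send = send                  # hypothetical: POSTs a batch to the Collector
        self._max_events = max_events
        self._max_age_s = max_age_s
        self._max_retries = max_retries
        self._base_delay_s = base_delay_s
        self._clock = clock
        self._sleep = sleep
        self._buffer: List[Event] = []
        self._first_event_at = 0.0

    def track(self, event: Event) -> None:
        if not self._buffer:
            self._first_event_at = self._clock()
        self._buffer.append(event)
        if len(self._buffer) >= self._max_events or self._is_stale():
            self.flush()

    def tick(self) -> None:
        """Call from a periodic timer so that quiet clients still flush."""
        if self._buffer and self._is_stale():
            self.flush()

    def _is_stale(self) -> bool:
        return self._clock() - self._first_event_at >= self._max_age_s

    def flush(self) -> bool:
        """Send the buffered batch; return True on success."""
        if not self._buffer:
            return True
        batch, self._buffer = self._buffer, []
        for attempt in range(self._max_retries + 1):
            try:
                self._send(batch)
                return True
            except Exception:  # a real SDK would catch only network errors
                if attempt == self._max_retries:
                    break
                # Exponential backoff with full jitter: 0..base * 2^attempt seconds.
                self._sleep(random.uniform(0, self._base_delay_s * 2 ** attempt))
        # Retries exhausted: keep the events so a later flush can try again.
        self._buffer = batch + self._buffer
        return False
```

A retried batch may reach the Collector twice (for example when the response is lost), so downstream consumers would likely need to deduplicate, for instance on event_id.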

Real-time aggregations computed on the stream:

  • Windowed counts: COUNT events per (event_type, 1-minute window). Write to Druid or ClickHouse.
  • Unique users: HyperLogLog approximation (Redis PFADD per (event_type, hour)) for distinct user counts without storing all user IDs.
  • Funnel steps: sessionization — group events by session_id within a 30-minute window. For each session, track which funnel steps were completed.
  • Anomaly detection: compare the current metric against its rolling average; alert when it deviates by more than 3 standard deviations.
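Two of the aggregations above, windowed counts and the 3-standard-deviation check, can be sketched in plain Python. This is a minimal in-memory illustration under stated assumptions: events arrive as (event_type, timestamp) pairs, and the names window_counts and RollingAnomalyDetector are hypothetical. A real stream processor would also have to handle late and out-of-order events.

```python
from collections import Counter, deque
from statistics import mean, pstdev
from typing import Deque, Dict, Iterable, Tuple

WINDOW_S = 60  # 1-minute tumbling windows


def window_counts(events: Iterable[Tuple[str, float]]) -> Dict[Tuple[str, int], int]:
    """Count events per (event_type, window_start_epoch_seconds)."""
    counts: Counter = Counter()
    for event_type, ts in events:
        window_start = int(ts // WINDOW_S) * WINDOW_S
        counts[(event_type, window_start)] += 1
    return dict(counts)


class RollingAnomalyDetector:
    """Flags values more than `threshold` standard deviations from the rolling mean."""

    def __init__(self, history: int = 60, threshold: float = 3.0,
                 min_samples: int = 10) -> None:
        self._values: Deque[float] = deque(maxlen=history)
        self._threshold = threshold
        self._min_samples = min_samples

    def observe(self, value: float) -> bool:
        is_anomaly = False
        if len(self._values) >= self._min_samples:
            mu = mean(self._values)
            sigma = pstdev(self._values)
            # A perfectly flat history has no spread to compare against, so skip it.
            if sigma > 0 and abs(value - mu) > self._threshold * sigma:
                is_anomaly = True
        # The value joins the history either way; excluding anomalies is a design choice.
        self._values.append(value)
        return is_anomaly
```

Feeding each completed window's count for one event_type into observe() gives one anomaly decision per minute.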

Real-Time Serving (Druid / ClickHouse)

Druid: append-only columnar database optimized for time-series aggregation queries. Ingests from Kafka in real time and serves sub-second GROUP BY queries on billions of rows. Schema: timestamp, dimensions (event_type, platform, country), metrics (count, sum, cardinality). Example query: SELECT platform, COUNT(*) AS events FROM events WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR AND event_type = 'page_view' GROUP BY platform.

ClickHouse: similar but simpler to operate. MergeTree engine, columnar storage, vectorized query execution. Excellent for ad-hoc analytics queries.

Batch Processing (Data Warehouse)

Kafka → S3 (raw event archive, Parquet format, partitioned by date) → Spark/dbt → BigQuery/Snowflake. Batch jobs run hourly. Produce: daily active users (DAU), weekly active users (WAU), retention cohorts, conversion funnels, revenue attribution. These require complex joins and multi-day windows not suited for streaming. Results stored in DWH tables, queryable by data analysts via SQL and BI tools (Looker, Superset).
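To illustrate why multi-day metrics fit the batch layer: WAU is not the sum of seven DAU values, because a user active on several days must be counted once, so the job needs the user sets for the whole week. The following is a small in-memory sketch of that logic; in the pipeline above it would be a Spark or SQL job. The function names and the (user_id, day) input shape are assumptions for illustration.

```python
from datetime import date, timedelta
from typing import Dict, Iterable, Set, Tuple


def daily_active_users(events: Iterable[Tuple[str, date]]) -> Dict[date, Set[str]]:
    """Group distinct user IDs by the day on which they were active."""
    by_day: Dict[date, Set[str]] = {}
    for user_id, day in events:
        by_day.setdefault(day, set()).add(user_id)
    return by_day


def weekly_active_users(by_day: Dict[date, Set[str]], end_day: date) -> int:
    """Distinct users over the 7 days ending at end_day (inclusive)."""
    users: Set[str] = set()
    for offset in range(7):
        users |= by_day.get(end_day - timedelta(days=offset), set())
    return len(users)


if __name__ == "__main__":
    d1, d2 = date(2024, 1, 1), date(2024, 1, 2)
    by_day = daily_active_users([("a", d1), ("b", d1), ("a", d2), ("c", d2)])
    print({d.isoformat(): len(u) for d, u in by_day.items()})  # DAU per day
    print(weekly_active_users(by_day, d2))                     # WAU ending on d2
```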

HyperLogLog for Unique User Counts

Exact distinct counts require storing every user ID, which takes O(n) memory. HyperLogLog (HLL) approximates distinct counts with roughly 1% error in a small, fixed amount of memory: m registers of O(log log n) bits each. Redis uses 16,384 registers, about 12 KB per key, for a standard error of 0.81%. Redis PFADD adds elements to an HLL sketch; PFCOUNT returns the approximate cardinality; PFMERGE combines sketches (for example, hourly into daily). Use HLL for daily active users, weekly active users, and unique visitors per page. When exact counts are required (billing), use a Redis bitmap instead (SETBIT to mark a user, BITCOUNT to count), which works when user IDs are integers in a known range.
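To make the sketch idea concrete, here is a minimal pure-Python HyperLogLog. It is a simplified illustration of the technique, not how Redis implements it: it assumes SHA-1 as the hash, 2^14 registers, and only the small-range (linear counting) correction. The class and method names are hypothetical; add, count, and merge loosely mirror PFADD, PFCOUNT, and PFMERGE.

```python
import hashlib
import math
from typing import List


class HyperLogLog:
    def __init__(self, p: int = 14) -> None:
        self.p = p                  # number of index bits
        self.m = 1 << p             # number of registers (16384 for p=14)
        self.registers: List[int] = [0] * self.m

    def add(self, item: str) -> None:
        # Take 64 bits of the hash: the first p bits pick a register,
        # the remaining 64-p bits feed the leading-zero count.
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        # rank = 1-based position of the leftmost 1-bit in `rest`
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self) -> int:
        alpha = 0.7213 / (1 + 1.079 / self.m)   # bias constant, valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small cardinalities: linear counting is more accurate.
            return round(self.m * math.log(self.m / zeros))
        return round(raw)

    def merge(self, other: "HyperLogLog") -> None:
        """Union of two sketches: element-wise max of the registers."""
        if other.p != self.p:
            raise ValueError("sketches must use the same precision")
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]


if __name__ == "__main__":
    hour1, hour2 = HyperLogLog(), HyperLogLog()
    for i in range(10_000):
        hour1.add(f"user-{i}")
    for i in range(5_000, 15_000):
        hour2.add(f"user-{i}")
    hour1.merge(hour2)              # hourly sketches combined into one
    print(hour1.count())            # approximately 15000
```

Because merging takes the register-wise maximum, users seen in both hours are not double counted, which is what makes combining hourly sketches into daily ones safe.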

Dashboard Real-Time Updates

Dashboards poll every 10 seconds via REST or subscribe via WebSocket to a metrics stream. The real-time serving layer (Druid/ClickHouse) handles concurrent dashboard queries efficiently thanks to columnar storage and pre-aggregated segments. To reduce query load, pre-compute common dashboard queries on a schedule (every minute) and cache the results in Redis so that dashboards serve from cache. Cached panels can then be up to a minute stale, so panels that must meet the <10 second lag target should query the serving layer directly or be refreshed more often. Alert thresholds are monitored by a separate alerting service that queries the stream layer every minute.
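The pre-compute-and-cache pattern can be sketched as below. This is an illustrative sketch only: an in-process dict stands in for Redis, and DashboardCache, run_query, and the query names are hypothetical. The key property is that dashboard reads never reach the serving layer; only the scheduled refresh does.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class DashboardCache:
    def __init__(self, run_query: Callable[[str], Any], queries: Dict[str, str],
                 clock: Callable[[], float] = time.time) -> None:
        self._run_query = run_query      # hypothetical: executes SQL on Druid/ClickHouse
        self._queries = queries          # dashboard panel name -> SQL text
        self._clock = clock
        self._results: Dict[str, Tuple[float, Any]] = {}  # stand-in for Redis

    def refresh_all(self) -> None:
        """Run by a scheduler (for example once a minute), not by dashboards."""
        for name, sql in self._queries.items():
            self._results[name] = (self._clock(), self._run_query(sql))

    def get(self, name: str, max_age_s: float = 120.0) -> Optional[Any]:
        """Serve a panel from cache; None if missing or the refresher has stalled."""
        entry = self._results.get(name)
        if entry is None:
            return None
        computed_at, value = entry
        if self._clock() - computed_at > max_age_s:
            return None
        return value
```

With this shape, query load on the serving layer depends on the number of cached panels and the refresh interval rather than on the number of open dashboards.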

Key Design Decisions

  • Kafka as the central event bus — decouples ingestion from all downstream processing
  • Lambda architecture: stream for real-time (Druid), batch for historical (DWH)
  • HyperLogLog for approximate unique counts at scale — exact counting requires too much memory
  • Columnar storage (Druid, ClickHouse, BigQuery) is essential for aggregation query performance on billions of rows
  • Client-side batching reduces ingestion API load by 10-100x vs per-event HTTP calls

