Question 1

What is the Lambda architecture for analytics?

Accepted Answer

Lambda architecture runs two parallel processing layers. The Speed Layer processes recent events (the last few hours or days) in real time using stream processing (Flink, Kafka Streams) and serves low-latency queries, at the cost of slight approximation. The Batch Layer periodically (hourly or daily) reprocesses all historical data using batch processing (Spark), producing accurate but higher-latency results stored in a data warehouse. The Serving Layer merges results from both layers. Use Lambda when you need both real-time metrics and accurate historical analysis, and the two have different latency/accuracy trade-offs. The alternative is Kappa architecture: process everything through a single stream layer backed by long-retention Kafka. Kappa is simpler to operate (no dual code paths), but complex historical analyses that require multiple passes over the data are harder to implement.
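The serving-layer merge can be illustrated with a minimal Python sketch. It assumes both layers expose simple per-metric counters and that the speed view only covers events newer than the last batch run; the names merge_views, batch_view and speed_view are hypothetical, not part of any framework.

```python
from collections import Counter

def merge_views(batch_view, speed_view):
    """Serving-layer merge: batch totals plus speed-layer increments.

    Assumes the speed view only counts events that arrived after the
    last batch run, so adding the two never double-counts.
    """
    merged = Counter(batch_view)
    merged.update(speed_view)  # Counter.update adds counts per key
    return dict(merged)

# Hypothetical views: accurate batch totals and recent real-time deltas.
batch_view = {"page_views": 1_000_000, "purchases": 12_500}
speed_view = {"page_views": 4_200, "signups": 37}

merged = merge_views(batch_view, speed_view)
print(merged)
```

In a real deployment the hard part is the cutoff: when a new batch run lands, the speed layer has to drop the increments that the batch view now includes.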

Question 2

How does Kafka act as the backbone of an analytics pipeline?

Accepted Answer

Kafka is the central event bus that decouples producers (apps sending events) from consumers (analytics pipelines). Producers write each event once, and multiple consumers read it independently: a stream processor (Flink/Kafka Streams) for real-time aggregations, an S3 sink connector for raw event archival, a data warehouse loader for batch processing, and an alerting service for anomaly detection. Kafka retains events for a configurable period (7 days by default; set it longer to support replay). After a failure, a consumer resumes from its last committed offset, so events processed after that commit may be delivered again. Partitioning by user_id sends all events for a user to the same partition, where they stay ordered, which enables sessionization. Each pipeline uses its own consumer group, so the same events feed independent pipelines without coupling.
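The partitioning and offset behaviour described above can be modelled with a toy in-memory log. This is a plain-Python illustration, not the Kafka client API: ToyLog, partition_for and the group names are hypothetical, and crc32 stands in for Kafka's own key hashing.

```python
import zlib

NUM_PARTITIONS = 4  # assumed partition count for the illustration

def partition_for(user_id: str) -> int:
    # Stable hash of the key; a stand-in for Kafka's default partitioner.
    return zlib.crc32(user_id.encode("utf-8")) % NUM_PARTITIONS

class ToyLog:
    """Append-only partitions plus per-group committed offsets."""

    def __init__(self):
        self.partitions = [[] for _ in range(NUM_PARTITIONS)]
        self.committed = {}  # (group, partition) -> next offset to read

    def produce(self, user_id, event):
        p = partition_for(user_id)
        self.partitions[p].append((user_id, event))
        return p

    def poll(self, group, partition):
        # Each group reads from its own committed offset.
        start = self.committed.get((group, partition), 0)
        return self.partitions[partition][start:]

    def commit(self, group, partition, offset):
        self.committed[(group, partition)] = offset

log = ToyLog()
p = log.produce("u1", "view")
log.produce("u1", "click")
log.produce("u1", "purchase")

# The "realtime" group processes two events, commits, then restarts.
log.commit("realtime", p, 2)
after_restart = log.poll("realtime", p)   # resumes at the committed offset
archive_view = log.poll("archive", p)     # a separate group sees everything
```

Because the offsets are keyed by group, the archive pipeline is unaffected by how far the real-time pipeline has read.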

Question 3

How do you count unique users at scale without storing all user IDs?

Accepted Answer

HyperLogLog (HLL) approximates distinct counts with ~0.81% standard error (in Redis) using O(log log n) memory: for 1 billion users, a Redis HLL sketch uses at most ~12KB, versus ~125MB for an exact bitmap over dense integer IDs. In Redis, PFADD event_hll:{event_type}:{hour} {user_id} adds a user to the HLL. PFCOUNT event_hll:{event_type}:{hour} returns the approximate unique count. PFMERGE event_hll:{event_type}:{day} hour1 hour2 ... hour24 combines the hourly sketches into a daily sketch; PFCOUNT on that key returns the daily unique count. An error of this size is acceptable for dashboards. For exact counts (required for billing or compliance), use a Redis bitmap if user IDs are dense integers (SETBIT bitcount:{day} {user_id} 1; BITCOUNT bitcount:{day}). 1 billion users = 1 billion bits = 125MB per bitmap, which is feasible.
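To show where the ~12KB and the mergeability come from, here is a minimal HyperLogLog sketch in pure Python. It is an illustration under stated assumptions, not Redis's implementation: it uses 2^14 registers (which would need about 12KB at 6 bits each, though a Python list uses more), SHA-1 as the hash, and only the basic small-range correction.

```python
import hashlib
import math

class HyperLogLog:
    def __init__(self, p=14):
        self.p = p                  # number of index bits
        self.m = 1 << p             # number of registers (16384 for p=14)
        self.registers = [0] * self.m

    def add(self, item: str) -> None:
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")          # 64-bit hash
        idx = h >> (64 - self.p)                       # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)          # remaining 64-p bits
        # Rank = position of the leftmost 1 bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other: "HyperLogLog") -> "HyperLogLog":
        # Union of two sketches is the element-wise maximum of registers.
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def count(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw

# Two hypothetical hourly sketches with overlapping users.
hour1, hour2 = HyperLogLog(), HyperLogLog()
for i in range(0, 120_000):
    hour1.add(f"user-{i}")
for i in range(80_000, 200_000):
    hour2.add(f"user-{i}")

day = hour1.merge(hour2)   # analogous to PFMERGE
estimate = day.count()     # analogous to PFCOUNT; true value is 200,000
```

Merging takes the maximum per register, so users seen in both hours are not double-counted, which is why hourly sketches can be rolled up into a daily count.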

Question 4

Why is a columnar database (Druid, ClickHouse, BigQuery) necessary for analytics?

Accepted Answer

OLTP databases (PostgreSQL, MySQL) store data row by row, which is efficient for fetching complete records by primary key. Analytics queries aggregate across millions of rows but need only a few columns: SELECT country, COUNT(*) FROM events WHERE event_type='purchase' GROUP BY country. Row storage must read every column of each row it scans, even though only 2 are needed. Columnar storage (Druid, ClickHouse, BigQuery) stores each column as a contiguous block, so the query reads only the event_type and country columns and skips all the others. Columnar storage also enables vectorized execution (SIMD instructions typically process 8-16 values per CPU instruction) and excellent compression (sorted column values can compress 10-50x). Result: analytical queries often run 10-100x faster than on row-based storage for the same data.
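The difference between the two layouts can be sketched in a few lines of Python. The data and the helper names (purchases_by_country, run_length_encode) are hypothetical; real columnar engines use far more sophisticated encodings, so treat this only as an illustration of the idea.

```python
from collections import Counter
from itertools import groupby

# Row layout: each record carries every column.
rows = [
    {"user_id": 1, "event_type": "purchase", "country": "US", "amount": 30.0},
    {"user_id": 2, "event_type": "view",     "country": "US", "amount": 0.0},
    {"user_id": 3, "event_type": "purchase", "country": "DE", "amount": 12.5},
    {"user_id": 4, "event_type": "purchase", "country": "US", "amount": 8.0},
]

# Columnar layout: one contiguous list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}

def purchases_by_country(cols):
    """SELECT country, COUNT(*) ... WHERE event_type='purchase' GROUP BY country.

    Touches only the event_type and country columns; user_id and amount
    are never read.
    """
    return Counter(
        country
        for event_type, country in zip(cols["event_type"], cols["country"])
        if event_type == "purchase"
    )

def run_length_encode(values):
    """Collapse runs of equal values, the simplest columnar compression."""
    return [(value, len(list(run))) for value, run in groupby(values)]

result = purchases_by_country(columns)
encoded = run_length_encode(sorted(columns["country"]))
```

Sorting the country column before encoding turns four values into two (value, run length) pairs, which hints at why sorted columns compress so well.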

Question 5

How do you implement session analysis in a real-time analytics pipeline?

Accepted Answer

Sessionization groups user events into sessions separated by inactivity gaps (typically 30 minutes). For stream-based sessionization, use Flink session windows (gap-based: a session ends after 30 minutes with no events for that user). Each session produces a session record: (user_id, session_id, start_time, end_time, event_count, pages_viewed[], first_event, last_event). Challenge: if a user's events are spread across partitions, no single processor sees the whole session. Solution: partition Kafka by user_id so all events for a user go to the same stream processor node. The processor maintains per-user state (last event time, current session) and emits the session record when the gap timer fires. Late events (for example, from network delay): use watermarks that tolerate events arriving up to 5 seconds late, then close the window.
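The gap logic itself can be shown in plain Python. This is a simplified sketch, not the Flink API: it assumes events are available as (user_id, timestamp_seconds) pairs, sorts them by time instead of handling watermarks or late data, and tracks only a subset of the session record fields. Session and sessionize are hypothetical names.

```python
from dataclasses import dataclass

SESSION_GAP_SECONDS = 30 * 60  # 30 minutes of inactivity closes a session

@dataclass
class Session:
    user_id: str
    start_time: int
    end_time: int
    event_count: int

def sessionize(events, gap=SESSION_GAP_SECONDS):
    """Group (user_id, timestamp) events into gap-separated sessions."""
    open_sessions = {}   # per-user state: the session currently in progress
    closed = []
    for user_id, ts in sorted(events, key=lambda e: e[1]):
        current = open_sessions.get(user_id)
        if current is not None and ts - current.end_time > gap:
            closed.append(current)   # gap exceeded: emit the old session
            current = None
        if current is None:
            open_sessions[user_id] = Session(user_id, ts, ts, 1)
        else:
            current.end_time = ts
            current.event_count += 1
    # End of input: flush sessions still open.
    closed.extend(open_sessions.values())
    return closed

events = [("u1", 0), ("u2", 100), ("u1", 600), ("u1", 4000)]
sessions = sessionize(events)
```

Here u1's event at 4000 arrives 3400 seconds after the previous one, so it starts a second session. A streaming implementation would emit the first session from a timer rather than waiting for the next event.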

Real-Time Analytics Platform Low-Level Design

Requirements

Lambda Architecture

Event Ingestion

Stream Processing (Flink / Kafka Streams)

Real-Time Serving (Druid / ClickHouse)

Batch Processing (Data Warehouse)

HyperLogLog for Unique User Counts

Dashboard Real-Time Updates

Key Design Decisions