Real-Time Analytics Pipeline: Low-Level Design

A real-time analytics pipeline processes streams of events and makes the results available for querying within seconds — enabling live dashboards, instant fraud detection, and real-time personalization. Unlike batch processing (which processes accumulated data on a schedule, often nightly), streaming analytics trades some completeness for immediacy. The architecture involves event ingestion, stream processing, a query layer, and a serving store.

Event Ingestion Layer

Events (page views, clicks, transactions, sensor readings) arrive at high volume and must be captured without data loss. Kafka is the standard ingestion buffer: events are written to Kafka topics; durability (typically replication factor 3) and ordered-within-partition guarantees make Kafka suitable as the source of truth for event streams. Kinesis Data Streams is a managed AWS alternative with similar semantics.

Producer design: batch events on the client side (web browser or mobile app) — send a batch every 5 seconds or when 50 events accumulate. Batching reduces network overhead by 10-50x.

Event schema: use a schema registry (Confluent Schema Registry, AWS Glue) with Avro or Protobuf schemas. Backward-compatible changes (adding optional fields) are safe; breaking changes require a new schema version. Strong schemas prevent the schema drift that corrupts downstream processing.

Partitioning strategy: partition by user_id (all events from a user land in the same partition, enabling per-user session reconstruction) or by event_type (all click events in one partition, so type-specific consumers avoid reading irrelevant events).
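The client-side batching rule above (flush every 5 seconds or at 50 events) can be sketched as a small buffer class. This is an illustrative sketch, not a real producer: `EventBatcher` and its `send` callback are hypothetical names standing in for the network call to the ingestion endpoint, and the clock is injectable so the time-based flush can be tested.

```python
import time

class EventBatcher:
    """Client-side event batcher (hypothetical sketch): flushes when the
    buffer reaches max_events or when max_age_seconds have passed since
    the first buffered event. `send` stands in for the HTTP/Kafka call."""

    def __init__(self, send, max_events=50, max_age_seconds=5.0,
                 clock=time.monotonic):
        self.send = send              # callable taking a list of events
        self.max_events = max_events
        self.max_age = max_age_seconds
        self.clock = clock            # injectable for testing
        self.buffer = []
        self.first_event_at = None

    def add(self, event):
        if self.first_event_at is None:
            self.first_event_at = self.clock()
        self.buffer.append(event)
        age = self.clock() - self.first_event_at
        if len(self.buffer) >= self.max_events or age >= self.max_age:
            self.flush()

    def flush(self):
        """Send the current batch (one network call) and reset the buffer."""
        if self.buffer:
            self.send(self.buffer)
            self.buffer = []
            self.first_event_at = None
```

In a real Kafka producer the same idea appears as the `linger.ms` and `batch.size` settings; the sketch just makes the size-or-age trigger explicit.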

Stream Processing

Stream processors consume events from Kafka, apply transformations and aggregations, and write results to a serving store. Apache Flink is the leading stateful stream processing framework: low latency (milliseconds), exactly-once semantics, and rich windowing operations. Apache Spark Structured Streaming is simpler but, with its micro-batch model, has higher latency (seconds). Key stream processing operations: (1) Filtering: discard events that don’t match criteria (e.g., only process events from paying users). (2) Enrichment: join the event stream with a side table (user profile from Redis or a broadcast join) to add context — for example, a page_view event enriched with user_country and user_plan. (3) Windowed aggregations: count page views per URL per 5-minute tumbling window. Sliding window: count events in the last 60 minutes, updated every minute. Session window: group events from the same user where gaps between events are < 30 minutes. (4) Pattern detection: CEP (complex event processing) — detect when event A is followed by event B within 60 seconds (e.g., add-to-cart followed by checkout). Flink's CEP library supports this natively.
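To make the tumbling-window aggregation concrete, here is a minimal sketch of counting page views per URL per 5-minute window. It is a batch re-statement of the streaming operation (a real processor like Flink maintains this state incrementally and emits results as windows close); the event shape with `ts` and `url` fields is an assumption for illustration.

```python
from collections import defaultdict

WINDOW_SECONDS = 300  # 5-minute tumbling windows

def tumbling_counts(events):
    """Count page views per (window_start, url). Each event is a dict
    with 'ts' (epoch seconds) and 'url'. The window an event belongs to
    is found by truncating its timestamp to the window boundary."""
    counts = defaultdict(int)
    for e in events:
        window_start = (e["ts"] // WINDOW_SECONDS) * WINDOW_SECONDS
        counts[(window_start, e["url"])] += 1
    return dict(counts)
```

Sliding and session windows differ only in how the window an event belongs to is computed: a sliding window assigns each event to several overlapping windows, and a session window extends an open window whenever the gap to the previous event is under the threshold.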

Watermarks and Late-Arriving Events

Events don’t arrive at the processor in the order they occurred — network delays, mobile offline buffering, and batching cause late arrivals. A page view that happened at 12:00:30 may arrive at the processor at 12:01:15. If you close the 12:00-12:01 window at 12:01, you miss this event. Watermarks: a watermark is an assertion that all events with timestamps up to T are assumed to have arrived (with configurable lateness tolerance). The processor closes a time window only when the watermark passes the window end. Watermark = max_observed_event_time − allowed_lateness. With 30-second allowed lateness: the 12:00-12:01 window stays open until an event with timestamp 12:01:30 or later is observed. Events arriving after the watermark (very late events) are handled per policy: drop them (default), update the result and emit a correction, or route to a side output for separate handling. Trade-off: more allowed lateness means more complete results but higher result latency. For fraud detection (latency-critical): use 5-second allowed lateness. For daily reporting: use 15-minute allowed lateness.
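The watermark mechanics above can be sketched for a single window. This is an illustrative model, not Flink's API: the watermark is the maximum event timestamp seen so far minus the allowed lateness, the window closes once the watermark passes its end, and in-window events arriving after closure are dropped (the default policy).

```python
class WatermarkedWindow:
    """One tumbling window [start, end), illustrative sketch.
    Counts in-window events until the watermark passes `end`,
    then drops any further in-window (very late) events."""

    def __init__(self, start, end, allowed_lateness):
        self.start, self.end = start, end
        self.allowed_lateness = allowed_lateness
        self.max_event_time = float("-inf")
        self.count = 0
        self.closed = False
        self.late_dropped = 0

    def observe(self, event_time):
        # Watermark = max observed event time - allowed lateness.
        self.max_event_time = max(self.max_event_time, event_time)
        watermark = self.max_event_time - self.allowed_lateness
        if watermark >= self.end:
            self.closed = True           # window can no longer accept events
        if self.start <= event_time < self.end:
            if self.closed:
                self.late_dropped += 1   # drop-by-default policy
            else:
                self.count += 1
```

With a [0, 60) window and 30 seconds of lateness, an event stamped 50 that arrives after one stamped 70 still counts (watermark is only 40), while one arriving after an event stamped 95 is dropped (watermark 65 has closed the window).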

Serving Store and Query Layer

Stream processing results are written to a serving store optimized for real-time queries. Options by use case: (1) Redis: for metrics dashboards and counter-based queries. Store aggregated results as Redis hashes or sorted sets, or as plain counters: INCR page_views:{url}:{minute_bucket} on each event; the dashboard reads the counters back. Sub-millisecond reads. (2) Apache Druid: a columnar OLAP database optimized for time-series analytics queries — “total page views by country for the last 3 hours, broken down by hour”. Druid ingests from Kafka directly (real-time ingestion) and supports sub-second queries on billions of rows. Used by Netflix and Airbnb for real-time analytics. (3) ClickHouse: a column-oriented database with excellent aggregation performance. INSERT events directly from Kafka (via Kafka engine tables); query with SQL. (4) Apache Pinot: similar to Druid, optimized for user-facing analytics with consistent low-latency responses. Used by LinkedIn for real-time member analytics.
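The Redis counter pattern from option (1) can be sketched with an in-memory stand-in. The dict below substitutes for Redis so the example is runnable; with a real client the two operations map to `r.incr(key)` and `r.get(key)`. The class and method names are hypothetical.

```python
class MinuteCounters:
    """In-memory stand-in for the Redis counter pattern:
    INCR page_views:{url}:{minute_bucket} on each event,
    GET the same key from the dashboard."""

    def __init__(self):
        self.store = {}  # substitutes for Redis

    @staticmethod
    def key(url, ts):
        """Bucket the timestamp (epoch seconds) into a minute."""
        minute_bucket = int(ts) // 60
        return f"page_views:{url}:{minute_bucket}"

    def record_view(self, url, ts):
        k = self.key(url, ts)
        self.store[k] = self.store.get(k, 0) + 1   # INCR

    def views_in_minute(self, url, ts):
        return self.store.get(self.key(url, ts), 0)  # GET
```

Because each key names a fixed minute bucket, counters for past minutes are immutable once the clock moves on, which also makes them easy to expire with a TTL.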

Lambda and Kappa Architecture

Lambda architecture combines batch and stream processing: (1) Speed layer: stream processing (Flink) produces approximate real-time results — fast but may have slight inaccuracies (late events, deduplication issues). (2) Batch layer: periodic batch job (Spark, BigQuery) reprocesses all historical events and produces accurate results. (3) Serving layer: merges speed layer and batch layer results — the batch layer corrects the speed layer’s approximations. Complexity: maintaining two processing pipelines (batch and streaming) with equivalent logic is expensive. Kappa architecture simplifies this: process everything as a stream using a durable, replayable event log (Kafka with long retention). When you need to reprocess (bug fix, new metric): replay the Kafka log from the beginning through the stream processor. No separate batch layer. Modern systems favor Kappa for its simplicity. Flink’s exactly-once semantics and Kafka’s log replay make Kappa viable for most analytics use cases that previously required Lambda.
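The Kappa reprocessing story can be sketched with a minimal replayable log. The `ReplayableLog` class below is a hypothetical stand-in for a Kafka topic with long retention: after fixing the per-event logic, you replay the whole log through the processor and get corrected results, with no separate batch layer.

```python
class ReplayableLog:
    """Stand-in for a Kafka topic with long retention: an append-only
    record list that any consumer can re-read from offset 0."""

    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)

    def replay(self, from_offset=0):
        return iter(self.records[from_offset:])

def process(log, classify):
    """Run per-event logic over the full log from the beginning.
    `classify` maps an event to a count key, or None to skip it;
    swapping in corrected logic and replaying fixes past results."""
    totals = {}
    for event in log.replay():
        key = classify(event)
        if key is not None:
            totals[key] = totals.get(key, 0) + 1
    return totals
```

For example, if the original logic counted bot traffic by mistake, deploying a `classify` that skips bot events and replaying from offset 0 yields the corrected totals — the same correction a Lambda batch layer would have produced.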
