Low Level Design: Real-Time Analytics Platform

Introduction

Real-time analytics surfaces business metrics such as daily active users (DAU), revenue, and conversion rates with sub-second query latency on continuously updating data. Unlike batch reporting that runs overnight, a real-time analytics platform must ingest high-throughput event streams, pre-aggregate results, and serve dashboards with live updates — all simultaneously.

Lambda Architecture

Lambda architecture combines two processing layers. The batch layer (Hadoop or Spark) reprocesses the full dataset periodically, producing accurate but high-latency results. The speed layer (Flink or Storm) processes the real-time stream and produces approximate, low-latency results. A query merges results from both layers to answer requests. The main complexity is maintaining two separate codepaths that must produce semantically equivalent results. Kappa architecture simplifies this by eliminating the batch layer entirely, using only the streaming layer backed by a replayable Kafka log.
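
The query-time merge can be sketched as follows. This is a minimal illustration, not a specific framework's API: the batch view covers fully processed history, the speed view covers events since the last batch run, and the `batch_horizon` cutoff is an assumed convention for deciding which layer to trust for each hour.

```python
# Merge batch-layer and speed-layer results for a count-style metric.
# Both views map hour -> value; names and shapes are illustrative.

def merge_views(batch_view: dict, speed_view: dict, batch_horizon: str) -> dict:
    """Combine the two layers; batch_horizon is the last hour the batch
    layer fully covers, so speed-layer values are only used after it."""
    merged = dict(batch_view)
    for hour, value in speed_view.items():
        if hour > batch_horizon:          # speed layer fills the recent gap
            merged[hour] = merged.get(hour, 0) + value
    return merged

batch = {"2024-06-01T00": 1200, "2024-06-01T01": 980}   # accurate, high latency
speed = {"2024-06-01T01": 15, "2024-06-01T02": 340}     # approximate, fresh
print(merge_views(batch, speed, batch_horizon="2024-06-01T01"))
```

The merge rule itself is one of the "two codepaths" that must stay semantically consistent: if the batch and speed layers disagree on metric definitions, this function silently produces wrong answers.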

Kappa Architecture

Kappa architecture uses a single streaming pipeline with Kafka as the durable, replayable log. All processing runs in Flink or Spark Streaming. Historical reprocessing is achieved by replaying Kafka from offset 0 into a new parallel job, allowing the new job to catch up before the old one is retired. Operations are simpler because only one codebase exists. The trade-off is higher Kafka storage cost, since logs must be retained long enough to support full reprocessing.
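
The replay-based reprocessing step can be illustrated with a toy in-memory stand-in for the Kafka log. The scenario (a v2 job fixing a weighting bug in v1) and all names are hypothetical; the point is that both versions run the same single codepath over the same replayable log.

```python
# Toy Kappa reprocessing: the log is replayable, so a corrected job (v2)
# rebuilds state from offset 0 while v1 keeps serving. The list stands in
# for a Kafka topic; the "bug" being fixed is illustrative.

log = [("page_view", 1), ("click", 1), ("page_view", 1), ("click", 1)]

def run_job(events, weight_clicks: bool) -> dict:
    """Aggregate event counts; v2 adds a hypothetical click weighting fix."""
    state = {}
    for event_type, n in events:
        w = 2 if (weight_clicks and event_type == "click") else 1
        state[event_type] = state.get(event_type, 0) + n * w
    return state

v1_state = run_job(log, weight_clicks=False)   # old job, still serving reads
v2_state = run_job(log, weight_clicks=True)    # new job, replaying from offset 0
# Once v2 reaches the head of the log, traffic cuts over and v1 is retired.
print(v1_state, v2_state)
```

In a real deployment the cut-over is gated on consumer lag: v2 serves only after its lag on the topic drops near zero.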

Stream Ingestion

Events — page views, clicks, and transactions — are published to Kafka, keyed by user_id so that all events in a user session land on the same partition and preserve order (including event_type in the partition key would spread a session's events across partitions and break that guarantee; event_type is better used to choose the topic). Each analytics use case has its own consumer group, enabling independent consumption rates. A Flink job consumes events, applies tumbling windows (1-minute, 5-minute) and sliding windows (1-hour with 5-minute slide) to compute aggregates, and writes the results to the OLAP store. Backpressure is managed via Kafka consumer lag monitoring; additional Flink task slots are provisioned when lag exceeds thresholds.
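
The tumbling-window aggregation the Flink job performs can be sketched over an in-memory event list. The 1-minute window size mirrors the text; the event shape is an assumption.

```python
# Minimal tumbling-window count: bucket each event into the window whose
# start is its timestamp truncated to the window size.
from collections import defaultdict

WINDOW_SECONDS = 60  # 1-minute tumbling windows

def tumbling_counts(events):
    """events: (epoch_seconds, event_type) -> {(window_start, type): count}"""
    counts = defaultdict(int)
    for ts, event_type in events:
        window_start = ts - ts % WINDOW_SECONDS   # truncate to window start
        counts[(window_start, event_type)] += 1
    return dict(counts)

events = [(0, "click"), (30, "click"), (61, "page_view"), (119, "click")]
print(tumbling_counts(events))
# {(0, 'click'): 2, (60, 'page_view'): 1, (60, 'click'): 1}
```

Flink additionally handles event-time watermarks and late data, which this sketch omits.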

OLAP Store Options

Several OLAP stores serve different query patterns. Apache Druid supports real-time ingestion from Kafka, is column-oriented, and performs pre-aggregated rollups at ingestion time, enabling sub-second query latency. ClickHouse offers extremely high insert throughput, excellent compression ratios, and a familiar SQL interface, making it popular for event analytics. Apache Pinot provides low-latency queries and upsert support for mutable event records. The choice among them depends on query patterns: Druid for time-series rollups, ClickHouse for analytical SQL, Pinot for upsert-heavy workloads.

Pre-Aggregation

Common aggregates — DAU, hourly revenue, per-country breakdown — are pre-computed at ingestion time rather than at query time. Flink writes these aggregates into materialized rollup tables in the OLAP store. Dashboard queries read from rollup tables and return in milliseconds. Raw events are retained in a separate store for ad-hoc analyst queries that require arbitrary filtering or grouping not covered by rollups. The pre-aggregation strategy is defined per metric family and encoded in the Flink job configuration.
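
The rollup computation itself can be sketched as follows. Amounts are kept in integer cents to avoid float drift; the row shape is an illustrative stand-in for the OLAP rollup table schema, not a Druid or ClickHouse definition.

```python
# Ingestion-time pre-aggregation: fold raw transaction events into an
# hourly per-country revenue rollup before writing to the OLAP store.
from collections import defaultdict

def rollup_revenue(events):
    """events: dicts with ts (epoch secs), country, amount_cents -> rollup rows."""
    rollup = defaultdict(int)
    for e in events:
        hour = e["ts"] - e["ts"] % 3600            # truncate to the hour
        rollup[(hour, e["country"])] += e["amount_cents"]
    return [
        {"hour": h, "country": c, "revenue_cents": v}
        for (h, c), v in sorted(rollup.items())
    ]

events = [
    {"ts": 100, "country": "US", "amount_cents": 999},
    {"ts": 200, "country": "US", "amount_cents": 500},
    {"ts": 3700, "country": "DE", "amount_cents": 2000},
]
print(rollup_revenue(events))
```

Dashboard queries then scan these few rollup rows instead of every raw event in the hour.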

Query Engine

A query routing layer sits between dashboards and data stores. Dashboard queries for pre-defined metrics are routed to rollup tables in the OLAP store for maximum speed. Ad-hoc analyst queries with arbitrary dimensions hit the raw event store with result size limits enforced. The routing decision is based on time range and dimension cardinality: queries covering more than 90 days or involving high-cardinality dimensions without rollup coverage go to the raw store. A Redis cache with a 30-second TTL stores results for repeated dashboard queries, dramatically reducing OLAP query volume during peak traffic.
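
The routing rule can be sketched as a small predicate. The 90-day threshold comes from the text; the set of rollup-covered dimensions and the store names are assumptions.

```python
# Route a query to rollup tables or the raw event store, per the rules above.
ROLLUP_DIMENSIONS = {"country", "device", "event_type"}  # assumed rollup coverage
MAX_ROLLUP_DAYS = 90

def route_query(time_range_days: int, dimensions: set) -> str:
    if time_range_days > MAX_ROLLUP_DAYS:
        return "raw_store"                       # range exceeds rollup horizon
    if not dimensions <= ROLLUP_DIMENSIONS:
        return "raw_store"                       # a dimension lacks rollup coverage
    return "rollup_tables"

print(route_query(7, {"country"}))               # rollup_tables
print(route_query(120, {"country"}))             # raw_store
print(route_query(7, {"user_id"}))               # raw_store
```

A production router would also consult per-dimension cardinality statistics rather than a fixed allow-list, but the decision shape is the same.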

Dashboard Serving

An API server reads metrics from the OLAP store and caches results keyed by (dashboard_id, time_range, filters). Live dashboards receive updates via Server-Sent Events (SSE) or WebSockets: the server pushes a new metric snapshot whenever the underlying data changes. Alert rules are evaluated on each metric update; a threshold crossing triggers a notification through the configured channel (email, Slack, PagerDuty). Alert state is tracked to avoid re-notifying on every update — an alert fires once on transition from OK to FIRING and resolves once on transition back to OK.
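
The transition-only alerting behavior can be captured in a small state machine. Channel wiring is omitted; the `notifications` list stands in for email/Slack/PagerDuty dispatch.

```python
# Alert state tracking: notify only on OK -> FIRING and FIRING -> OK
# transitions, never on repeated updates in the same state.

class AlertRule:
    def __init__(self, threshold: float):
        self.threshold = threshold
        self.state = "OK"
        self.notifications = []          # stand-in for the notification channel

    def evaluate(self, value: float) -> None:
        new_state = "FIRING" if value > self.threshold else "OK"
        if new_state != self.state:      # notify only on transition
            self.notifications.append(new_state)
        self.state = new_state

rule = AlertRule(threshold=100.0)
for v in [50, 120, 130, 140, 80, 90]:    # stream of metric updates
    rule.evaluate(v)
print(rule.notifications)                # one fire, one resolve
```

Real systems usually add a "for" duration (the condition must hold for N evaluations before firing) to suppress flapping; this sketch fires on the first crossing.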

Frequently Asked Questions: Real-Time Analytics Platform Design

What are the tradeoffs between Lambda and Kappa architecture for real-time analytics?

Lambda architecture runs parallel batch and stream processing layers, giving you accurate historical recomputation at the cost of maintaining two separate codebases and reconciling results. Kappa architecture eliminates the batch layer entirely — all processing goes through a single stream pipeline (typically Kafka + Flink), which simplifies operations but requires your streaming logic to handle reprocessing from log replay when corrections are needed. Lambda is preferred when batch accuracy is non-negotiable and reprocessing latency is acceptable; Kappa wins when operational simplicity and low end-to-end latency matter more than occasional reprocessing overhead.

How does Apache Druid compare to ClickHouse for a real-time analytics platform?

Druid is purpose-built for sub-second OLAP on high-cardinality, time-series event data — it ingests directly from Kafka, pre-segments data by time, and uses bitmap indexes for fast dimension filtering. ClickHouse is a columnar SQL database optimized for batch-inserted analytical workloads and excels at complex aggregations over large historical datasets with its MergeTree engine and vectorized execution. For real-time dashboards where data freshness under one second matters (ad-tech, observability), Druid is typically the better fit. For retrospective analytics with complex SQL and large scans, ClickHouse usually wins on throughput and cost per query.

When should you use pre-aggregation versus query-time computation in a real-time analytics system?

Pre-aggregation (materialized rollups, Druid ingestion-time metrics, summary tables) trades storage and ingestion CPU for dramatically faster query response — ideal when query patterns are predictable, cardinality is bounded, and dashboards must serve thousands of concurrent users with p99 < 100ms. Query-time computation preserves full granularity and supports ad-hoc analysis but scales poorly under heavy concurrency. A practical hybrid: pre-aggregate for known dashboard KPIs along common dimensions; fall back to raw event scans for exploratory queries, protected by query timeouts and result-set limits.

What is the difference between Flink tumbling windows and sliding windows?

A tumbling window is a fixed-size, non-overlapping time bucket — for example, one window per minute with no overlap. Each event belongs to exactly one window. A sliding window has a fixed size but advances by a smaller slide interval — for example, a 5-minute window that advances every 1 minute, meaning each event belongs to multiple overlapping windows. Tumbling windows are cheaper (one state entry per window) and appropriate for independent aggregation periods. Sliding windows give smoother, more responsive metrics (useful for moving averages and rate-of-change signals) at the cost of higher state and computation proportional to window_size / slide_interval.
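
The difference in window membership can be computed directly: an event falls into exactly one tumbling window, but into window_size / slide_interval sliding windows. The assignment formula below mirrors the standard approach (Flink's sliding assigner works the same way); timestamps and sizes are in seconds.

```python
# Window membership: the list of window start times containing an event.

def tumbling_window(ts: int, size: int) -> list:
    """A tumbling window assigns each event to exactly one bucket."""
    return [ts - ts % size]

def sliding_windows(ts: int, size: int, slide: int) -> list:
    """A sliding window assigns each event to size/slide overlapping buckets."""
    last_start = ts - ts % slide                 # latest window containing ts
    return list(range(last_start - size + slide, last_start + 1, slide))

print(tumbling_window(400, size=60))             # [360]
print(sliding_windows(400, size=300, slide=60))  # [120, 180, 240, 300, 360]
```

The second call shows the state cost: with a 5-minute window sliding every minute, each event updates five window states instead of one.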

How should dashboard query caching with Redis TTL be designed for a real-time analytics platform?

Cache query results in Redis keyed by a hash of the normalized query parameters (metric name, dimensions, time range, granularity). Set TTL proportional to acceptable data staleness — typically 10–60 seconds for near-real-time dashboards. Use shorter TTLs for the most recent time windows (where data is still arriving) and longer TTLs for historical ranges that are fully settled. On a cache miss, execute the query against Druid or ClickHouse, write the result back with the appropriate TTL, and return it. Invalidate proactively on known data updates if your pipeline emits completion signals. Avoid caching unbounded or highly unique queries that will never be re-requested — add a query complexity gate before writing to cache.
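
The key derivation and staleness-based TTL can be sketched as below. The hashing scheme and the 15s/600s TTL values are illustrative choices, and a plain function stands in for the Redis client.

```python
# Stable cache key from normalized query parameters, plus a TTL that is
# short for still-arriving data and long for settled historical ranges.
import hashlib
import json
import time

def cache_key(metric, dimensions, start, end, granularity) -> str:
    payload = json.dumps(
        {"m": metric, "d": sorted(dimensions), "s": start, "e": end, "g": granularity},
        sort_keys=True,                          # normalize parameter order
    )
    return "qcache:" + hashlib.sha256(payload.encode()).hexdigest()

def ttl_for(end_ts: float, now: float = None, settled_after: int = 3600) -> int:
    """Short TTL while the queried range is still receiving events."""
    now = time.time() if now is None else now
    return 15 if now - end_ts < settled_after else 600

k = cache_key("dau", ["device", "country"], 0, 1000, "1h")
print(k[:14], ttl_for(end_ts=1000, now=1010))    # recent range -> short TTL
```

Because dimensions are sorted before hashing, `["country", "device"]` and `["device", "country"]` produce the same key, which is the point of normalizing before hashing.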

