What Is a Real-Time Analytics Dashboard?
A real-time analytics dashboard displays live metrics and aggregations over streaming data: active users, revenue per minute, error rates, conversion funnels. Examples: Datadog dashboards, Stripe Radar, the Google Analytics real-time view. Core challenges: ingesting high-volume event streams, computing aggregations in near-real-time (seconds, not minutes), and serving many concurrent dashboard viewers efficiently.
System Requirements
Functional
- Ingest user events (page views, clicks, purchases, errors)
- Display real-time metrics: active users (last 5 min), events/second, revenue/hour
- Time-series charts with 1-minute granularity for the last 24 hours
- Filter by dimensions: country, device type, product category
- Anomaly highlighting: metric deviating more than 2 std devs from baseline
Non-Functional
- 1M events/second ingestion
- Dashboard refresh every 10 seconds
- Query latency <500ms for a 24-hour time-series query
Architecture
Events ──► Kafka ──► Flink (streaming aggregation)
                       │
           ┌───────────┴───────────┐
           ▼                       ▼
   Redis (hot data)        ClickHouse/Druid
   (last 24 hours)         (historical, dimensional)
           │                       │
           └───────────┬───────────┘
                       ▼
                 Query Service ──► Dashboard (WebSocket)
Event Ingestion
Client SDKs batch events (50ms batches) and POST to an ingestion service. The ingestion service validates, enriches (adds a server timestamp, geo from IP, device parsed from the User-Agent), and produces to Kafka. Partitioning Kafka by user_id ensures per-user event ordering. 1M events/sec at 500 bytes/event = 500 MB/sec into Kafka — roughly 50 partitions (~10 MB/sec each) across a 10-node cluster.
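A minimal sketch of the enrichment and partition-key logic, assuming 50 partitions; the geo and device lookups are stubbed, and all field names are illustrative:

```python
import hashlib
import time

# Stubs for the enrichment lookups. Production would use a GeoIP database
# and a real User-Agent parser; both are assumptions here.
def geo_lookup(ip: str) -> str:
    return "US"

def parse_device(user_agent: str) -> str:
    return "mobile" if "Mobile" in user_agent else "desktop"

def enrich(event: dict, client_ip: str, user_agent: str) -> dict:
    """Server-side enrichment applied before producing to Kafka."""
    event["server_ts_ms"] = int(time.time() * 1000)
    event["country"] = geo_lookup(client_ip)
    event["device"] = parse_device(user_agent)
    return event

def partition_for(user_id: str, num_partitions: int = 50) -> int:
    """Stable hash of user_id -> partition, so one user's events stay ordered."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % num_partitions
```

Hashing on user_id (rather than round-robin) is what gives the per-user ordering guarantee within a partition.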
Stream Processing with Flink
Flink jobs consume from Kafka and maintain windowed aggregations:
# Tumbling window: per-minute event counts, keyed by (event_type, country).
# CountAggregate and RedisSink stand in for custom AggregateFunction /
# SinkFunction implementations (Flink DataStream API style sketch).
stream
    .key_by(lambda e: (e.event_type, e.country))
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .aggregate(CountAggregate())
    .add_sink(RedisSink())
# Sliding window: unique active users over the last 5 minutes, emitted
# every minute. Key by a dimension (e.g. country), not by user_id; the
# aggregate folds each event's user_id into a HyperLogLog sketch, so each
# window emits one approximate unique count per key.
stream
    .key_by(lambda e: e.country)
    .window(SlidingEventTimeWindows.of(Time.minutes(5), Time.minutes(1)))
    .aggregate(UniqueCountAggregate())  # wraps a HyperLogLog sketch
    .add_sink(RedisSink())
Hot Data in Redis
Flink writes aggregated results to Redis every minute (or every 10 seconds for near-real-time metrics). Data structures:
- Active users: HyperLogLog per minute bucket (low memory, approximate unique count)
- Event counts: Redis hash keyed by (event_type, minute_bucket)
- Revenue: Redis sorted set by timestamp for time-series
Redis holds 24 hours of per-minute data. At 1440 minutes/day * 50 metric combinations = 72K keys. Each key ~100 bytes = 7 MB — trivial.
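A sketch of the write and read paths against these structures; the key naming scheme is an assumption, and `r` is any redis-py-compatible client:

```python
def record_minute(r, minute: int, event_type: str, user_id: str,
                  revenue_cents: int = 0) -> None:
    """What the Flink sink might write per event bucketed to a minute."""
    r.pfadd(f"active:{minute}", user_id)            # HyperLogLog per minute bucket
    r.hincrby(f"counts:{minute}", event_type, 1)    # hash of per-type counters
    if revenue_cents:
        r.zincrby("revenue", revenue_cents, str(minute))  # time-series by minute

def active_users_last_5(r, now_minute: int) -> int:
    """Union the last five per-minute HLLs; PFCOUNT merges them server-side."""
    keys = [f"active:{m}" for m in range(now_minute - 4, now_minute + 1)]
    return r.pfcount(*keys)
```

Because HyperLogLogs merge losslessly, the 5-minute active-user count is just a union over five 1-minute sketches, with no double counting of users active in multiple minutes.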
Historical Data in ClickHouse
For queries spanning days/weeks: events land in ClickHouse via Kafka consumer. ClickHouse uses a columnar engine with pre-aggregated materialized views. A query for “hourly revenue for the last 30 days” scans a pre-aggregated hourly rollup table rather than raw events. Query time: <500ms for a 30-day hourly rollup across millions of rows.
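The hot/cold split implies a small routing step in the query service. A sketch under the assumption that Redis retains exactly the last 24 hours; names are illustrative, and boundary de-duplication is left out:

```python
from dataclasses import dataclass
from typing import Optional

HOT_WINDOW_S = 24 * 3600  # Redis retains the last 24h of per-minute data

@dataclass
class QueryPlan:
    redis_range: Optional[tuple]       # (start, end) served from Redis
    clickhouse_range: Optional[tuple]  # (start, end) served from ClickHouse

def plan_query(start: int, end: int, now: int) -> QueryPlan:
    """Split a [start, end) epoch-seconds range across hot and cold stores."""
    hot_start = now - HOT_WINDOW_S
    if start >= hot_start:                      # fully inside the hot window
        return QueryPlan((start, end), None)
    if end <= hot_start:                        # fully historical
        return QueryPlan(None, (start, end))
    # Spans the boundary: recent part from Redis, older part from ClickHouse.
    return QueryPlan((hot_start, end), (start, hot_start))
```

The client sees one unified metrics API; only this planner knows which store serves which sub-range.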
Dashboard Serving
Dashboards use WebSocket (persistent connection). On connect: serve the last 24 hours of time-series from Redis (fast) and ClickHouse (for older data). Then push updates every 10 seconds: just the latest minute’s metrics from Redis. This keeps update payloads tiny (delta only). For 10K concurrent dashboard users: fan out via Redis pub/sub to connection servers.
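The connect-then-delta protocol can be sketched as two message builders; the message shapes are assumptions, not a fixed wire format:

```python
import json

def snapshot_message(series: dict) -> str:
    """Full 24-hour series (1,440 points per metric), sent once on connect."""
    return json.dumps({"type": "snapshot", "series": series})

def delta_message(minute: int, metrics: dict) -> str:
    """Only the latest minute's metrics, pushed every 10 seconds thereafter."""
    return json.dumps({"type": "delta", "minute": minute, "metrics": metrics})
```

A snapshot carries 1,440 points per metric; every subsequent push carries one, which is what keeps steady-state bandwidth per viewer tiny.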
Approximate vs Exact Counts
Counting distinct active users exactly requires storing all user IDs — O(N) memory. HyperLogLog approximates unique count in O(1) memory (12KB for any N) with ~2% error. For 1M events/sec with high cardinality, HyperLogLog is the standard choice. Display as “~1.2M active users” — dashboards are used for trends, not billing; approximation is acceptable.
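To make the leading-zeros intuition concrete, here is a minimal, illustrative HyperLogLog; this is not Redis's implementation (it uses 1,024 registers rather than Redis's 16,384, so error is ~3% instead of ~2%):

```python
import hashlib
import math

class HLL:
    """Minimal HyperLogLog sketch. m registers => std error ~1.04/sqrt(m)."""

    def __init__(self, b: int = 10):
        self.b = b
        self.m = 1 << b                      # 2^10 = 1024 registers
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item: str) -> None:
        h = int(hashlib.sha1(item.encode()).hexdigest(), 16)
        idx = h & (self.m - 1)               # low b bits pick a register
        rest = h >> self.b
        rank = 1                             # position of the first 1-bit in
        while rest & 1 == 0 and rank < 64:   # the remaining bits (HLL variant
            rank += 1                        # using trailing zeros)
            rest >>= 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self) -> float:
        # Harmonic mean of 2^register across all registers.
        est = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        if est <= 2.5 * self.m:              # small-range correction
            zeros = self.registers.count(0)
            if zeros:
                est = self.m * math.log(self.m / zeros)
        return est
```

Note that `add` is idempotent: re-adding a seen element can never raise a register's maximum, which is also why two sketches merge losslessly by taking per-register maxima.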
Interview Tips
- Lambda architecture: Flink for real-time, ClickHouse for historical — name both.
- HyperLogLog for approximate unique counts — 12KB vs gigabytes for exact.
- Redis for hot data (last 24h), columnar DB for cold (historical).
- WebSocket delta updates: push only the latest minute, not the full 24h on each refresh.
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "How does HyperLogLog enable counting millions of unique users with kilobytes of memory?",
      "acceptedAnswer": { "@type": "Answer", "text": "Counting distinct users exactly requires storing all user IDs — O(N) memory (80MB for 10M users). HyperLogLog (HLL) is a probabilistic data structure that approximates unique count in 12KB with ~2% error, regardless of N. Algorithm insight: for uniformly random hash values, the maximum number of leading zeros in the hashes of all seen elements indicates the cardinality. If the longest run of leading zeros is k, there are approximately 2^k distinct elements. Averaging across many registers (Redis's HLL uses 16,384) reduces variance to ~2%. Operations: HLL.add(element) in O(1), HLL.count() in O(1), HLL.merge(other_hll) in O(1). Use Redis PFADD and PFCOUNT natively. Trade-off: the 2% error means \"1,000,000 active users\" might display as \"980,000 to 1,020,000.\" For dashboards tracking trends, this is acceptable. For billing (charge per unique user), use exact counting. Redis HyperLogLog uses 12KB per key regardless of cardinality — you can track billions of unique events with trivial memory." }
    },
    {
      "@type": "Question",
      "name": "What is the difference between tumbling, sliding, and session windows in stream processing?",
      "acceptedAnswer": { "@type": "Answer", "text": "Window types determine how streaming data is grouped for aggregation. Tumbling window: fixed-size, non-overlapping. A 1-minute tumbling window groups events into [0:00-1:00), [1:00-2:00), etc. Each event belongs to exactly one window. Use for: per-minute metrics, hourly reports. Sliding window: fixed-size but overlapping, defined by window length and slide interval. A 5-minute window sliding every 1 minute: [0:00-5:00), [1:00-6:00), [2:00-7:00). Each event may belong to multiple windows. Use for: \"active users in the last 5 minutes\" (a common dashboard metric). More compute-intensive: each event is processed window_length/slide_interval times. Session window: groups events by user activity — a session starts on the first event and ends after a gap of N seconds with no events. Session length varies. Use for: user session analytics, funnel analysis. Flink and Spark Streaming support all three natively. For real-time dashboards: tumbling for discrete metrics (revenue this minute), sliding for rolling metrics (active users last N minutes)." }
    },
    {
      "@type": "Question",
      "name": "How do you design a query layer that serves both real-time and historical dashboard data?",
      "acceptedAnswer": { "@type": "Answer", "text": "The query layer must unify data from two stores with different latencies: Redis (hot, last 24h, sub-millisecond) and ClickHouse (cold, weeks/months, 100ms-500ms). Router logic: for a time range within the last 24 hours, serve from Redis. For a time range older than 24 hours, serve from ClickHouse. For queries spanning both (e.g., last 48 hours), split: fetch the recent portion from Redis, the historical portion from ClickHouse, then merge and de-duplicate on the boundary minute. Cache layer: ClickHouse query results are cached in Redis with a 5-minute TTL — dashboards refreshing every 10 seconds hit the cache after the first load. Pre-aggregation: ClickHouse materialized views compute hourly and daily rollups automatically. A 30-day daily rollup query scans 30 rows, not 43M raw events. The query layer exposes a unified API: GET /metrics?start=T1&end=T2&granularity=1m&dimensions=country,device. The routing to Redis vs ClickHouse is transparent to the client." }
    }
  ]
}