Real-Time Dashboard Low-Level Design: Metric Ingestion, Aggregation, and WebSocket Push

What Is a Real-Time Dashboard?

A real-time dashboard ingests a continuous stream of metric events, computes aggregations over sliding time windows, and pushes updated values to browser clients via WebSocket connections. Unlike polling-based dashboards, a push-based architecture ensures that all connected clients see consistent, low-latency updates without generating excessive repeated HTTP requests against the backend.

Requirements

Functional Requirements

  • Ingest metric events from producers via HTTP or a message queue.
  • Compute configurable aggregations (sum, average, max, count) over sliding time windows (e.g. last 1 minute, 5 minutes, 1 hour).
  • Push updated aggregates to subscribed WebSocket clients whenever a new computation cycle completes.
  • Support configurable widget definitions: each widget specifies a metric name, aggregation type, and time window.
  • Allow users to create, update, and delete dashboard layouts and widget configurations.

Non-Functional Requirements

  • Support 10,000 concurrent WebSocket connections per dashboard cluster.
  • Metric updates must reach connected clients within 2 seconds of the originating event.
  • The ingestion path must handle 100,000 metric events per second.
  • Aggregation state must survive a single backend node failure without data loss.

Data Model

Widget

  • widget_id (UUID)
  • dashboard_id (foreign key)
  • metric_name (string, e.g. "api.request.latency")
  • aggregation (ENUM: sum, avg, max, min, count)
  • window_seconds (integer: sliding window duration)
  • refresh_interval_seconds (how often to push updates to clients)
  • display_config (JSONB: chart type, axis labels, color scheme)

MetricEvent

  • metric_name (string)
  • value (float)
  • tags (JSONB: key-value label map for filtering)
  • event_time (timestamp, producer-supplied)

Raw events are not stored long-term in a relational database. They are held in the sliding window buffer (in-memory or Redis sorted set) and discarded after the longest configured window expires.
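As a sketch, the two records above can be written as Python dataclasses. Field names follow the model; concrete types (UUIDs, a float epoch timestamp for event_time) are assumptions:

```python
from dataclasses import dataclass, field
from enum import Enum
from uuid import UUID

class Aggregation(str, Enum):
    SUM = "sum"
    AVG = "avg"
    MAX = "max"
    MIN = "min"
    COUNT = "count"

@dataclass
class Widget:
    widget_id: UUID
    dashboard_id: UUID
    metric_name: str               # e.g. "api.request.latency"
    aggregation: Aggregation
    window_seconds: int            # sliding window duration
    refresh_interval_seconds: int  # how often to push updates to clients
    display_config: dict           # chart type, axis labels, color scheme

@dataclass
class MetricEvent:
    metric_name: str
    value: float
    event_time: float              # producer-supplied epoch timestamp
    tags: dict = field(default_factory=dict)  # label map for filtering
```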

Core Algorithms

Sliding Window Aggregation

For each (metric_name, window_seconds) pair that any widget requires, the aggregation service maintains a sorted set in Redis keyed by event_time; a single sorted set serves every aggregation type over that window. On each ingestion:

  • Add the event to the sorted set with score = event_time in milliseconds. Sorted-set members must be unique, so encode the member as the value plus a unique suffix (e.g. a sequence number) to prevent identical values from distinct events from colliding.
  • Trim entries with score less than (now - window_seconds * 1000) via ZREMRANGEBYSCORE to evict expired data.
  • On the refresh cycle for each widget, compute the aggregate over the remaining members: ZRANGEBYSCORE from (now - window_seconds * 1000) to now, parse the values out of the returned members, and apply the aggregation function.

For sum and count, the O(n) ZRANGEBYSCORE scan can be avoided entirely: maintain a running total in a separate key, increment it on ingestion, and decrement it by each value evicted during trimming. This reduces per-refresh computation from O(n) to O(1) for those two aggregations.
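A minimal in-memory sketch of this window logic, standing in for the Redis sorted set: the trim mirrors ZREMRANGEBYSCORE, the full scan mirrors ZRANGEBYSCORE, and the running total gives O(1) sum and count reads as described above.

```python
import bisect

class SlidingWindow:
    """In-memory stand-in for the Redis sorted set: (event_time_ms, value)
    pairs ordered by time, plus a running sum maintained incrementally so
    sum and count reads are O(1)."""

    def __init__(self, window_seconds: int):
        self.window_ms = window_seconds * 1000
        self.entries = []        # sorted list of (time_ms, value)
        self.running_sum = 0.0   # incremented on add, decremented on evict

    def add(self, event_time_ms: int, value: float) -> None:
        bisect.insort(self.entries, (event_time_ms, value))
        self.running_sum += value

    def trim(self, now_ms: int) -> None:
        # Evict entries older than the window (Redis: ZREMRANGEBYSCORE)
        cutoff = now_ms - self.window_ms
        while self.entries and self.entries[0][0] < cutoff:
            _, value = self.entries.pop(0)
            self.running_sum -= value

    def aggregate(self, kind: str, now_ms: int) -> float:
        self.trim(now_ms)
        if kind == "sum":
            return self.running_sum           # O(1)
        if kind == "count":
            return float(len(self.entries))   # O(1)
        values = [v for _, v in self.entries] # O(n) scan (Redis: ZRANGEBYSCORE)
        if not values:
            return 0.0
        if kind == "avg":
            return sum(values) / len(values)
        if kind == "max":
            return max(values)
        if kind == "min":
            return min(values)
        raise ValueError(f"unknown aggregation: {kind}")
```

For example, with a 60-second window, an event at t=1s is evicted once "now" passes 61.5s, and sum/count reflect only the surviving entries.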

WebSocket Push

Each WebSocket server node maintains a subscription map: metric_key to list of connection IDs. When an aggregation result is computed for a metric_key, the aggregation service publishes it to a Redis Pub/Sub channel named after the metric_key. All WebSocket server nodes subscribed to that channel receive the result and push it to their locally connected clients whose dashboard widgets subscribe to that metric. This fan-out pattern decouples aggregation workers from connection management.
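The per-node subscription map and fan-out step can be sketched like this; the outbox dict stands in for an actual socket send, and the class and method names are illustrative:

```python
from collections import defaultdict

class SubscriptionRouter:
    """Per-node subscription map: metric_key -> set of connection IDs.
    When a Pub/Sub message arrives for a metric_key, fan out only to
    the locally connected clients subscribed to that metric."""

    def __init__(self):
        self.subs = defaultdict(set)     # metric_key -> {conn_id}
        self.outbox = defaultdict(list)  # conn_id -> pending messages

    def subscribe(self, conn_id: str, metric_key: str) -> None:
        self.subs[metric_key].add(conn_id)

    def unsubscribe_all(self, conn_id: str) -> None:
        # Called on WebSocket disconnect
        for conns in self.subs.values():
            conns.discard(conn_id)

    def on_pubsub_message(self, metric_key: str, payload: dict) -> None:
        # Invoked when the Redis Pub/Sub channel for metric_key fires;
        # in production this would write to each client's socket.
        for conn_id in self.subs.get(metric_key, ()):
            self.outbox[conn_id].append(payload)
```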

Scalability

The ingestion path is horizontally scalable: events are accepted by stateless HTTP ingestion nodes and written to a Kafka topic partitioned by metric_name. Aggregation workers consume from Kafka, one worker per partition, maintaining per-partition window state in Redis. Adding Kafka partitions and aggregation workers scales ingestion throughput linearly.
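A stable partitioner keyed on metric_name might look like the following sketch. Kafka's default key partitioner uses murmur2; md5 here is a stand-in with the same stable-hash property, ensuring every event for a given metric lands on one partition so exactly one aggregation worker owns that metric's window state:

```python
import hashlib

def partition_for(metric_name: str, num_partitions: int) -> int:
    """Map a metric name to a Kafka partition deterministically, so all
    events for one metric are consumed by the same aggregation worker."""
    digest = hashlib.md5(metric_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```

Note that changing num_partitions remaps metrics to different partitions, so window state should be rebuilt (or allowed to refill) after a repartition.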

WebSocket connections are pinned to individual server nodes by a layer-7 load balancer using a client ID cookie. Redis Pub/Sub ensures that metric updates are delivered to the correct node regardless of which aggregation worker computed them. For very large fan-outs (one metric subscribed to by thousands of dashboards), the Pub/Sub broadcast is augmented with a client-side subscription filter so each WebSocket node only delivers updates to clients with a matching widget.

API Design

  • POST /ingest — accept a batch of MetricEvent objects; acknowledge after writing to Kafka.
  • POST /dashboards — create a dashboard with an initial set of widget definitions.
  • PATCH /dashboards/{dashboard_id}/widgets — add, update, or remove widget configurations.
  • GET /dashboards/{dashboard_id}/widgets/{widget_id}/snapshot — return the current aggregated value on demand (for initial page load before WebSocket connects).
  • WS /stream/{dashboard_id} — WebSocket endpoint; server pushes metric update messages for all widgets on the dashboard at their configured refresh intervals.
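A push message on the WS endpoint might be built as follows; the field names are illustrative assumptions, not a fixed wire format:

```python
import json
import time

def build_update_message(widget_id: str, metric_name: str, value: float) -> str:
    """Serialize one metric update for a subscribed widget.
    Field names are hypothetical, not a documented schema."""
    return json.dumps({
        "type": "metric_update",
        "widget_id": widget_id,
        "metric_name": metric_name,
        "value": value,
        "computed_at": int(time.time() * 1000),  # server-side epoch ms
    })
```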

Failure Modes

  • Redis unavailable: Aggregation falls back to in-memory state on the worker. State from other workers is not accessible, so aggregates may be incomplete. Alert and attempt Redis reconnection; do not serve stale values older than 2x the refresh interval.
  • WebSocket node crash: Clients reconnect via the load balancer to a surviving node. The node sends a snapshot of current values on connection establishment so clients see accurate data immediately rather than waiting for the next push cycle.
  • Kafka consumer lag: If aggregation workers fall behind, the sliding window state reflects a delayed view of reality. Expose consumer lag as a metric and alert when lag exceeds the shortest configured window duration, which would mean aggregates are based on incomplete data.
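The staleness guard from the Redis-unavailable case above can be sketched as a small predicate (a hypothetical helper, not part of the stated API):

```python
def is_servable(computed_at_ms: int, refresh_interval_seconds: int,
                now_ms: int) -> bool:
    """Refuse to serve an aggregate older than 2x the widget's refresh
    interval, per the Redis-unavailable fallback policy."""
    return now_ms - computed_at_ms <= 2 * refresh_interval_seconds * 1000
```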

Observability

Track metric ingestion rate, Kafka consumer lag per partition, aggregation computation latency, WebSocket push latency, active connection count, and subscription fan-out ratio. Alert when 95th-percentile end-to-end metric-to-client latency exceeds the 2-second SLA.

