Low Level Design: Metrics Aggregation Service

What Is a Metrics Aggregation Service?

A metrics aggregation service collects raw telemetry data (counters, gauges, histograms) emitted by application instances and infrastructure components, aggregates it over configurable time windows, and stores the results for querying and visualization. Think of it as the backbone behind dashboards in systems like Datadog, Prometheus, or InfluxDB. At scale, a single high-traffic service may emit millions of metric data points per second, so the aggregation layer must be both high-throughput and low-latency.

Data Model

The core entities are metrics (named measurements) and data points (timestamped values). Aggregated rollups are stored separately to avoid recomputing them on every read.

-- Raw ingestion table (time-series store or partitioned RDBMS)
CREATE TABLE metric_points (
    metric_name   VARCHAR(255)  NOT NULL,
    labels        JSONB,                         -- e.g. {"host": "web-01", "region": "us-east"}
    value         DOUBLE PRECISION NOT NULL,
    timestamp     BIGINT        NOT NULL,         -- epoch milliseconds
    PRIMARY KEY (metric_name, timestamp, labels)
) PARTITION BY RANGE (timestamp);

-- Pre-aggregated rollups
CREATE TABLE metric_rollups (
    metric_name   VARCHAR(255)  NOT NULL,
    labels        JSONB,
    window_start  BIGINT        NOT NULL,
    window_end    BIGINT        NOT NULL,
    resolution    VARCHAR(10)   NOT NULL,         -- '1m', '5m', '1h'
    count         BIGINT,
    sum           DOUBLE PRECISION,
    min           DOUBLE PRECISION,
    max           DOUBLE PRECISION,
    p99           DOUBLE PRECISION,
    PRIMARY KEY (metric_name, window_start, resolution, labels)
);
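The rollup schema stores count and sum rather than a precomputed average so that coarser resolutions can be derived by merging finer ones. A minimal Python sketch of that merge (the row format is hypothetical, mirroring the metric_rollups columns):

```python
def merge_rollups(rows):
    """Combine per-window rollup rows (e.g., several 1m windows) into one
    coarser aggregate. count/sum/min/max merge exactly; note that the stored
    p99 does NOT merge this way, which is why the pipeline keeps mergeable
    sketches like t-digest or HdrHistogram for percentiles."""
    merged = {"count": 0, "sum": 0.0, "min": float("inf"), "max": float("-inf")}
    for r in rows:
        merged["count"] += r["count"]
        merged["sum"] += r["sum"]
        merged["min"] = min(merged["min"], r["min"])
        merged["max"] = max(merged["max"], r["max"])
    return merged
```

The average over the merged window falls out as merged["sum"] / merged["count"], which is why sum and count are stored instead of an average that cannot be re-merged.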

Core Algorithm and Workflow

Ingestion uses a push model. Clients send metric payloads to a lightweight UDP or HTTP endpoint, keeping client-side overhead minimal. The pipeline has four stages:

  1. Receive: A fleet of stateless ingestion nodes accepts metric payloads and writes them to a partitioned message queue (e.g., a Kafka topic keyed by metric name).
  2. Buffer: A stream processor (Flink or a custom consumer) accumulates data points into in-memory hash maps keyed by (metric_name, label_set, window).
  3. Aggregate: At window close (e.g., every 60 seconds), the processor flushes counts, sums, min/max, and approximate percentiles (t-digest or HdrHistogram) into the rollup table.
  4. Serve: A query layer reads rollups at the appropriate resolution and merges partial windows if needed.
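The buffer and aggregate stages above can be sketched as a windowed in-memory accumulator. This is a simplified Python sketch, not a real Flink API; the class name and 60-second window are illustrative:

```python
from collections import defaultdict

WINDOW_MS = 60_000  # 60-second aggregation windows

class WindowAggregator:
    """Accumulates points keyed by (metric_name, label_set, window)."""

    def __init__(self):
        # key -> running aggregate {count, sum, min, max}
        self.buffers = defaultdict(lambda: {"count": 0, "sum": 0.0,
                                            "min": float("inf"),
                                            "max": float("-inf")})

    def add(self, metric, labels, value, ts_ms):
        window_start = ts_ms - (ts_ms % WINDOW_MS)   # align to window boundary
        key = (metric, tuple(sorted(labels.items())), window_start)
        agg = self.buffers[key]
        agg["count"] += 1
        agg["sum"] += value
        agg["min"] = min(agg["min"], value)
        agg["max"] = max(agg["max"], value)

    def flush(self, before_ms):
        """Emit and drop every window that closed before `before_ms`;
        in the real pipeline these rows are upserted into metric_rollups."""
        closed = {k: v for k, v in self.buffers.items()
                  if k[2] + WINDOW_MS <= before_ms}
        for k in closed:
            del self.buffers[k]
        return closed
```

Percentile sketches are omitted here for brevity; in practice each buffer entry would also carry a t-digest or HdrHistogram that is serialized at flush time.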

Failure Handling and Reliability

The biggest risks are data loss during ingestion and duplicate aggregation after a crash.

  • At-least-once delivery: Kafka retains raw points for 24-48 hours, allowing replay if a consumer crashes mid-window.
  • Idempotent writes: Rollup rows use upsert semantics (INSERT ... ON CONFLICT DO UPDATE), so replaying a window is safe.
  • Watermarking: The stream processor tracks the maximum observed timestamp per partition and delays window closure until stragglers arrive (configurable late-data tolerance, e.g., 30 seconds).
  • Dead-letter queue: Malformed or oversized payloads are routed to a separate DLQ for inspection rather than dropping them silently.
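The watermarking rule can be sketched as follows. The helper names and the 30-second tolerance are illustrative, matching the configurable late-data tolerance described above:

```python
WINDOW_MS = 60_000
LATE_TOLERANCE_MS = 30_000  # configurable late-data allowance

def update_watermark(watermarks, partition, ts_ms):
    """Track the maximum observed timestamp per partition."""
    watermarks[partition] = max(watermarks.get(partition, 0), ts_ms)

def closable_windows(watermarks, open_window_starts):
    """A window [start, start + WINDOW_MS) may close only once every
    partition's watermark has passed its end plus the late-data tolerance.
    The slowest partition bounds overall progress."""
    if not watermarks:
        return []
    global_wm = min(watermarks.values())
    return [w for w in open_window_starts
            if w + WINDOW_MS + LATE_TOLERANCE_MS <= global_wm]
```

Because flushed rollups are written with idempotent upserts, closing a window too early and re-closing it after a replay is safe, if wasteful; the tolerance is a latency/completeness trade-off, not a correctness requirement.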

Scalability Considerations

  • Horizontal sharding: Kafka is partitioned by metric name, so all data for a given metric lands on the same consumer, avoiding cross-shard merges during aggregation.
  • Cardinality control: High-cardinality label sets (e.g., per-request IDs) can explode memory. Enforce a label allowlist and cap the number of unique label combinations per metric.
  • Tiered storage: Raw points are expensive to store long-term. Automatically downsample to coarser resolutions (1m -> 5m -> 1h) and delete raw data after 7 days.
  • Read path caching: Popular metric queries are served from a Redis cache keyed by (metric_name, labels, window, resolution), refreshed on each rollup flush.
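Cardinality control at an ingestion node might look like the following sketch; the allowlist contents and the 1,000-series cap are example values, and a production system might route overflow series to a catch-all bucket instead of rejecting them:

```python
ALLOWED_LABELS = {"host", "region", "service"}  # example allowlist
MAX_SERIES_PER_METRIC = 1_000                   # example cardinality cap

def admit(series_index, metric, labels):
    """Drop disallowed labels (e.g., per-request IDs), then reject the point
    if admitting a new label combination would exceed the metric's cap.
    Returns the filtered label dict, or None if rejected."""
    filtered = {k: v for k, v in labels.items() if k in ALLOWED_LABELS}
    key = tuple(sorted(filtered.items()))
    seen = series_index.setdefault(metric, set())
    if key not in seen and len(seen) >= MAX_SERIES_PER_METRIC:
        return None  # over cardinality budget
    seen.add(key)
    return filtered
```

Filtering before keying matters: a per-request ID label would otherwise mint a new series for every request and blow past any cap immediately.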

Summary

A metrics aggregation service is an exercise in balancing write throughput against query latency. The key design choices are: (1) decouple ingestion from aggregation with a durable queue, (2) pre-aggregate at multiple resolutions rather than scanning raw data at query time, (3) use idempotent upserts to make the system replay-safe, and (4) control cardinality aggressively to keep memory bounded. These principles apply whether you are building from scratch or extending an open-source TSDB.


