Metrics Aggregation: Purpose and Scope
A metrics aggregation service collects numeric measurements from infrastructure and applications, stores them as time series, and provides a query engine for dashboards, alerts, and capacity planning. The core challenge is ingesting millions of data points per second from thousands of sources while keeping storage costs manageable and query latency low.
Metric Types
- Counter: Monotonically increasing value (requests served, bytes sent). Queried as rate-of-change. Never resets in theory — handle process restarts by detecting counter resets and adjusting.
- Gauge: Point-in-time snapshot (memory usage, active connections, temperature). Can go up or down.
- Histogram: Distribution of values (request latency) bucketed into configurable ranges. Enables percentile calculations (P50, P95, P99) from pre-aggregated buckets — more efficient than storing raw samples.
- Summary: Pre-calculated percentiles at the client. Less flexible than histograms for server-side aggregation but lower query cost.
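The counter-reset rule above can be sketched in code. This is a simplified version of what PromQL's rate() does internally (window-edge extrapolation is omitted); samples are assumed to be (timestamp, value) pairs in time order:

```python
def counter_increase(samples):
    """Total increase of a counter over a window of (timestamp, value)
    samples, compensating for resets (value drops on process restart)."""
    total = 0.0
    prev = None
    for _, value in samples:
        if prev is not None:
            if value < prev:          # reset detected: counter restarted near 0
                total += value        # count everything accumulated since restart
            else:
                total += value - prev
        prev = value
    return total

def per_second_rate(samples):
    """Per-second rate over the sample window, in the spirit of rate()."""
    if len(samples) < 2:
        return 0.0
    duration = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / duration
```

For samples (0, 100), (15, 160), (30, 20), (45, 80), the drop from 160 to 20 is treated as a restart, so the total increase is 60 + 20 + 60 = 140 rather than a negative delta.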
Collection Models: Pull vs Push
- Pull (Prometheus model): The metrics server periodically scrapes HTTP endpoints (/metrics) on each target. Scrape interval is typically 15–60 seconds. Advantages: the collector controls the scrape rate, and targets that disappear are detected at the next failed scrape. Disadvantages: targets must expose an HTTP server, and firewall traversal can be complex.
- Push (StatsD/OTLP model): Applications push metrics to an agent or gateway. UDP-based StatsD is fire-and-forget with minimal application overhead. OpenTelemetry OTLP over gRPC provides reliable delivery with backpressure. Advantage: works for short-lived jobs (lambdas, batch processes) that exit before Prometheus can scrape them.
Pre-Aggregation at Agent
Sending every individual data point to the central server is wasteful. Agents (Prometheus node_exporter, OpenTelemetry collector) pre-aggregate locally:
-- Ring buffer per metric (10-second flush window)
ring_buffer[metric_key] = {sum, count, min, max, last_timestamp}
-- On flush: emit one aggregated sample per metric
This reduces ingestion volume by orders of magnitude. For histograms, accumulate bucket counts locally and flush the entire histogram every scrape interval.
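The ring-buffer scheme above can be sketched as a small aggregator class. This is an illustrative sketch, not the actual node_exporter or OpenTelemetry collector implementation; the class name and flush semantics are assumptions:

```python
import time

class PreAggregator:
    """Agent-side pre-aggregation: fold raw data points into one
    {sum, count, min, max} record per metric per flush window."""
    def __init__(self):
        self.buffer = {}

    def record(self, metric_key, value, timestamp=None):
        ts = timestamp if timestamp is not None else time.time()
        agg = self.buffer.get(metric_key)
        if agg is None:
            self.buffer[metric_key] = {
                "sum": value, "count": 1,
                "min": value, "max": value, "last_timestamp": ts,
            }
        else:
            agg["sum"] += value
            agg["count"] += 1
            agg["min"] = min(agg["min"], value)
            agg["max"] = max(agg["max"], value)
            agg["last_timestamp"] = ts

    def flush(self):
        """Emit one aggregated sample per metric and reset the buffer."""
        out, self.buffer = self.buffer, {}
        return out
```

A flush every 10 seconds turns thousands of raw points per metric into a single record, which is where the order-of-magnitude volume reduction comes from.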
Ingestion Pipeline and Write-Ahead Log
Incoming samples flow through: agent → ingestion API → write-ahead log (WAL) → memory chunks → disk segments. The WAL ensures durability: if the process crashes mid-write, replay the WAL on restart. Prometheus uses a custom WAL with 128MB segments. Writes are sequential (fast on spinning disk and SSDs alike) and are compacted periodically into read-optimized blocks.
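A minimal WAL can be sketched as length-plus-checksum framed records with an fsync per append. This is a simplified model of the pattern, not Prometheus's actual WAL format (which uses 32KB pages inside 128MB segments); the framing layout here is an assumption:

```python
import os
import struct
import zlib

# Each record: 4-byte payload length, 4-byte CRC32, then the payload.
RECORD_HDR = struct.Struct("<II")

def wal_append(path, payload: bytes):
    """Append one framed record and fsync so the write survives a crash."""
    with open(path, "ab") as f:
        f.write(RECORD_HDR.pack(len(payload), zlib.crc32(payload)))
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())

def wal_replay(path):
    """Yield intact records in order; stop at the first truncated or
    corrupt record, discarding the torn tail left by a mid-write crash."""
    with open(path, "rb") as f:
        while True:
            hdr = f.read(RECORD_HDR.size)
            if len(hdr) < RECORD_HDR.size:
                return
            length, crc = RECORD_HDR.unpack(hdr)
            payload = f.read(length)
            if len(payload) < length or zlib.crc32(payload) != crc:
                return
            yield payload
```

On restart, replaying the WAL rebuilds the in-memory head chunks; the CRC check is what lets recovery distinguish a clean tail from a torn write.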
Time Series Storage: TSM
InfluxDB's Time-Structured Merge Tree (TSM) organizes data as columnar chunks per metric+label combination per time range. Each chunk stores timestamps and values compressed together (delta-of-delta encoding for timestamps, XOR compression for float values — Gorilla algorithm achieves ~1.37 bytes per sample for typical metrics).
chunk (
series_id BIGINT, -- hash of metric name + sorted labels
min_time TIMESTAMP,
max_time TIMESTAMP,
data BYTES -- compressed timestamp+value pairs
)
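The timestamp side of the compression can be sketched at the integer level. Regularly spaced scrapes make almost every delta-of-delta zero, which a bit-level codec (as in Gorilla) then stores in roughly one bit each; this sketch stops at the integer encoding and omits the bit packing:

```python
def encode_timestamps(timestamps):
    """Delta-of-delta encoding: store the first timestamp, the first
    delta, then the change between consecutive deltas (usually 0)."""
    if not timestamps:
        return []
    out = [timestamps[0]]
    prev, prev_delta = timestamps[0], None
    for ts in timestamps[1:]:
        delta = ts - prev
        out.append(delta if prev_delta is None else delta - prev_delta)
        prev, prev_delta = ts, delta
    return out

def decode_timestamps(encoded):
    """Inverse of encode_timestamps: re-accumulate deltas."""
    if not encoded:
        return []
    ts = [encoded[0]]
    if len(encoded) > 1:
        delta = encoded[1]
        ts.append(ts[-1] + delta)
        for dod in encoded[2:]:
            delta += dod
            ts.append(ts[-1] + delta)
    return ts
```

A 15-second scrape with one slightly late sample, [0, 15, 30, 45, 61], encodes to [0, 15, 0, 0, 1]: a run of zeros with a single small correction, which is what makes the ~1.37 bytes/sample figure achievable.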
Label Cardinality Management
Each unique combination of label values creates a new time series. High-cardinality labels like user_id, request_id, or ip_address on metrics explode the number of series (cardinality explosion), causing memory exhaustion and slow queries. Rules:
- Keep label values to bounded sets: status_code (200, 404, 500), region (us-east, eu-west), method (GET, POST).
- Reject metrics exceeding cardinality limits at ingestion time.
- For high-cardinality data, use logs or traces instead of metrics.
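The ingestion-time rejection rule can be sketched as a per-metric series limiter. This is an illustrative sketch (the class name, limit, and in-memory fingerprint set are assumptions; a production gateway would persist or shard this state):

```python
class CardinalityLimiter:
    """Ingestion-gateway guard: cap the number of unique label
    combinations (series) per metric name. Existing series always pass;
    new series beyond the limit are rejected."""
    def __init__(self, max_series_per_metric=10_000):
        self.max_series = max_series_per_metric
        self.seen = {}  # metric name -> set of label fingerprints

    def admit(self, metric_name, labels):
        # Sort labels so {"a": 1, "b": 2} and {"b": 2, "a": 1} match.
        fingerprint = tuple(sorted(labels.items()))
        series = self.seen.setdefault(metric_name, set())
        if fingerprint in series:
            return True
        if len(series) >= self.max_series:
            return False  # cardinality limit hit: drop the new series
        series.add(fingerprint)
        return True
```

A bad deploy that stamps request_id onto a metric hits the limit almost immediately, so the blast radius is capped at max_series_per_metric new series rather than millions.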
Downsampling and Retention Tiers
Raw data at 15-second resolution is expensive to store for a year. Downsampling aggregates older data into lower-resolution rollups:
- Raw (15s): retained for 7 days — high precision for recent debugging.
- 1-minute rollup: retained for 30 days — sufficient for weekly trend analysis.
- 1-hour rollup: retained for 1 year — adequate for capacity planning.
Each rollup stores five aggregates per interval: min, max, avg, sum, count. Rollup jobs run continuously, consuming raw data and writing to the rollup tier. Queries automatically use the coarsest available tier that satisfies the requested resolution.
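The rollup computation itself is a straightforward group-by on time buckets. A minimal sketch, assuming raw samples arrive as (timestamp, value) pairs with integer-second timestamps:

```python
from collections import defaultdict

def rollup(samples, interval=60):
    """Aggregate raw (timestamp, value) samples into fixed-width
    intervals, keeping min, max, avg, sum, count per bucket."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % interval].append(value)  # bucket start time
    return {
        start: {
            "min": min(vals), "max": max(vals),
            "sum": sum(vals), "count": len(vals),
            "avg": sum(vals) / len(vals),
        }
        for start, vals in buckets.items()
    }
```

Keeping sum and count (not just avg) matters: it lets a later 1-hour rollup be computed correctly from the 1-minute tier, since averages of averages are wrong when bucket counts differ.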
PromQL Query Engine
Key PromQL operations and their implementation patterns:
-- Rate: per-second increase of a counter over 5 minutes
rate(http_requests_total[5m])
-- Aggregation across label dimensions
sum by (service) (rate(http_requests_total[5m]))
-- Histogram quantile from buckets
histogram_quantile(0.99, sum by (le) (rate(request_duration_bucket[5m])))
The query engine fetches chunks for matching series, decompresses them, aligns timestamps, and applies the function. Range queries (for graphs) are expensive — recording rules pre-compute common expensive queries and store results as new time series, reducing query time from seconds to milliseconds.
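The histogram_quantile computation above reduces to rank lookup plus linear interpolation within a bucket. A simplified sketch (PromQL's handling of the +Inf bucket and of non-monotonic bucket counts differs; buckets here are assumed to be a sorted list of (upper_bound, cumulative_count)):

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets:
    find the bucket containing the target rank, then interpolate
    linearly between its lower and upper bounds."""
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, cum_count in buckets:
        if cum_count >= rank:
            in_bucket = cum_count - lower_count
            if in_bucket == 0:
                return upper_bound
            frac = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * frac
        lower_bound, lower_count = upper_bound, cum_count
    return buckets[-1][0]
```

Because interpolation assumes values are uniformly spread within a bucket, the quantile estimate is only as accurate as the bucket boundaries are well chosen around the latencies you care about.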
Multi-Datacenter Federation
Each datacenter runs an independent Prometheus instance scraping local targets. A global Prometheus federates by scraping pre-aggregated metrics from each regional instance. This avoids shipping all raw samples across WAN links while providing a global view for cross-region dashboards and SLO tracking.
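A federation job in the global instance might look like the following sketch. The /federate endpoint, honor_labels, and the match[] parameter are standard Prometheus federation configuration; the hostnames and the job:.* naming convention for pre-aggregated recording rules are placeholders:

```yaml
scrape_configs:
  - job_name: "federate"
    scrape_interval: 60s
    honor_labels: true          # keep the regional instance's labels
    metrics_path: "/federate"
    params:
      "match[]":
        - '{__name__=~"job:.*"}'   # only pull pre-aggregated recording rules
    static_configs:
      - targets:
          - "prometheus-us-east.example.com:9090"
          - "prometheus-eu-west.example.com:9090"
```

Restricting match[] to recording-rule series is what keeps the WAN traffic small: the global instance sees aggregates, not raw per-target samples.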
Trade-offs and Failure Modes
- Scrape gaps: If a scrape target is unreachable, gaps appear in the time series. Prometheus writes staleness markers (special NaN values) for series that stop reporting; alerting rules must account for gaps using absent() or a `for` duration tolerance.
- WAL corruption on crash: Partial WAL writes on power loss can corrupt the head block. Use checksums on WAL records and test recovery paths regularly.
- Rollup lag: If the rollup job falls behind, queries for recent data in the rollup tier return stale or missing results. Fall back to raw tier transparently when rollup is unavailable.
- Cardinality explosion from bad deploys: A new service that includes request_id in metric labels can add millions of series instantly. Enforce cardinality limits at the ingestion gateway and alert on rapid series growth.