A metrics collection service captures numeric measurements over time, stores them efficiently, and powers alerting and dashboards. This post covers metric types, collection models, time series storage, aggregation, alerting, and the cardinality problem.
Metric Types
Four fundamental types cover nearly every monitoring use case:
- Counter — monotonically increasing value, reset only on restart. Measures cumulative events: requests served, bytes sent, errors. Rate-of-change queries (requests per second) are derived via the rate() function.
- Gauge — arbitrary value that can go up or down. Measures current state: memory used, active connections, queue depth.
- Histogram — samples observations into configurable buckets and tracks sum and count. Used to compute quantiles server-side (e.g., 95th percentile latency). Bucketing is defined at instrumentation time; changing buckets requires redeployment.
- Summary — pre-calculates quantiles on the client side (e.g., 0.5, 0.9, 0.99 quantiles). Quantiles are accurate but cannot be aggregated across instances — a significant limitation in horizontally scaled systems. Histograms are generally preferred for that reason.
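The four types can be sketched in plain Python. This is a minimal illustration, not a real client library (a production service would use something like prometheus_client); the class names and default buckets are illustrative:

```python
import bisect

class Counter:
    """Monotonically increasing total; resets only when the process restarts."""
    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount

class Gauge:
    """Point-in-time value that can move in either direction."""
    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value

    def inc(self, amount=1.0):
        self.value += amount

    def dec(self, amount=1.0):
        self.value -= amount

class Histogram:
    """Buckets observations by magnitude. Bucket counts are cumulative per
    bucket boundary, so they can be summed across instances before
    computing quantiles -- the property summaries lack."""
    def __init__(self, buckets=(0.005, 0.05, 0.5, 5.0)):
        self.upper_bounds = sorted(buckets)
        # One extra slot acts as the implicit +Inf bucket.
        self.bucket_counts = [0] * (len(self.upper_bounds) + 1)
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        i = bisect.bisect_left(self.upper_bounds, value)
        self.bucket_counts[i] += 1
        self.sum += value
        self.count += 1
```

Note that changing the Histogram's buckets means constructing a new object, which mirrors the redeployment requirement mentioned above.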
Push vs Pull Collection
Pull model (Prometheus) — the metrics server scrapes each target on a configurable interval (e.g., every 15 seconds). Targets expose an HTTP endpoint (/metrics) in the Prometheus text format. Advantages: the server controls the scrape rate; targets do not need to know the server address; scrape failures are immediately visible as a gap. Disadvantage: services behind NAT and short-lived workloads (batch jobs, lambdas) cannot be scraped reliably. The Pushgateway bridges this gap for batch jobs.
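A pull target is just an HTTP endpoint serving plain text. A minimal standard-library sketch (the metric name and counts are made up; a real exporter would render these from an instrumentation client):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# In-memory counters this toy exporter exposes (values are illustrative).
REQUESTS_TOTAL = {"GET": 42, "POST": 7}

def render_prometheus_text():
    """Render metrics in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests.",
        "# TYPE http_requests_total counter",
    ]
    for method, total in sorted(REQUESTS_TOTAL.items()):
        lines.append(f'http_requests_total{{method="{method}"}} {total}')
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_prometheus_text().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To run as a scrape target:
# HTTPServer(("0.0.0.0", 8000), MetricsHandler).serve_forever()
```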
Push model (StatsD, OpenTelemetry push exporter) — services emit metrics to a collector over UDP (StatsD) or gRPC (OTLP). Advantages: works for ephemeral workloads; no need to expose an HTTP port per service. Disadvantages: UDP is fire-and-forget (drops under load); the collector must handle bursty writes; back-pressure is harder to implement.
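The push side can be sketched as a fire-and-forget StatsD emitter. The host, port, and metric names below are illustrative assumptions; the line format itself follows StatsD's plain-text convention:

```python
import socket

def format_statsd(name, value, metric_type, sample_rate=1.0):
    """Render one metric in StatsD's line format:
    <name>:<value>|<type>[|@<sample_rate>], e.g. api.requests:1|c"""
    line = f"{name}:{value}|{metric_type}"
    if sample_rate < 1.0:
        line += f"|@{sample_rate}"
    return line

class StatsdClient:
    """Fire-and-forget UDP emitter: sendto never blocks or retries, which
    is exactly the drop-under-load trade-off described above."""
    def __init__(self, host="127.0.0.1", port=8125):
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def incr(self, name, value=1):
        self.sock.sendto(format_statsd(name, value, "c").encode(), self.addr)

    def gauge(self, name, value):
        self.sock.sendto(format_statsd(name, value, "g").encode(), self.addr)

    def timing(self, name, ms):
        self.sock.sendto(format_statsd(name, ms, "ms").encode(), self.addr)
```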
Time Series Storage
A time series is identified by a metric name plus a set of labels (key-value pairs). Each series is a sequence of (timestamp, value) pairs. Storage engines optimize for sequential writes and range reads on a single series:
- Prometheus TSDB — local, single-node. Samples are stored in chunks of up to 120 samples using Gorilla-style compression (delta-of-delta encoding for timestamps, XOR encoding for float values). Suitable for single-datacenter deployments; retention defaults to 15 days and is bounded by local disk.
- Thanos — extends Prometheus with long-term storage by shipping compacted blocks to object storage (S3). Queries fan out across multiple Prometheus instances via a sidecar.
- M3DB — distributed time series database designed for high write throughput. Used by Uber at large scale.
- InfluxDB — columnar TSM storage engine, supports both push (line protocol) and pull, has its own query languages (InfluxQL and Flux). Common in IoT and sensor telemetry, though the TSM engine's index degrades under very high series cardinality.
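The engines above differ in detail, but all exploit the fact that scrapes arrive at a near-constant interval. A sketch of delta-of-delta timestamp encoding, one half of the Gorilla-style scheme (XOR float encoding of values is the other; real implementations also bit-pack the output):

```python
def encode_timestamps(timestamps):
    """Delta-of-delta encoding: for a regular 15s scrape interval the
    stream collapses to [t0, 15, 0, 0, ...], which bit-packs tightly."""
    if not timestamps:
        return []
    out = [timestamps[0]]
    prev, prev_delta = timestamps[0], 0
    for t in timestamps[1:]:
        delta = t - prev
        out.append(delta - prev_delta)  # near zero when the interval is steady
        prev, prev_delta = t, delta
    return out

def decode_timestamps(encoded):
    """Invert the encoding by re-accumulating deltas."""
    if not encoded:
        return []
    out = [encoded[0]]
    delta = 0
    for dod in encoded[1:]:
        delta += dod
        out.append(out[-1] + delta)
    return out
```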
Aggregation and Downsampling
Raw high-resolution data is expensive to store indefinitely. Downsampling compresses old data by replacing raw samples with aggregated values:
- 0–7 days: raw samples at 5-second resolution
- 7–30 days: 1-minute aggregates (min, max, sum, count per window)
- 30 days+: 1-hour aggregates
The Thanos Compactor performs downsampling automatically; in Cortex, recording rules evaluated by the ruler pre-aggregate in a similar spirit. The trade-off is query fidelity: a 1-hour aggregate cannot reconstruct per-second spikes, so short-duration anomalies become invisible in long-range views.
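A single rollup pass can be sketched as follows; the window is in the same unit as the timestamps, and the aggregate field names are illustrative. Keeping sum and count (rather than a precomputed average) is what lets averages be re-derived correctly across windows:

```python
def downsample(samples, window):
    """Roll raw (timestamp, value) samples up into fixed windows,
    keeping min/max/sum/count so averages stay derivable later."""
    rollups = {}
    for ts, v in samples:
        bucket = ts - (ts % window)  # align to the start of the window
        agg = rollups.setdefault(
            bucket, {"min": v, "max": v, "sum": 0.0, "count": 0}
        )
        agg["min"] = min(agg["min"], v)
        agg["max"] = max(agg["max"], v)
        agg["sum"] += v
        agg["count"] += 1
    return dict(sorted(rollups.items()))
```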
Alerting Rule Engine
Prometheus alerting rules are PromQL expressions evaluated on a configurable interval. A rule fires when the expression returns a non-empty result set. The for clause requires the condition to persist for a duration before the alert transitions to FIRING — this suppresses transient spikes. Example: an error rate alert fires when rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.01 holds for more than 2 minutes. Alertmanager receives firing alerts and routes them to PagerDuty, Slack, or email, with deduplication and grouping.
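The pending-to-firing transition driven by the for clause can be modeled as a small state machine. The state names mirror Prometheus's; the rest is a simplified sketch (real rules track one state per label set in the result, not a single boolean):

```python
INACTIVE, PENDING, FIRING = "inactive", "pending", "firing"

class AlertRule:
    """Models the `for` clause: the condition must hold continuously
    for `for_duration` seconds before the alert fires."""
    def __init__(self, for_duration):
        self.for_duration = for_duration
        self.active_since = None
        self.state = INACTIVE

    def evaluate(self, condition_true, now):
        if not condition_true:
            # Any evaluation where the condition clears resets the timer.
            self.active_since = None
            self.state = INACTIVE
        elif self.active_since is None:
            self.active_since = now
            self.state = PENDING
        elif now - self.active_since >= self.for_duration:
            self.state = FIRING
        return self.state
```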
Cardinality Explosion
Every unique combination of label values creates a distinct time series. Adding a high-cardinality label — user_id, request_id, full URL path — can multiply the series count by millions. This blows up memory (Prometheus keeps the index in RAM), slows queries, and can crash the server. Mitigations: enforce label value allow-lists at the scrape layer, use recording rules to pre-aggregate high-cardinality dimensions, and monitor the series count per job with prometheus_tsdb_head_series.
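An allow-list scrub at the ingestion layer can be sketched as below; the allowed label set is an assumption for illustration, and the series key format mirrors the name-plus-sorted-labels identity described earlier:

```python
def scrub_labels(labels, allowed=frozenset({"method", "status", "handler"})):
    """Drop any label not on an explicit allow-list before ingestion.
    Unexpected labels (user_id, request_id, ...) are the usual cause
    of runaway series counts."""
    return {k: v for k, v in labels.items() if k in allowed}

def series_key(name, labels):
    """A time series identity is the metric name plus its sorted label set;
    every distinct key is a distinct series in the TSDB index."""
    pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{pairs}}}"
```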
Dashboard Integration
Grafana is the standard visualization layer. It connects to Prometheus (or Thanos / Cortex) as a data source and renders panels using PromQL queries. Dashboards are version-controlled as JSON. For a metrics service, standard dashboard panels include: per-service request rate and error rate (RED method), per-host CPU/memory/disk (USE method), histogram heatmaps for latency distributions, and alert state history. Grafana Alerting can also evaluate rules directly against data source queries, reducing the need to duplicate alert logic in Prometheus.
Frequently Asked Questions
What are the different metric types in a metrics service?
The four canonical metric types are: counters (monotonically increasing totals, e.g., requests served), gauges (point-in-time values that can go up or down, e.g., memory used), histograms (distribution of observed values bucketed by magnitude, e.g., request latency), and summaries (pre-computed quantiles on the client side). Histograms are preferred over summaries for server-side aggregation because their bucket counts can be added across instances; summaries cannot.
What is the difference between push and pull metric collection?
In a pull model the metrics server scrapes each instrumented service on a schedule, fetching the current snapshot of all metrics from a well-known HTTP endpoint. This centralizes discovery, gives the server control over collection rate, and makes it easy to detect when a target is down. In a push model services actively send metric batches to a collection endpoint (e.g., StatsD, InfluxDB line protocol). Push suits ephemeral jobs with short lifetimes that a scrape interval might miss, and environments where targets cannot expose inbound ports.
How does time series downsampling work for long-term retention?
As raw metric data ages, storing every scrape point becomes prohibitively expensive. Downsampling computes aggregate rollups — typically min, max, sum, count, and average — over fixed intervals (e.g., 5-minute, 1-hour) and writes these summaries to a cheaper storage tier while deleting or compressing the raw points. Query engines transparently select the finest-granularity rollup that covers the requested time range, so dashboards spanning months still render quickly at the cost of losing sub-interval detail for old data.
What is the cardinality explosion problem and how do you prevent it?
Cardinality explosion occurs when a metric is labeled with a high-cardinality dimension — such as user ID, request URL with dynamic path segments, or unbounded error messages — causing the number of unique time series to grow into the millions. Each series consumes memory in the TSDB index and storage on disk, degrading ingestion and query performance. Prevention strategies include reviewing label schemas before instrumentation, bounding or hashing high-cardinality values, using cardinality limits with hard rejection at the ingestion layer, and routing high-cardinality data to a log or tracing pipeline instead.