Low Level Design: Metrics Collection and Monitoring System

Introduction

Metrics collection is the foundation of system observability. The four core metric types cover most monitoring needs: a counter is a monotonically increasing value (total requests served), a gauge is a current snapshot value (active connections, memory usage), a histogram records the distribution of a measured value across predefined buckets (request latency in ms ranges), and a summary computes configurable quantiles (p50, p95, p99) over a sliding time window.
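These four types can be sketched as minimal Python classes. This is illustrative only: real client libraries add label support, thread safety, and exposition, and a histogram's buckets are exposed cumulatively rather than per-bucket as here.

```python
import bisect

class Counter:
    """Monotonically increasing value; resets only on process restart."""
    def __init__(self):
        self.value = 0.0
    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters only go up")
        self.value += amount

class Gauge:
    """Snapshot value that can move up or down."""
    def __init__(self):
        self.value = 0.0
    def set(self, v):
        self.value = v

class Histogram:
    """Counts observations into predefined upper-bound buckets."""
    def __init__(self, buckets=(5, 10, 25, 50, 100, 250, 500)):  # ms bounds
        self.buckets = sorted(buckets)
        self.counts = [0] * (len(self.buckets) + 1)  # last slot = +Inf
        self.total = 0.0
        self.count = 0
    def observe(self, v):
        # bisect_left finds the first bucket with upper bound >= v
        self.counts[bisect.bisect_left(self.buckets, v)] += 1
        self.total += v
        self.count += 1

h = Histogram()
for latency_ms in (3, 7, 42, 480):
    h.observe(latency_ms)
```

A summary differs in that it computes quantiles client-side over recent observations instead of bucketing them.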

Pull vs Push Collection

In the pull model (used by Prometheus), a central scraper fetches the /metrics endpoint from each service instance on a schedule — typically every 15 seconds. Service discovery via Kubernetes pod annotations or Consul service catalog populates the target list dynamically, so new instances are scraped automatically. In the push model (used by StatsD and InfluxDB line protocol), services push metric values to an aggregator. Push is better suited for short-lived jobs such as batch scripts or Lambda functions that may complete before the next scrape cycle. Many systems combine both: Prometheus scrapes long-running services while a Pushgateway accepts metrics from short-lived jobs.
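The pull side can be illustrated by rendering the plain-text exposition format that a scraper fetches from /metrics. This is a simplified sketch: real exposition also includes # HELP and # TYPE metadata lines and escaping rules.

```python
def render_metrics(metrics):
    """Render samples in (simplified) Prometheus text exposition format.
    `metrics` maps a metric name to a list of (labels_dict, value) samples."""
    lines = []
    for name, samples in metrics.items():
        for labels, value in samples:
            if labels:
                # labels are sorted so output is deterministic
                label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
                lines.append(f"{name}{{{label_str}}} {value}")
            else:
                lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

body = render_metrics({
    "http_requests_total": [({"method": "GET", "status": "200"}, 1027)],
    "active_connections": [({}, 42)],
})
```

A service in the pull model serves this body over HTTP; in the push model it would instead send each sample to an aggregator as it occurs.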

Prometheus Architecture

Prometheus scrape jobs collect metrics from discovered targets according to scrape_interval and scrape_timeout settings. Each scraped sample is stored in the local TSDB as a tuple of (metric_name, label_set, timestamp, float64_value). PromQL provides aggregation functions including sum, rate (for counters), and histogram_quantile (for latency percentiles from histograms). For multi-cluster deployments, federation allows a higher-level Prometheus instance to scrape pre-aggregated metrics from lower-level instances. Thanos or Cortex extend Prometheus with global query across clusters and long-term retention beyond local disk capacity.
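As a sketch of what rate() does for counters (simplified: real PromQL also extrapolates to the window boundaries), the core logic is to sum per-step increases while compensating for counter resets:

```python
def rate(samples, window_seconds):
    """Approximate PromQL rate(): per-second increase of a counter over a
    window. `samples` is an ascending list of (timestamp, value). A value
    lower than its predecessor means the counter reset (process restart),
    so the new value itself is the increase since the reset."""
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    prev = samples[0][1]
    for _, v in samples[1:]:
        increase += v - prev if v >= prev else v
        prev = v
    return increase / window_seconds
```

This is why rate() must be applied to counters, not gauges: the monotonic-increase assumption is what makes reset detection possible.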

TSDB Storage

Prometheus TSDB accumulates incoming samples in a 2-hour in-memory block (the “head block”), then flushes it to disk as an immutable block of chunk files. Float64 values are compressed using Gorilla-style XOR encoding (timestamps use delta-of-delta encoding), which exploits the fact that consecutive samples from the same time series are often very similar. An inverted index maps metric names and label values to chunk locations for efficient query. During queries, the engine reads only the chunks that overlap the requested time range. Older blocks are compacted periodically to merge small blocks and drop deleted data; downsampling is not part of core Prometheus but is provided by Thanos. Remote write streams samples to a remote endpoint such as Thanos Receive or Cortex, which can in turn persist them to object storage (S3) for retention beyond the local disk window.
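The time-range pruning step can be shown with a small sketch (the chunk metadata shape here is illustrative, not the actual TSDB structure):

```python
def chunks_for_query(chunks, start, end):
    """Return only the chunks whose [min_time, max_time) range overlaps the
    requested query window; non-overlapping chunks are never read from disk.
    Each chunk is a (min_time, max_time, path) tuple."""
    return [c for c in chunks if c[0] < end and c[1] > start]

# three 2-hour blocks, times in seconds
chunks = [(0, 7200, "chunk-a"), (7200, 14400, "chunk-b"), (14400, 21600, "chunk-c")]
hits = chunks_for_query(chunks, start=7000, end=7300)
```

Combined with the inverted index (which narrows the candidate series first), this keeps query I/O proportional to the requested window rather than total retention.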

Recording Rules

Recording rules pre-compute expensive PromQL expressions and store results as new time series in the TSDB. For example, the rule record: job:http_requests:rate5m (following the level:metric:operations naming convention) evaluates rate(http_requests_total[5m]) per job label and writes a new metric. Dashboard panels query this pre-computed metric instead of running the expensive range query at render time. This dramatically reduces query latency for complex aggregations over high-cardinality label sets. Recording rules are evaluated on the same evaluation_interval as alert rules (typically 15 seconds) and stored as regular metrics with configurable retention.
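The mechanism can be sketched with a hypothetical in-memory series store (the store shape, the expression function, and the rule name are illustrative; the name follows the conventional level:metric:operations pattern):

```python
def evaluate_recording_rule(store, rule_name, expr_fn, now):
    """Evaluate an expression and append the result to the store as a new
    series, as a Prometheus recording rule does on each evaluation tick."""
    value = expr_fn(store, now)
    store.setdefault(rule_name, []).append((now, value))

# hypothetical raw data: a counter sampled at t=0 and t=300 (5 minutes)
store = {"http_requests_total": [(0, 0), (300, 1500)]}

def request_rate_5m(store, now):
    """Simplified rate(http_requests_total[5m]) without reset handling."""
    window = [s for s in store["http_requests_total"] if now - 300 <= s[0] <= now]
    return (window[-1][1] - window[0][1]) / 300

evaluate_recording_rule(store, "job:http_requests:rate5m", request_rate_5m, now=300)
# a dashboard now reads the cheap pre-computed series instead of re-running rate()
latest = store["job:http_requests:rate5m"][-1]
```

The tradeoff named above is visible here: the derived series costs storage, and its freshness lags by up to one evaluation interval.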

Alerting Pipeline

Prometheus evaluates alert rules on every evaluation_interval (typically 15 seconds). An alert rule specifies: IF a PromQL expression exceeds a threshold FOR a minimum duration (e.g., 5 minutes), THEN transition to FIRING state. This “for” duration prevents flapping on transient spikes. Firing alerts are sent to AlertManager, which deduplicates identical alerts from multiple Prometheus instances, groups related alerts (e.g., all alerts from a single cluster into one notification), and routes them to the appropriate receiver based on label matchers. Inhibition rules suppress downstream alerts when an upstream alert is already firing (e.g., suppress service alerts when the host is down). Silences mute matching alerts during planned maintenance windows.
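The pending-to-firing state machine driven by the “for” duration can be sketched as follows (state names mirror Prometheus; the evaluation loop itself is assumed external):

```python
INACTIVE, PENDING, FIRING = "inactive", "pending", "firing"

class AlertRule:
    """Minimal sketch of the 'for' duration: the condition must hold
    continuously for `for_seconds` before the alert transitions to firing.
    Any single false evaluation resets the clock, suppressing flapping."""
    def __init__(self, for_seconds):
        self.for_seconds = for_seconds
        self.state = INACTIVE
        self.active_since = None

    def evaluate(self, condition_true, now):
        if not condition_true:
            self.state, self.active_since = INACTIVE, None
        elif self.active_since is None:
            self.state, self.active_since = PENDING, now
        elif now - self.active_since >= self.for_seconds:
            self.state = FIRING
        return self.state
```

A transient spike that clears before the window elapses never leaves the pending state, so no notification is sent.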

On-Call Routing

AlertManager routes alerts based on severity and team labels attached to each alert. P1 (critical) alerts page the on-call engineer immediately via PagerDuty with a phone call escalation. P2 (warning) alerts post to the team’s Slack channel and create a ticket in the issue tracker for next-business-day review. If a P1 alert is not acknowledged within 15 minutes, PagerDuty’s escalation policy pages the secondary on-call and the engineering manager (AlertManager’s repeat_interval only controls how often an ongoing alert is re-sent; acknowledgement-based escalation belongs to the paging tool). On-call schedules are managed in PagerDuty using weekly rotation with a primary and backup engineer, ensuring 24/7 coverage without single points of failure.
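The routing policy above can be sketched as a simple lookup (receiver names and thresholds are hypothetical, mirroring the P1/P2 policy described):

```python
def route_alert(alert):
    """Pick receivers from the alert's severity label, as AlertManager's
    route tree does with label matchers."""
    severity = alert["labels"].get("severity")
    if severity == "critical":          # P1: page immediately
        return ["pagerduty-primary-oncall"]
    if severity == "warning":           # P2: async channels only
        return ["slack-team-channel", "issue-tracker"]
    return ["slack-team-channel"]       # default route

def escalate(ack_delay_seconds):
    """Paging-tool escalation policy: a P1 unacknowledged past 15 minutes
    pages the secondary on-call and the engineering manager."""
    if ack_delay_seconds > 15 * 60:
        return ["pagerduty-secondary-oncall", "engineering-manager"]
    return []
```

Keeping severity thresholds in labels (rather than in receiver code) lets the same rule set drive different teams’ routes.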

Frequently Asked Questions: Metrics Collection and Monitoring System Design

What are the tradeoffs between Prometheus pull and push models for metrics collection?

In Prometheus’s pull model, the server scrapes metrics from instrumented targets on a defined interval — this gives the server full control over collection rate, makes it easy to detect targets that have stopped responding (scrape failures are explicit via the up metric), and simplifies service discovery. The push model (used by Graphite, InfluxDB, and Prometheus’s Pushgateway) has targets send metrics to the collector, which works better for short-lived jobs and batch processes that may complete before the next scrape. Pull has drawbacks in environments with strict firewall rules where the collector cannot reach targets, or at very large scale, where a single server scraping thousands of targets can fall behind its scrape schedule. For long-running services in a controlled network, pull is operationally simpler. For ephemeral workloads, push (via Pushgateway or OTLP) is the right choice.

How does Gorilla XOR encoding compress floating-point time series data?

Gorilla (Facebook’s TSDB) encodes floats by XOR-ing each value against the previous value. Since consecutive metric samples are often similar (CPU at 45.2%, then 45.3%), XOR produces a result with many leading and trailing zeros. Gorilla stores only the meaningful bits: first it writes the count of leading zeros, then the length of the meaningful XOR bits, then the bits themselves. If the XOR result falls within the same leading/trailing zero boundaries as the previous XOR, it writes just the meaningful bits with a one-bit prefix. In practice Gorilla averages 1.37 bytes per data point (timestamp plus value) on typical monitoring metrics, versus 16 bytes uncompressed, roughly a 12x compression ratio with no precision loss and O(1) decode per sample.
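A minimal sketch of the XOR step (bookkeeping only; a real encoder emits a packed bitstream using the control-bit scheme described above):

```python
import struct

def float_bits(x):
    """Reinterpret a float64's IEEE 754 bit pattern as a 64-bit integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def xor_encode(values):
    """XOR each value's bit pattern with the previous one. Similar
    consecutive values share sign, exponent, and high mantissa bits, so the
    XOR has many leading zeros, leaving few 'meaningful' bits to store.
    Returns (leading_zeros, trailing_zeros, meaningful_bits) per sample."""
    prev = float_bits(values[0])
    out = []
    for v in values[1:]:
        bits = float_bits(v)
        x = bits ^ prev
        leading = 64 - x.bit_length() if x else 64
        trailing = (x & -x).bit_length() - 1 if x else 64
        meaningful = 64 - leading - trailing if x else 0
        out.append((leading, trailing, meaningful))
        prev = bits
    return out
```

An unchanged sample XORs to zero and costs a single control bit; a small change in the low mantissa costs only its handful of meaningful bits.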

When should you use Prometheus recording rules instead of query-time aggregation?

Recording rules pre-compute expensive PromQL expressions on an interval and store the result as a new time series. Use them when: a query fans out over thousands of series (e.g., summing request rates across all pods in a large cluster), the same aggregation powers multiple dashboards or alert rules, or query latency exceeds acceptable dashboard load times. Recording rules eliminate redundant computation and reduce query engine load at the cost of storage for the derived series and a staleness window equal to the evaluation interval. For simple queries over low-cardinality series, query-time aggregation is fine. As a rule of thumb: if a PromQL expression takes more than 2 seconds to execute, it’s a recording rule candidate.

How does AlertManager handle alert deduplication and grouping?

AlertManager receives alert notifications from Prometheus (or compatible sources), groups related alerts by configurable label sets (e.g., alertname + cluster + namespace), and sends a single grouped notification per group rather than one notification per alert. Deduplication works by fingerprinting each alert on its label set — identical fingerprints within the same group are merged. The group_wait parameter delays the first notification to batch alerts that fire together; group_interval controls how often subsequent notifications are sent for an active group; repeat_interval controls re-notification for ongoing alerts. Silences suppress matching alerts for a time window. Inhibition rules suppress lower-priority alerts when a higher-priority alert is already firing (e.g., suppress all service alerts when the entire datacenter is down).
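A sketch of the fingerprint-and-group step (the hash choice and data structures are illustrative; AlertManager uses its own label fingerprinting internally):

```python
import hashlib

def fingerprint(labels):
    """Deduplication key: a stable hash over the sorted label set, so
    alerts with identical labels collapse into one entry."""
    canonical = ";".join(f"{k}={labels[k]}" for k in sorted(labels))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

def group_alerts(alerts, group_by):
    """Bucket alerts by the configured grouping labels; one notification
    goes out per group, after deduplicating by fingerprint within it."""
    groups = {}
    for alert in alerts:
        key = tuple(alert["labels"].get(l, "") for l in group_by)
        bucket = groups.setdefault(key, {})
        bucket[fingerprint(alert["labels"])] = alert  # dedup inside the group
    return groups

alerts = [
    {"labels": {"alertname": "HighLatency", "cluster": "us-east", "pod": "a"}},
    {"labels": {"alertname": "HighLatency", "cluster": "us-east", "pod": "a"}},
    {"labels": {"alertname": "HighLatency", "cluster": "us-east", "pod": "b"}},
]
groups = group_alerts(alerts, ["alertname", "cluster"])
```

Here the duplicate pod-a alert is merged and both pods land in one group, so one notification is sent instead of three.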

How should on-call routing differ for P1 versus P2 incidents in a metrics monitoring system?

P1 (critical, customer-impacting): page the primary on-call immediately via phone/SMS through PagerDuty or Opsgenie; if not acknowledged within 5 minutes, escalate to secondary on-call; if not acknowledged within another 5 minutes, escalate to the on-call manager and open a war room channel automatically. P2 (degraded, not yet customer-impacting): send to Slack and email; page only if unacknowledged after 30 minutes. Key design points: AlertManager routes by severity label (severity=critical vs severity=warning); PagerDuty schedules and escalation policies enforce the SLA; alert deduplication in AlertManager ensures a flapping metric doesn’t create a paging storm. Runbook URLs should be embedded in alert annotations so the responder has immediate context.
