Low Level Design: Metrics Collection and Monitoring System

Introduction

Metrics collection is the foundation of system observability. The four core metric types cover most monitoring needs: a counter is a monotonically increasing value (total requests served), a gauge is a current snapshot value (active connections, memory usage), a histogram records the distribution of a measured value across predefined buckets (request latency in ms ranges), and a summary computes configurable quantiles (p50, p95, p99) over a sliding time window.
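These four types can be sketched as minimal Python classes. This is illustrative only: real client libraries add label support, thread safety, and exposition, and a histogram's buckets are exposed cumulatively rather than per-bucket as here.

```python
import bisect

class Counter:
    """Monotonically increasing value; resets only on process restart."""
    def __init__(self):
        self.value = 0.0
    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters only go up")
        self.value += amount

class Gauge:
    """Snapshot value that can move up or down."""
    def __init__(self):
        self.value = 0.0
    def set(self, v):
        self.value = v

class Histogram:
    """Counts observations into predefined upper-bound buckets."""
    def __init__(self, buckets=(5, 10, 25, 50, 100, 250, 500)):  # ms bounds
        self.buckets = sorted(buckets)
        self.counts = [0] * (len(self.buckets) + 1)  # last slot = +Inf
        self.total = 0.0
        self.count = 0
    def observe(self, v):
        # bisect_left finds the first bucket with upper bound >= v
        self.counts[bisect.bisect_left(self.buckets, v)] += 1
        self.total += v
        self.count += 1

h = Histogram()
for latency_ms in (3, 7, 42, 480):
    h.observe(latency_ms)
```

A summary differs in that it computes quantiles client-side over recent observations instead of bucketing them.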

Pull vs Push Collection

In the pull model (used by Prometheus), a central scraper fetches the /metrics endpoint from each service instance on a schedule — typically every 15 seconds. Service discovery via Kubernetes pod annotations or Consul service catalog populates the target list dynamically, so new instances are scraped automatically. In the push model (used by StatsD and InfluxDB line protocol), services push metric values to an aggregator. Push is better suited for short-lived jobs such as batch scripts or Lambda functions that may complete before the next scrape cycle. Many systems combine both: Prometheus scrapes long-running services while a Pushgateway accepts metrics from short-lived jobs.
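The pull side can be illustrated by rendering the plain-text exposition format that a scraper fetches from /metrics. This is a simplified sketch: real exposition also includes # HELP and # TYPE metadata lines and escaping rules.

```python
def render_metrics(metrics):
    """Render samples in (simplified) Prometheus text exposition format.
    `metrics` maps a metric name to a list of (labels_dict, value) samples."""
    lines = []
    for name, samples in metrics.items():
        for labels, value in samples:
            if labels:
                # labels are sorted so output is deterministic
                label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
                lines.append(f"{name}{{{label_str}}} {value}")
            else:
                lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

body = render_metrics({
    "http_requests_total": [({"method": "GET", "status": "200"}, 1027)],
    "active_connections": [({}, 42)],
})
```

A service in the pull model serves this body over HTTP; in the push model it would instead send each sample to an aggregator as it occurs.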

Prometheus Architecture

Prometheus scrape jobs collect metrics from discovered targets according to scrape_interval and scrape_timeout settings. Each scraped sample is stored in the local TSDB as a tuple of (metric_name, label_set, timestamp, float64_value). PromQL provides aggregation functions including sum, rate (for counters), and histogram_quantile (for latency percentiles from histograms). For multi-cluster deployments, federation allows a higher-level Prometheus instance to scrape pre-aggregated metrics from lower-level instances. Thanos or Cortex extend Prometheus with global query across clusters and long-term retention beyond local disk capacity.
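As a sketch of what rate() does for counters (simplified: real PromQL also extrapolates to the window boundaries), the core logic is to sum per-step increases while compensating for counter resets:

```python
def rate(samples, window_seconds):
    """Approximate PromQL rate(): per-second increase of a counter over a
    window. `samples` is an ascending list of (timestamp, value). A value
    lower than its predecessor means the counter reset (process restart),
    so the new value itself is the increase since the reset."""
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    prev = samples[0][1]
    for _, v in samples[1:]:
        increase += v - prev if v >= prev else v
        prev = v
    return increase / window_seconds
```

This is why rate() must be applied to counters, not gauges: the monotonic-increase assumption is what makes reset detection possible.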

TSDB Storage

Prometheus TSDB accumulates incoming samples in a 2-hour in-memory block (the “head block”), then flushes it to disk as an immutable block of chunk files. Float64 values are compressed using Gorilla-style XOR encoding (timestamps use delta-of-delta encoding), which exploits the fact that consecutive samples from the same time series are often very similar. An inverted index maps metric names and label values to chunk locations for efficient query. During queries, the engine reads only the chunks that overlap the requested time range. Older blocks are compacted periodically to merge small blocks and drop deleted data; downsampling is not part of core Prometheus but is provided by Thanos. Remote write streams samples to a remote endpoint such as Thanos Receive or Cortex, which can in turn persist them to object storage (S3) for retention beyond the local disk window.
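The time-range pruning step can be shown with a small sketch (the chunk metadata shape here is illustrative, not the actual TSDB structure):

```python
def chunks_for_query(chunks, start, end):
    """Return only the chunks whose [min_time, max_time) range overlaps the
    requested query window; non-overlapping chunks are never read from disk.
    Each chunk is a (min_time, max_time, path) tuple."""
    return [c for c in chunks if c[0] < end and c[1] > start]

# three 2-hour blocks, times in seconds
chunks = [(0, 7200, "chunk-a"), (7200, 14400, "chunk-b"), (14400, 21600, "chunk-c")]
hits = chunks_for_query(chunks, start=7000, end=7300)
```

Combined with the inverted index (which narrows the candidate series first), this keeps query I/O proportional to the requested window rather than total retention.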

Recording Rules

Recording rules pre-compute expensive PromQL expressions and store results as new time series in the TSDB. For example, the rule record: job:http_requests:rate5m (following the level:metric:operations naming convention) evaluates rate(http_requests_total[5m]) per job label and writes a new metric. Dashboard panels query this pre-computed metric instead of running the expensive range query at render time. This dramatically reduces query latency for complex aggregations over high-cardinality label sets. Recording rules are evaluated on the same evaluation_interval as alert rules (typically 15 seconds) and stored as regular metrics with configurable retention.
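The mechanism can be sketched with a hypothetical in-memory series store (the store shape, the expression function, and the rule name are illustrative; the name follows the conventional level:metric:operations pattern):

```python
def evaluate_recording_rule(store, rule_name, expr_fn, now):
    """Evaluate an expression and append the result to the store as a new
    series, as a Prometheus recording rule does on each evaluation tick."""
    value = expr_fn(store, now)
    store.setdefault(rule_name, []).append((now, value))

# hypothetical raw data: a counter sampled at t=0 and t=300 (5 minutes)
store = {"http_requests_total": [(0, 0), (300, 1500)]}

def request_rate_5m(store, now):
    """Simplified rate(http_requests_total[5m]) without reset handling."""
    window = [s for s in store["http_requests_total"] if now - 300 <= s[0] <= now]
    return (window[-1][1] - window[0][1]) / 300

evaluate_recording_rule(store, "job:http_requests:rate5m", request_rate_5m, now=300)
# a dashboard now reads the cheap pre-computed series instead of re-running rate()
latest = store["job:http_requests:rate5m"][-1]
```

The tradeoff named above is visible here: the derived series costs storage, and its freshness lags by up to one evaluation interval.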

Alerting Pipeline

Prometheus evaluates alert rules on every evaluation_interval (typically 15 seconds). An alert rule specifies: IF a PromQL expression exceeds a threshold FOR a minimum duration (e.g., 5 minutes), THEN transition to FIRING state. This “for” duration prevents flapping on transient spikes. Firing alerts are sent to AlertManager, which deduplicates identical alerts from multiple Prometheus instances, groups related alerts (e.g., all alerts from a single cluster into one notification), and routes them to the appropriate receiver based on label matchers. Inhibition rules suppress downstream alerts when an upstream alert is already firing (e.g., suppress service alerts when the host is down). Silences mute matching alerts during planned maintenance windows.
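The pending-to-firing state machine driven by the “for” duration can be sketched as follows (state names mirror Prometheus; the evaluation loop itself is assumed external):

```python
INACTIVE, PENDING, FIRING = "inactive", "pending", "firing"

class AlertRule:
    """Minimal sketch of the 'for' duration: the condition must hold
    continuously for `for_seconds` before the alert transitions to firing.
    Any single false evaluation resets the clock, suppressing flapping."""
    def __init__(self, for_seconds):
        self.for_seconds = for_seconds
        self.state = INACTIVE
        self.active_since = None

    def evaluate(self, condition_true, now):
        if not condition_true:
            self.state, self.active_since = INACTIVE, None
        elif self.active_since is None:
            self.state, self.active_since = PENDING, now
        elif now - self.active_since >= self.for_seconds:
            self.state = FIRING
        return self.state
```

A transient spike that clears before the window elapses never leaves the pending state, so no notification is sent.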

On-Call Routing

AlertManager routes alerts based on severity and team labels attached to each alert. P1 (critical) alerts page the on-call engineer immediately via PagerDuty with a phone call escalation. P2 (warning) alerts post to the team’s Slack channel and create a ticket in the issue tracker for next-business-day review. If a P1 alert is not acknowledged within 15 minutes, PagerDuty’s escalation policy pages the secondary on-call and the engineering manager (AlertManager’s repeat_interval only controls how often an ongoing alert is re-sent; acknowledgement-based escalation belongs to the paging tool). On-call schedules are managed in PagerDuty using weekly rotation with a primary and backup engineer, ensuring 24/7 coverage without single points of failure.
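The routing policy above can be sketched as a simple lookup (receiver names and thresholds are hypothetical, mirroring the P1/P2 policy described):

```python
def route_alert(alert):
    """Pick receivers from the alert's severity label, as AlertManager's
    route tree does with label matchers."""
    severity = alert["labels"].get("severity")
    if severity == "critical":          # P1: page immediately
        return ["pagerduty-primary-oncall"]
    if severity == "warning":           # P2: async channels only
        return ["slack-team-channel", "issue-tracker"]
    return ["slack-team-channel"]       # default route

def escalate(ack_delay_seconds):
    """Paging-tool escalation policy: a P1 unacknowledged past 15 minutes
    pages the secondary on-call and the engineering manager."""
    if ack_delay_seconds > 15 * 60:
        return ["pagerduty-secondary-oncall", "engineering-manager"]
    return []
```

Keeping severity thresholds in labels (rather than in receiver code) lets the same rule set drive different teams’ routes.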

Frequently Asked Questions: Metrics Collection and Monitoring System Design

What are the tradeoffs between Prometheus pull and push models for metrics collection?

In Prometheus’s pull model, the server scrapes metrics from instrumented targets on a defined interval — this gives the server full control over collection rate, makes it easy to detect targets that have stopped responding (scrape failures are explicit via the up metric), and simplifies service discovery. The push model (used by Graphite, InfluxDB, and Prometheus’s Pushgateway) has targets send metrics to the collector, which works better for short-lived jobs and batch processes that may complete before the next scrape. Pull has drawbacks in environments with strict firewall rules where the collector cannot reach targets, or at very large scale, where a single server scraping thousands of targets can fall behind its scrape schedule. For long-running services in a controlled network, pull is operationally simpler. For ephemeral workloads, push (via Pushgateway or OTLP) is the right choice.

How does Gorilla XOR encoding compress floating-point time series data?

Gorilla (Facebook’s TSDB) encodes floats by XOR-ing each value against the previous value. Since consecutive metric samples are often similar (CPU at 45.2%, then 45.3%), XOR produces a result with many leading and trailing zeros. Gorilla stores only the meaningful bits: first it writes the count of leading zeros, then the length of the meaningful XOR bits, then the bits themselves. If the XOR result falls within the same leading/trailing zero boundaries as the previous XOR, it writes just the meaningful bits with a one-bit prefix. In practice Gorilla averages 1.37 bytes per data point (timestamp plus value) on typical monitoring metrics, versus 16 bytes uncompressed, roughly a 12x compression ratio with no precision loss and O(1) decode per sample.
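A minimal sketch of the XOR step (bookkeeping only; a real encoder emits a packed bitstream using the control-bit scheme described above):

```python
import struct

def float_bits(x):
    """Reinterpret a float64's IEEE 754 bit pattern as a 64-bit integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def xor_encode(values):
    """XOR each value's bit pattern with the previous one. Similar
    consecutive values share sign, exponent, and high mantissa bits, so the
    XOR has many leading zeros, leaving few 'meaningful' bits to store.
    Returns (leading_zeros, trailing_zeros, meaningful_bits) per sample."""
    prev = float_bits(values[0])
    out = []
    for v in values[1:]:
        bits = float_bits(v)
        x = bits ^ prev
        leading = 64 - x.bit_length() if x else 64
        trailing = (x & -x).bit_length() - 1 if x else 64
        meaningful = 64 - leading - trailing if x else 0
        out.append((leading, trailing, meaningful))
        prev = bits
    return out
```

An unchanged sample XORs to zero and costs a single control bit; a small change in the low mantissa costs only its handful of meaningful bits.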

When should you use Prometheus recording rules instead of query-time aggregation?

Recording rules pre-compute expensive PromQL expressions on an interval and store the result as a new time series. Use them when: a query fans out over thousands of series (e.g., summing request rates across all pods in a large cluster), the same aggregation powers multiple dashboards or alert rules, or query latency exceeds acceptable dashboard load times. Recording rules eliminate redundant computation and reduce query engine load at the cost of storage for the derived series and a staleness window equal to the evaluation interval. For simple queries over low-cardinality series, query-time aggregation is fine. As a rule of thumb: if a PromQL expression takes more than 2 seconds to execute, it’s a recording rule candidate.

How does AlertManager handle alert deduplication and grouping?

AlertManager receives alert notifications from Prometheus (or compatible sources), groups related alerts by configurable label sets (e.g., alertname + cluster + namespace), and sends a single grouped notification per group rather than one notification per alert. Deduplication works by fingerprinting each alert on its label set — identical fingerprints within the same group are merged. The group_wait parameter delays the first notification to batch alerts that fire together; group_interval controls how often subsequent notifications are sent for an active group; repeat_interval controls re-notification for ongoing alerts. Silences suppress matching alerts for a time window. Inhibition rules suppress lower-priority alerts when a higher-priority alert is already firing (e.g., suppress all service alerts when the entire datacenter is down).
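A sketch of the fingerprint-and-group step (the hash choice and data structures are illustrative; AlertManager uses its own label fingerprinting internally):

```python
import hashlib

def fingerprint(labels):
    """Deduplication key: a stable hash over the sorted label set, so
    alerts with identical labels collapse into one entry."""
    canonical = ";".join(f"{k}={labels[k]}" for k in sorted(labels))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

def group_alerts(alerts, group_by):
    """Bucket alerts by the configured grouping labels; one notification
    goes out per group, after deduplicating by fingerprint within it."""
    groups = {}
    for alert in alerts:
        key = tuple(alert["labels"].get(l, "") for l in group_by)
        bucket = groups.setdefault(key, {})
        bucket[fingerprint(alert["labels"])] = alert  # dedup inside the group
    return groups

alerts = [
    {"labels": {"alertname": "HighLatency", "cluster": "us-east", "pod": "a"}},
    {"labels": {"alertname": "HighLatency", "cluster": "us-east", "pod": "a"}},
    {"labels": {"alertname": "HighLatency", "cluster": "us-east", "pod": "b"}},
]
groups = group_alerts(alerts, ["alertname", "cluster"])
```

Here the duplicate pod-a alert is merged and both pods land in one group, so one notification is sent instead of three.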

How should on-call routing differ for P1 versus P2 incidents in a metrics monitoring system?

P1 (critical, customer-impacting): page the primary on-call immediately via phone/SMS through PagerDuty or Opsgenie; if not acknowledged within 5 minutes, escalate to secondary on-call; if not acknowledged within another 5 minutes, escalate to the on-call manager and open a war room channel automatically. P2 (degraded, not yet customer-impacting): send to Slack and email; page only if unacknowledged after 30 minutes. Key design points: AlertManager routes by severity label (severity=critical vs severity=warning); PagerDuty schedules and escalation policies enforce the SLA; alert deduplication in AlertManager ensures a flapping metric doesn’t create a paging storm. Runbook URLs should be embedded in alert annotations so the responder has immediate context.
