A monitoring and alerting system collects metrics from distributed services, stores them efficiently, evaluates alert conditions, and notifies on-call engineers. Building one correctly requires understanding time-series data, alert evaluation semantics, and the operational properties of the pipeline itself — the monitoring system must be more reliable than the services it monitors.
Metrics Collection
Push vs. Pull
Pull-based (Prometheus model): the metrics server scrapes each service endpoint on a schedule. Services expose a /metrics endpoint; the server controls scrape frequency. Advantages: scrape failures are visible to the metrics server, no client-side buffering needed, service discovery is centralized. Pull works well in static infrastructure with known targets.
Push-based (StatsD, InfluxDB model): services push metrics to a collection agent. Better for dynamic environments (containers, lambdas), short-lived jobs, and when services can’t expose an HTTP endpoint. Push requires handling backpressure — if the collection agent is slow, services must buffer or drop metrics.
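The pull model above relies on a plain-text exposition format served at /metrics. As a minimal sketch (illustration only: it handles `name{labels} value` lines, skips HELP/TYPE comments, and ignores optional trailing timestamps; real scrapers use an HTTP client plus the official client-library parsers):

```python
def parse_exposition(text: str) -> dict:
    """Parse Prometheus-style exposition text into {(name, label_string): value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # comment (# HELP / # TYPE) or blank line
        name_part, _, value = line.rpartition(" ")
        if "{" in name_part:
            name, _, labels = name_part.partition("{")
            labels = "{" + labels
        else:
            name, labels = name_part, ""
        samples[(name, labels)] = float(value)
    return samples


text = (
    "# HELP http_requests_total Total HTTP requests.\n"
    "# TYPE http_requests_total counter\n"
    'http_requests_total{method="get",code="200"} 1027\n'
    "process_open_fds 24\n"
)
```

The metric and label names here are illustrative; any service exposing this format can be scraped by any compliant server, which is what makes pull-based service discovery centralizable.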
Metric Types
Counter: monotonically increasing (total requests, errors); the rate of change is computed at query time by the time-series database. Gauge: a point-in-time value that can go up or down (current queue depth, active connections). Histogram: a distribution of values bucketed on the client (request latency buckets), enabling server-side percentile computation. Summary: quantiles pre-computed on the client; less flexible than histograms because quantiles cannot be aggregated across instances.
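Simplified sketches of the first three metric types, assuming a single process with no concurrency (real client libraries add thread safety, labels, and an exposition layer; Prometheus's actual histograms expose cumulative "le" buckets, but the idea is the same):

```python
import bisect

class Counter:
    """Monotonically increasing; rates are derived server-side at query time."""
    def __init__(self):
        self.value = 0.0

    def inc(self, amount: float = 1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount

class Gauge:
    """Point-in-time value that can move in either direction."""
    def __init__(self):
        self.value = 0.0

    def set(self, value: float):
        self.value = value

class Histogram:
    """Counts observations into buckets by upper bound (inclusive)."""
    def __init__(self, upper_bounds):
        self.bounds = sorted(upper_bounds)
        self.counts = [0] * (len(self.bounds) + 1)  # last slot = +Inf bucket
        self.sum = 0.0
        self.count = 0

    def observe(self, value: float):
        # bisect_left gives "less than or equal" bucket semantics
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.sum += value
        self.count += 1
```

Storing only bucket counts (rather than raw observations) is what lets histograms from many instances be summed and percentiles estimated server-side.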
Time-Series Storage
Time-series data (timestamp, value, labels) has specific access patterns: append-only writes, time-range reads, and aggregation over time windows. Dedicated time-series databases (Prometheus TSDB, InfluxDB, VictoriaMetrics) exploit these patterns, compressing sequential timestamps and float values efficiently (delta-of-delta encoding, Gorilla-style XOR compression). A raw 16-byte sample (8-byte timestamp plus 8-byte float value) compresses to 1.37 bytes on average with Gorilla compression, roughly 12x.
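A sketch of the timestamp half of that scheme, delta-of-delta encoding (values are separately XOR-compressed in Gorilla). With a regular scrape interval the delta-of-deltas are almost all zero, which a bit-level encoder stores in a single bit each; this integer version just shows why the representation is so compressible:

```python
def dod_encode(timestamps):
    """First timestamp raw, then the first delta, then delta-of-deltas."""
    if not timestamps:
        return []
    encoded = [timestamps[0]]
    prev, prev_delta = timestamps[0], None
    for ts in timestamps[1:]:
        delta = ts - prev
        # first delta stored raw, afterwards only the change between deltas
        encoded.append(delta if prev_delta is None else delta - prev_delta)
        prev, prev_delta = ts, delta
    return encoded

def dod_decode(encoded):
    """Invert dod_encode by re-accumulating deltas."""
    if not encoded:
        return []
    timestamps = [encoded[0]]
    delta = None
    for v in encoded[1:]:
        delta = v if delta is None else delta + v
        timestamps.append(timestamps[-1] + delta)
    return timestamps
```

A 15-second scrape with one slightly late sample encodes to `[t0, 15, 0, 0, 1]`: long runs of zeros are exactly what the bit-packing layer squeezes down.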
Retention strategy: store high-resolution data (15s scrape interval) for 15 days; downsample to 1-minute resolution for 90 days; downsample to 1-hour resolution for 2 years. This balances storage cost against the need for historical trend analysis. Downsampling aggregates: min, max, avg, sum over the interval.
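A sketch of the downsampling step: bucket raw samples into fixed intervals and keep min/max/avg/sum so coarser-resolution queries can still answer "peak" and "average" questions. The (timestamp, value) sample shape is an assumption:

```python
def downsample(samples, interval_seconds):
    """samples: iterable of (unix_ts, value). Returns {bucket_start: aggregates}."""
    buckets = {}
    for ts, value in samples:
        start = ts - ts % interval_seconds  # align to interval boundary
        agg = buckets.setdefault(
            start, {"min": value, "max": value, "sum": 0.0, "count": 0}
        )
        agg["min"] = min(agg["min"], value)
        agg["max"] = max(agg["max"], value)
        agg["sum"] += value
        agg["count"] += 1
    for agg in buckets.values():
        agg["avg"] = agg["sum"] / agg["count"]
    return buckets
```

Keeping min and max alongside avg matters: averaging away a latency spike during downsampling would hide exactly the events historical analysis looks for.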
Alert Evaluation
Threshold-Based Alerts
The simplest form: fire when a metric crosses a static threshold for a sustained duration. Example: alert if error_rate > 5% for 5 minutes. The sustained duration requirement (“for 5 minutes”) reduces false positives from transient spikes. Trade-off: longer duration means slower detection. For critical metrics (disk full), use shorter durations; for noisy metrics, use longer ones.
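The sustained-duration check can be sketched as a trailing-window scan; the function name and the (timestamp, value) sample shape are assumptions:

```python
def sustained_breach(samples, threshold, hold_seconds, now):
    """samples: ascending (unix_ts, value) pairs. True only if every sample
    in the last hold_seconds exceeds the threshold (and at least one exists)."""
    window = [v for ts, v in samples if ts >= now - hold_seconds]
    return bool(window) and all(v > threshold for v in window)
```

A single sample dipping back under the threshold resets the condition, which is what filters out transient spikes at the cost of slower detection.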
Anomaly Detection
Static thresholds miss seasonal patterns — a 10% error rate at 3am may be anomalous while 10% at noon may be normal. Anomaly-based alerts compare current values against a baseline (same time last week, rolling average ± N standard deviations). Requires more data and more complex evaluation, but produces fewer false positives for metrics with known diurnal or weekly patterns.
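A minimal sketch of the rolling-baseline variant: collect the metric's value for the comparable period (e.g. the same hour over previous weeks) and flag the current value if it falls outside mean ± N standard deviations. The 3-sigma default is an illustrative assumption to tune per metric:

```python
from statistics import mean, stdev

def is_anomalous(current, baseline, n_sigma=3.0):
    """baseline: historical values for the comparable period."""
    if len(baseline) < 2:
        return False  # not enough history to judge
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return current != mu  # perfectly flat history: any change is anomalous
    return abs(current - mu) > n_sigma * sigma
```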
Alert Evaluation Pipeline
Alert rules are evaluated periodically (every 15-60 seconds) by a rule evaluation engine. For each rule: execute the query against the time-series DB, evaluate the condition, track state (pending → firing → resolved). A pending state means the condition is true but hasn’t held for the required duration. This state machine prevents alerts from firing and resolving on every evaluation tick during intermittent conditions.
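The per-rule state machine described above can be sketched as follows; each evaluation tick feeds in whether the condition currently holds, the rule only moves to firing after the condition has held for the configured duration, and it drops back to inactive (resolved) as soon as the condition clears:

```python
import enum

class State(enum.Enum):
    INACTIVE = "inactive"
    PENDING = "pending"
    FIRING = "firing"

class AlertRule:
    """Tracks pending -> firing -> resolved transitions across ticks."""
    def __init__(self, hold_seconds: float):
        self.hold = hold_seconds
        self.state = State.INACTIVE
        self.since = None  # when the condition first became true

    def evaluate(self, condition_true: bool, now: float) -> State:
        if not condition_true:
            self.state, self.since = State.INACTIVE, None  # resolved
        elif self.state is State.INACTIVE:
            self.state, self.since = State.PENDING, now
        elif self.state is State.PENDING and now - self.since >= self.hold:
            self.state = State.FIRING
        return self.state
```

Because the `since` timestamp resets whenever the condition clears, an intermittent condition oscillates between inactive and pending without ever paging anyone.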
Alert Routing and Deduplication
Multiple firing alerts should not generate duplicate pages. An alert manager (Prometheus Alertmanager, PagerDuty) deduplicates alerts by grouping them on labels (service, environment, severity). Grouping: related alerts, such as all pods of one service failing, are bundled into a single notification. Inhibition: a high-severity alert (service down) suppresses lower-severity alerts from the same service, since a high-latency page adds nothing when the service is already known to be down. Silences: suppress alerts during known maintenance windows.
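Grouping and inhibition can be sketched over alerts represented as plain dicts with a "labels" mapping (an assumed shape for illustration, not Alertmanager's actual API):

```python
from collections import defaultdict

def group_alerts(alerts, group_by=("service", "environment", "severity")):
    """Bundle alerts sharing the same group-by label values into one
    notification, so ten failing pods produce one page, not ten."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(k) for k in group_by)
        groups[key].append(alert)
    return dict(groups)

def is_inhibited(alert, firing_alerts, match=("service",)):
    """Suppress a lower-severity alert when a critical alert with the
    same matching labels is already firing."""
    return alert["labels"].get("severity") != "critical" and any(
        other["labels"].get("severity") == "critical"
        and all(other["labels"].get(k) == alert["labels"].get(k) for k in match)
        for other in firing_alerts
    )
```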
Notification Channels
Route by severity: P1 (customer-impacting) → PagerDuty (wakes on-call), Slack #incidents; P2 (degraded, not down) → Slack #alerts, email; P3 (warning, no action needed) → Slack #monitoring only. On-call rotation integration: alerts route to the current on-call engineer based on a rotation schedule. Escalation: if the primary on-call doesn’t acknowledge within N minutes, escalate to secondary.
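The routing table and escalation rule above can be sketched as data plus two small functions; the channel names and the 15-minute acknowledgement timeout are illustrative assumptions:

```python
SEVERITY_ROUTES = {
    "P1": ["pagerduty", "slack:#incidents"],
    "P2": ["slack:#alerts", "email"],
    "P3": ["slack:#monitoring"],
}

def route(severity):
    """Channels to notify; unknown severities fall back to the quietest."""
    return SEVERITY_ROUTES.get(severity, ["slack:#monitoring"])

def page_target(rotation, minutes_since_page, acknowledged, timeout_minutes=15):
    """Primary on-call first; escalate to secondary once the ack timeout passes."""
    if acknowledged or minutes_since_page < timeout_minutes:
        return rotation["primary"]
    return rotation["secondary"]
```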
Monitoring the Monitoring System
The monitoring system must itself be monitored. Dead man’s switch: the monitoring system continuously sends a heartbeat to an external watchdog service. If the heartbeat stops, the watchdog pages the team — detecting monitoring outages. Monitor scrape success rates: if a target stops being scraped, alert. Use multiple independent monitoring systems for critical infrastructure (primary Prometheus + secondary DataDog) to avoid single points of failure in the observability stack.
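The watchdog side of a dead man's switch, and the scrape-health check, reduce to a few lines; the 180-second grace period is an illustrative assumption:

```python
def watchdog_should_page(last_heartbeat_ts, now, grace_seconds=180):
    """External watchdog: page when the monitoring system's heartbeat
    has been silent longer than the grace period."""
    return now - last_heartbeat_ts > grace_seconds

def unhealthy_targets(scrape_status):
    """scrape_status: {target_name: last_scrape_succeeded}. Targets that
    have stopped being scraped should themselves raise an alert."""
    return sorted(name for name, ok in scrape_status.items() if not ok)
```

The inversion is the point: the watchdog alerts on the absence of a signal, so a crashed or partitioned monitoring system still results in a page.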
SLO-Based Alerting
Rather than alerting on raw metrics, alert on error budget burn rate. If a 30-day SLO allows a 0.1% error rate, the monthly error budget is about 43 minutes of downtime. Alert when the burn rate exceeds 14.4x the sustainable rate: one hour at that rate consumes 2% of the monthly budget, and the full budget would be exhausted in roughly two days. This approach (from Google's SRE workbook) focuses alerts on SLO risk rather than on individual metrics, reducing alert fatigue from metrics that cross thresholds without actually threatening the SLO.
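The budget arithmetic is worth making explicit. In this sketch, an SLO target of 0.999 means a 0.1% error budget; at burn rate 1.0 the budget lasts exactly the SLO window, and at 14.4x a 30-day (720-hour) budget is gone in 720 / 14.4 = 50 hours:

```python
def error_budget_minutes(slo_target, window_days=30):
    """Total allowed 'bad minutes' over the SLO window."""
    return (1 - slo_target) * window_days * 24 * 60

def burn_rate(observed_error_rate, slo_target):
    """1.0 means consuming budget at exactly the sustainable rate."""
    return observed_error_rate / (1 - slo_target)

def hours_to_exhaustion(rate, window_days=30):
    """How long the full budget lasts at a given burn rate."""
    return window_days * 24 / rate
```

For a 99.9% target this gives a 43.2-minute monthly budget, and an observed 1.44% error rate corresponds to a 14.4x burn: the page-immediately regime.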