Alerting System Low-Level Design: Alert Evaluation, Deduplication, Routing, and On-Call Escalation

Alerting System: Goals and Challenges

An alerting system monitors metrics and fires notifications when conditions violate defined thresholds. The central challenges are minimizing false positives (alert fatigue kills on-call teams), ensuring no true positive is missed (SLA breach), and routing alerts to the right people with enough context to act immediately.

Alert Rule Schema

CREATE TABLE alert_rules (
  id           BIGSERIAL PRIMARY KEY,
  name         TEXT,             -- "HighErrorRate"
  expr         TEXT,             -- PromQL: error_rate > 0.05
  for_duration INTERVAL,         -- must be true for this long before firing
  severity     TEXT,             -- critical, warning, info
  labels       JSONB,            -- {"team": "payments", "service": "checkout"}
  annotations  JSONB,            -- {"summary": "Error rate above 5%", "runbook": "https://..."}
  receiver     TEXT              -- routing key
);

The for_duration is the most important parameter for avoiding flapping. With a for_duration of 5 minutes, a spike that lasts 30 seconds pages no one, while a condition sustained for the full 5 minutes signals that something is genuinely wrong.

Alert State Machine

Each alert rule instance (per unique label set) moves through states:

  • INACTIVE: Condition is not met. No alert.
  • PENDING: Condition is met, but the for_duration has not elapsed. Alertmanager is not notified yet.
  • FIRING: Condition has been met continuously for the full for_duration. Notification sent.
  • RESOLVED: Condition is no longer met after being FIRING. Resolution notification sent.

The PENDING state is what prevents paging on transient spikes. It also adds inherent detection latency: an alert fires up to one evaluation interval plus the for_duration after the condition first becomes true, so factor this into SLA definitions.
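The transitions above can be sketched in a few lines of Python. This is a minimal illustration, not Prometheus internals: the `AlertInstance` class, its field names, and the use of plain seconds for timestamps are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class State(Enum):
    INACTIVE = "inactive"
    PENDING = "pending"
    FIRING = "firing"
    RESOLVED = "resolved"


@dataclass
class AlertInstance:
    """Hypothetical state holder for one (rule, label set) instance."""
    for_duration: float                   # seconds the condition must hold
    state: State = State.INACTIVE
    active_since: Optional[float] = None  # when the condition first became true

    def update(self, condition_met: bool, now: float) -> State:
        if condition_met:
            if self.state in (State.INACTIVE, State.RESOLVED):
                # Condition just became true: start the for_duration clock.
                self.state = State.PENDING
                self.active_since = now
            if self.state is State.PENDING and now - self.active_since >= self.for_duration:
                self.state = State.FIRING
        else:
            # Leaving FIRING yields RESOLVED; anything else drops back to INACTIVE.
            self.state = State.RESOLVED if self.state is State.FIRING else State.INACTIVE
            self.active_since = None
        return self.state
```

With `for_duration=300`, a condition that is true at t=0 and false again at t=90 never leaves PENDING, which is the spike-suppression behavior described above.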

Alert Evaluation Pipeline

The evaluation loop runs on a configurable schedule (typically every 15–60 seconds):

  1. Query the metrics store with the rule's PromQL expression.
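  2. Treat each label set the query returns as an instance whose condition is met. The threshold comparison is already part of the expression (error_rate > 0.05), so series that fail it are absent from the result.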
  3. Update the state machine for that (rule, label_set) instance.
  4. Emit FIRING or RESOLVED events to Alertmanager.

Rule groups are evaluated in parallel. Rules within a group share the same evaluation interval and are evaluated sequentially, so a rule never reads partial state (e.g., a recording rule's result must be written before a rule that reads it).
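A rough sketch of that scheduling model, assuming rules are plain Python callables that read and write a shared dictionary (a stand-in for the metrics store). The function names and the two example rules are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# A rule is modeled as a callable that reads and writes a shared store.
Rule = Callable[[Dict[str, float]], None]


def evaluate_group(rules: List[Rule], store: Dict[str, float]) -> None:
    """Run the rules of one group in order, so later rules see earlier writes."""
    for rule in rules:
        rule(store)


def evaluate_all(groups: Dict[str, List[Rule]], store: Dict[str, float]) -> None:
    """Run groups concurrently; each group stays sequential internally."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(evaluate_group, rules, store) for rules in groups.values()]
        for future in futures:
            future.result()  # re-raise any rule failure


def record_error_rate(store: Dict[str, float]) -> None:
    # Recording rule: writes a derived series.
    store["error_rate"] = store["errors"] / store["requests"]


def alert_high_error_rate(store: Dict[str, float]) -> None:
    # Alerting rule: reads the recorded series, so it must run second.
    store["HighErrorRate"] = 1.0 if store["error_rate"] > 0.05 else 0.0
```

Placing `record_error_rate` before `alert_high_error_rate` in the same group guarantees the ordering; placing them in different groups would not.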

Deduplication via Fingerprinting

An alert fingerprint is derived from the alert name and its label set: fingerprint = hash(alertname + sorted_labels). Alertmanager uses the fingerprint to deduplicate: if an alert with the same fingerprint is already FIRING, it does not send another notification until the repeat interval elapses. This prevents notification storms when the evaluation loop fires repeatedly for the same ongoing condition.
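As an illustration of the idea, here is a small Python sketch. It uses SHA-256 and a made-up separator for simplicity, which is not the hash Alertmanager itself uses; the `Deduplicator` class is hypothetical and ignores repeat intervals.

```python
import hashlib
from typing import Dict, Set


def fingerprint(alertname: str, labels: Dict[str, str]) -> str:
    """Stable hash over the alert name and its labels, independent of key order."""
    parts = [f"alertname={alertname}"]
    parts += [f"{key}={value}" for key, value in sorted(labels.items())]
    # The separator is an assumption; a production version should pick one
    # that cannot appear in label names or values.
    return hashlib.sha256("\x1f".join(parts).encode("utf-8")).hexdigest()


class Deduplicator:
    """Hypothetical in-memory tracker of fingerprints that are currently firing."""

    def __init__(self) -> None:
        self._firing: Set[str] = set()

    def should_notify(self, fp: str) -> bool:
        if fp in self._firing:
            return False  # already notified for this ongoing condition
        self._firing.add(fp)
        return True

    def resolve(self, fp: str) -> None:
        self._firing.discard(fp)
```

Sorting the labels before hashing is what makes two evaluations of the same alert produce the same fingerprint regardless of the order in which labels arrive.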

Alert Grouping

When many alerts fire simultaneously (e.g., a datacenter goes down), sending one notification per alert floods on-call with noise. Grouping combines related alerts into a single notification.

route:
  group_by: [alertname, datacenter]
  group_wait: 30s       # wait for more alerts before sending
  group_interval: 5m    # send updates every 5 minutes if group changes
  repeat_interval: 4h   # resend if still firing after 4 hours

A “datacenter down” group might contain 50 individual service alerts, but the on-call receives one message: “50 alerts firing in dc-east, root cause: datacenter outage.”
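The core of group_by can be sketched as bucketing alerts by the values of the chosen labels. This is a simplified illustration that ignores the timing parameters (group_wait, group_interval, repeat_interval). The function name is hypothetical, and treating a missing label as an empty string is an assumption.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Alert = Dict[str, str]  # an alert is represented by its label set


def group_alerts(alerts: List[Alert], group_by: List[str]) -> Dict[Tuple[str, ...], List[Alert]]:
    """Bucket alerts by the values of the group_by labels."""
    groups: Dict[Tuple[str, ...], List[Alert]] = defaultdict(list)
    for alert in alerts:
        # Assumption: a missing label is treated like an empty value.
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)
```

Fifty alerts that share alertname and datacenter but differ in their service label land in one bucket, so they can be delivered as one notification.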

Inhibition Rules

Inhibition suppresses child alerts when a parent alert is firing, preventing redundant pages. Example: if a host is down (parent), suppress service alerts on that host (children), since the engineer already knows the host is down. The rule below suppresses only warning-severity alerts, so critical alerts on the same host still notify.

inhibit_rules:
  - source_match:
      alertname: HostDown
    target_match:
      severity: warning
    equal: [datacenter, host]

The equal field ensures inhibition only applies when source and target alerts share the same label values — a host down in dc-east does not inhibit alerts from dc-west.
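A simplified sketch of how such a rule could be evaluated, assuming exact-match label matchers only. The function names are hypothetical, and details of the real implementation (regex matchers, alerts that match both sides of a rule) are left out.

```python
from typing import Dict, List

Labels = Dict[str, str]


def matches(labels: Labels, matcher: Labels) -> bool:
    """True when every matcher label is present with the same value."""
    return all(labels.get(key) == value for key, value in matcher.items())


def is_inhibited(target: Labels, firing: List[Labels], source_match: Labels,
                 target_match: Labels, equal: List[str]) -> bool:
    """True when some firing source alert inhibits the target alert."""
    if not matches(target, target_match):
        return False
    for source in firing:
        same_scope = all(source.get(label, "") == target.get(label, "") for label in equal)
        if matches(source, source_match) and same_scope:
            return True
    return False
```

The `same_scope` check plays the role of the equal field: without it, one HostDown alert anywhere would silence warnings everywhere.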

Routing Tree

Alertmanager routes alerts through a tree of matchers to receivers. Each node matches on labels and forwards to a receiver (PagerDuty, Slack, email, webhook) or a child route for more specific matching.

route:
  receiver: default-slack
  routes:
    - match: {severity: critical}
      receiver: pagerduty-critical
      routes:
        - match: {team: payments}
          receiver: pagerduty-payments
    - match: {severity: warning}
      receiver: slack-warnings

Critical payments alerts go to the payments team's PagerDuty. Critical alerts from other teams go to the general critical PagerDuty. Warnings go to Slack for async review.
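The traversal can be sketched as a depth-first walk that descends into the first matching child and otherwise falls back to the current node's receiver. The `Route` class and `resolve_receiver` function are hypothetical, and the sketch assumes exact-match labels and a single receiver per alert (no fan-out to multiple matching siblings).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Route:
    receiver: str
    match: Dict[str, str] = field(default_factory=dict)
    routes: List["Route"] = field(default_factory=list)


def resolve_receiver(route: Route, labels: Dict[str, str]) -> str:
    """Descend into the first matching child; fall back to this node's receiver."""
    for child in route.routes:
        if all(labels.get(key) == value for key, value in child.match.items()):
            return resolve_receiver(child, labels)
    return route.receiver


# The routing tree from the configuration above.
ROOT = Route(
    receiver="default-slack",
    routes=[
        Route(
            receiver="pagerduty-critical",
            match={"severity": "critical"},
            routes=[Route(receiver="pagerduty-payments", match={"team": "payments"})],
        ),
        Route(receiver="slack-warnings", match={"severity": "warning"}),
    ],
)
```

An alert labeled severity=info matches no child route and therefore ends up at default-slack, which is why the root receiver acts as a catch-all.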

On-Call Rotation and Escalation

On-call schedules define who is primary and secondary for each time window. Escalation policies define what happens if the primary does not acknowledge within N minutes:

  • 0 min: Page primary on-call via PagerDuty mobile push + phone call.
  • 5 min: No ack → page secondary on-call.
  • 15 min: No ack → page engineering manager + team Slack channel.
  • 30 min: No ack → page VP Engineering + incident commander.

Acknowledgment stops escalation. Resolution closes the incident and records its time to acknowledge and time to resolve, which feed the MTTA (mean time to acknowledge) and MTTR (mean time to resolve) metrics.
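The escalation ladder above can be expressed as a small table plus a lookup. This is a sketch under assumptions: the `paged_so_far` function is hypothetical, times are minutes since the page was opened, and an acknowledgment that arrives exactly at a step's deadline is treated as stopping that step.

```python
from typing import List, Optional, Tuple

# (minutes since the page was opened, who gets notified), mirroring the list above.
ESCALATION_POLICY: List[Tuple[int, str]] = [
    (0, "primary on-call"),
    (5, "secondary on-call"),
    (15, "engineering manager + team Slack channel"),
    (30, "VP Engineering + incident commander"),
]


def paged_so_far(minutes_elapsed: float, acked_at: Optional[float] = None) -> List[str]:
    """Return every level that has been paged, stopping at acknowledgment."""
    paged: List[str] = []
    for delay, target in ESCALATION_POLICY:
        if delay > minutes_elapsed:
            break  # this step is still in the future
        if paged and acked_at is not None and acked_at <= delay:
            break  # someone acknowledged before this step was due
        paged.append(target)
    return paged
```

For example, `paged_so_far(20, acked_at=7)` returns only the primary and secondary, because the acknowledgment at minute 7 cancels the 15-minute step.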

Alert Fatigue Prevention

  • Silences: Temporarily suppress alerts for known maintenance or known issues being worked. Time-limited with a required comment and owner.
  • Maintenance windows: Schedule recurring silences for planned maintenance (e.g., every Sunday 2–4 AM for DB backups). Automatically expire.
  • Dependency-aware grouping: Build a service dependency map and suppress leaf alerts when the root dependency is alerting — similar to inhibition but derived from topology rather than manually configured rules.
  • Alert review cadence: Monthly audit of firing rate per alert rule. Rules with high false-positive rates get tuned (raise the threshold, increase the for_duration) or deleted. Alerts that never fire get reviewed for relevance.
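A silence with the properties described in the first bullet (time-limited, with a required comment and owner) could be modeled as follows. The `Silence` class and its field names are hypothetical and use exact-match labels only.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict


@dataclass
class Silence:
    matchers: Dict[str, str]
    starts_at: datetime
    ends_at: datetime
    owner: str
    comment: str

    def __post_init__(self) -> None:
        # Enforce the policy: every silence is attributable and expires.
        if not self.owner or not self.comment:
            raise ValueError("a silence needs an owner and a comment")
        if self.ends_at <= self.starts_at:
            raise ValueError("a silence must be time-limited")

    def suppresses(self, labels: Dict[str, str], now: datetime) -> bool:
        """True when the silence is active and all matchers match the alert."""
        active = self.starts_at <= now < self.ends_at
        return active and all(labels.get(k) == v for k, v in self.matchers.items())
```

Rejecting silences without an end time at construction is one way to make sure a forgotten silence cannot hide a real incident indefinitely.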

Trade-offs and Failure Modes

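  • Alertmanager HA: Run multiple Alertmanager instances in a cluster that uses a gossip protocol (Memberlist) to share deduplication state. Prometheus sends every alert to all instances rather than load-balancing across them, and the gossiped notification state lets the instances avoid sending the same notification twice. Delivery is at-least-once, so an occasional duplicate is possible during a network partition.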
  • Metrics store outage: If Prometheus is down, alert evaluation stops. “Dead man's switch” pattern: a Watchdog alert always fires. An external service (e.g., Dead Man's Snitch) expects a heartbeat from this alert; if the heartbeat stops, the external service pages the team.
  • Notification channel outage: PagerDuty down during an incident. Maintain a secondary receiver (direct SMS, backup email) for critical alerts. Test the backup path monthly.
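The dead man's switch logic on the external side is small enough to sketch. The `DeadMansSwitch` class is a hypothetical stand-in for the external service, with timestamps as plain seconds. The key point is that it must run outside the monitoring stack it watches.

```python
class DeadMansSwitch:
    """Hypothetical external checker that pages when heartbeats stop."""

    def __init__(self, timeout_seconds: float, started_at: float) -> None:
        self.timeout_seconds = timeout_seconds
        self.last_heartbeat = started_at

    def heartbeat(self, now: float) -> None:
        """Called each time the always-firing Watchdog notification arrives."""
        self.last_heartbeat = now

    def should_page(self, now: float) -> bool:
        """True when no heartbeat has arrived within the timeout."""
        return now - self.last_heartbeat > self.timeout_seconds
```

The timeout should be comfortably longer than the interval at which the Watchdog notification is re-sent, otherwise normal jitter triggers false pages.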
