Alerting System: Goals and Challenges
An alerting system monitors metrics and fires notifications when conditions violate defined thresholds. The central challenges are minimizing false positives (alert fatigue kills on-call teams), ensuring no true positive is missed (SLA breach), and routing alerts to the right people with enough context to act immediately.
Alert Rule Schema
CREATE TABLE alert_rules (
  id BIGSERIAL PRIMARY KEY,
  name TEXT,          -- "HighErrorRate"
  expr TEXT,          -- PromQL: error_rate > 0.05
  for_duration INTERVAL, -- must be true for this long before firing
  severity TEXT,      -- critical, warning, info
  labels JSONB,       -- {team: "payments", service: "checkout"}
  annotations JSONB,  -- {summary: "Error rate above 5%", runbook: "https://..."}
  receiver TEXT       -- routing key
)
The for duration is the most important parameter for avoiding flapping. A spike that lasts 30 seconds does not page anyone. A condition sustained for 5 minutes means something is genuinely wrong.
Alert State Machine
Each alert rule instance (per unique label set) moves through states:
- INACTIVE: Condition is not met. No alert.
- PENDING: Condition is met, but the for duration has not elapsed. Alertmanager is not notified yet.
- FIRING: Condition has been met continuously for the full for duration. Notification sent.
- RESOLVED: Condition is no longer met after being FIRING. Resolution notification sent.
The PENDING state is what prevents paging on transient spikes. It also means alerts have inherent detection latency equal to the evaluation interval plus the for duration — factor this into SLA definitions.
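The state machine above can be sketched as a small class. This is a minimal illustration, not Prometheus's actual implementation; the class and method names are invented for the example.

```python
import time
from enum import Enum
from typing import Optional

class AlertState(Enum):
    INACTIVE = "inactive"
    PENDING = "pending"
    FIRING = "firing"

class AlertInstance:
    """Tracks one (rule, label_set) instance through the alert state machine."""

    def __init__(self, for_duration_s: float):
        self.for_duration_s = for_duration_s
        self.state = AlertState.INACTIVE
        self.pending_since: Optional[float] = None

    def evaluate(self, condition_met: bool, now: float) -> Optional[str]:
        """Run one evaluation tick; return 'FIRING' or 'RESOLVED' when a
        notification should be emitted, else None."""
        if condition_met:
            if self.state == AlertState.INACTIVE:
                # Threshold just breached: start the debounce clock.
                self.state = AlertState.PENDING
                self.pending_since = now
            if (self.state == AlertState.PENDING
                    and now - self.pending_since >= self.for_duration_s):
                self.state = AlertState.FIRING
                return "FIRING"
        else:
            # Condition cleared: a PENDING spike resolves silently,
            # a FIRING alert emits a resolution notification.
            was_firing = self.state == AlertState.FIRING
            self.state = AlertState.INACTIVE
            self.pending_since = None
            if was_firing:
                return "RESOLVED"
        return None
```

Note how a transient spike that clears while still PENDING never produces a notification, which is exactly the debounce behavior described above.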
Alert Evaluation Pipeline
The evaluation loop runs on a configurable schedule (typically every 15–60 seconds):
- Query the metrics store with the rule's PromQL expression.
- For each result label set, compare against the threshold.
- Update the state machine for that (rule, label_set) instance.
- Emit FIRING or RESOLVED events to Alertmanager.
Rules are evaluated in parallel groups. Rules within a group share the same evaluation interval and are evaluated sequentially to avoid partial-state reads (e.g., a recording rule result must be written before a rule that reads it).
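One evaluation tick can be sketched as follows. For brevity this version fires immediately on a breached threshold (no PENDING debounce), and `query_metrics` is a hypothetical stand-in for the metrics-store client, assumed to return a mapping from label sets to values.

```python
def evaluate_rule(rule, states, query_metrics):
    """One evaluation tick: query, compare against the threshold, diff the
    per-label-set state, and collect FIRING/RESOLVED events to emit."""
    events = []
    results = query_metrics(rule["expr"])  # {label_set_tuple: value}
    seen = set()
    for labels, value in results.items():
        seen.add(labels)
        breached = value > rule["threshold"]
        prev = states.get(labels, "INACTIVE")
        if breached and prev != "FIRING":
            states[labels] = "FIRING"
            events.append(("FIRING", labels))
        elif not breached and prev == "FIRING":
            states[labels] = "INACTIVE"
            events.append(("RESOLVED", labels))
    # Label sets that vanished from the query result also resolve.
    for labels in [l for l, s in states.items()
                   if s == "FIRING" and l not in seen]:
        states[labels] = "INACTIVE"
        events.append(("RESOLVED", labels))
    return events
```

The caller would run this per rule on the group's evaluation interval and forward the returned events to Alertmanager.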
Deduplication via Fingerprinting
An alert fingerprint is derived from the alert name and its label set: fingerprint = hash(alertname + sorted_labels). Alertmanager uses the fingerprint to deduplicate: if an alert with the same fingerprint is already FIRING, do not send another notification. This prevents notification storms when the evaluation loop fires repeatedly for the same ongoing condition.
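A fingerprint function along these lines (the exact hash and separator are illustrative, not Alertmanager's actual scheme):

```python
import hashlib

def fingerprint(alertname: str, labels: dict) -> str:
    """Deterministic dedup key: hash of the alert name plus sorted label
    pairs, so label ordering never changes the result."""
    parts = [alertname] + [f"{k}={v}" for k, v in sorted(labels.items())]
    return hashlib.sha256("\x1f".join(parts).encode()).hexdigest()[:16]
```

Sorting the labels is the essential detail: two evaluations of the same ongoing condition must hash identically regardless of the order labels were attached.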
Alert Grouping
When many alerts fire simultaneously (e.g., a datacenter goes down), sending one notification per alert floods on-call with noise. Grouping combines related alerts into a single notification.
route:
  group_by: [alertname, datacenter]
  group_wait: 30s      # wait for more alerts before sending
  group_interval: 5m   # send updates every 5 minutes if group changes
  repeat_interval: 4h  # resend if still firing after 4 hours
A “datacenter down” group might contain 50 individual service alerts, but the on-call receives one message: “50 alerts firing in dc-east, root cause: datacenter outage.”
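Grouping reduces to bucketing firing alerts by the values of the group_by labels, one notification per bucket. A minimal sketch:

```python
from collections import defaultdict

def group_alerts(alerts, group_by):
    """Bucket firing alerts by their values for the group_by labels;
    each bucket becomes a single notification."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple((lbl, alert["labels"].get(lbl, "")) for lbl in group_by)
        groups[key].append(alert)
    return dict(groups)
```

(The real Alertmanager additionally applies the group_wait/group_interval timers per bucket before sending; this sketch shows only the key computation.)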
Inhibition Rules
Inhibition suppresses child alerts when a parent alert is firing, preventing redundant pages. Example: if a host is down (parent), suppress all service alerts on that host (children) — the engineer already knows the host is down.
inhibit_rules:
  - source_match:
      alertname: HostDown
    target_match:
      severity: warning
    equal: [datacenter, host]
The equal field ensures inhibition only applies when source and target alerts share the same label values — a host down in dc-east does not inhibit alerts from dc-west.
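The check behind a rule like this can be sketched as a predicate (a simplified illustration of the semantics, not Alertmanager's code):

```python
def is_inhibited(target, firing_sources, rule):
    """True if the target matches target_match and some firing source alert
    matches source_match while sharing the values of every `equal` label."""
    # Target must match the target_match labels at all.
    if any(target["labels"].get(k) != v
           for k, v in rule["target_match"].items()):
        return False
    for src in firing_sources:
        matches_source = all(src["labels"].get(k) == v
                             for k, v in rule["source_match"].items())
        same_scope = all(src["labels"].get(l) == target["labels"].get(l)
                         for l in rule["equal"])
        if matches_source and same_scope:
            return True
    return False
```

The `equal` comparison is what scopes the suppression: the source and target must agree on datacenter and host, so an outage in one datacenter cannot silence another.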
Routing Tree
Alertmanager routes alerts through a tree of matchers to receivers. Each node matches on labels and forwards to a receiver (PagerDuty, Slack, email, webhook) or a child route for more specific matching.
route:
  receiver: default-slack
  routes:
    - match: {severity: critical}
      receiver: pagerduty-critical
      routes:
        - match: {team: payments}
          receiver: pagerduty-payments
    - match: {severity: warning}
      receiver: slack-warnings
Critical payments alerts go to the payments team's PagerDuty. Critical alerts from other teams go to the general critical PagerDuty. Warnings go to Slack for async review.
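The tree walk itself is a short recursion: descend into the first matching child, and fall back to the current node's receiver when no child matches. (This sketch omits Alertmanager's `continue` flag, which lets an alert match multiple sibling routes.)

```python
def route_alert(node, labels):
    """Walk the routing tree: the deepest matching route wins; otherwise
    the current node's receiver applies."""
    for child in node.get("routes", []):
        if all(labels.get(k) == v for k, v in child.get("match", {}).items()):
            return route_alert(child, labels)
    return node["receiver"]
```

Running it against the tree above reproduces the routing described: payments criticals reach the team's PagerDuty, other criticals the general one, warnings go to Slack, and everything else falls through to the root receiver.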
On-Call Rotation and Escalation
On-call schedules define who is primary and secondary for each time window. Escalation policies define what happens if the primary does not acknowledge within N minutes:
- 0 min: Page primary on-call via PagerDuty mobile push + phone call.
- 5 min: No ack → page secondary on-call.
- 15 min: No ack → page engineering manager + team Slack channel.
- 30 min: No ack → page VP Engineering + incident commander.
Acknowledgment stops escalation. Resolution closes the incident and records MTTA (mean time to acknowledge) and MTTR (mean time to resolve) metrics.
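The ladder above reduces to a list of (delay, target) tiers; given the minutes elapsed since the first page, every tier whose delay has passed should have been notified, and an acknowledgment halts further escalation. A minimal sketch with invented names:

```python
def escalation_targets(policy, minutes_since_page, acked):
    """Return every tier that should have been paged by now.
    An acknowledgment stops all further escalation."""
    if acked:
        return []
    return [target for delay, target in policy if minutes_since_page >= delay]
```

In a real pager, this check would run on a timer and diff against tiers already notified, only paging the newly reached ones.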
Alert Fatigue Prevention
- Silences: Temporarily suppress alerts for known maintenance or known issues being worked. Time-limited with a required comment and owner.
- Maintenance windows: Schedule recurring silences for planned maintenance (e.g., every Sunday 2–4 AM for DB backups). Automatically expire.
- Dependency-aware grouping: Build a service dependency map and suppress leaf alerts when the root dependency is alerting — similar to inhibition but derived from topology rather than manually configured rules.
- Alert review cadence: Monthly audit of firing rate per alert rule. Rules with high false-positive rates get tuned (raise the threshold, increase the for duration) or deleted. Alerts that never fire get reviewed for relevance.
Trade-offs and Failure Modes
- Alertmanager HA: Run multiple Alertmanager instances in a cluster using a gossip protocol (Memberlist) for deduplication state. All instances receive all alerts; gossip ensures only one sends the notification.
- Metrics store outage: If Prometheus is down, alert evaluation stops. “Dead man's switch” pattern: a Watchdog alert always fires. An external service (Deadman's Snitch) expects a heartbeat from this alert; if the heartbeat stops, the external service pages the team.
- Notification channel outage: PagerDuty down during an incident. Maintain a secondary receiver (direct SMS, backup email) for critical alerts. Test the backup path monthly.