Alerting System: Goals and Challenges
An alerting system monitors metrics and fires notifications when conditions violate defined thresholds. The central challenges are minimizing false positives (alert fatigue kills on-call teams), ensuring no true positive is missed (SLA breach), and routing alerts to the right people with enough context to act immediately.
Alert Rule Schema
CREATE TABLE alert_rules (
    id           BIGSERIAL PRIMARY KEY,
    name         TEXT,      -- "HighErrorRate"
    expr         TEXT,      -- PromQL: error_rate > 0.05
    for_duration INTERVAL,  -- must be true for this long before firing
    severity     TEXT,      -- critical, warning, info
    labels       JSONB,     -- {"team": "payments", "service": "checkout"}
    annotations  JSONB,     -- {"summary": "Error rate above 5%", "runbook": "https://..."}
    receiver     TEXT       -- routing key
);
The for_duration is the most important parameter for avoiding flapping. With for_duration set to 5 minutes, a spike that lasts 30 seconds does not page anyone, while a condition sustained for the full 5 minutes means something is genuinely wrong.
Alert State Machine
Each alert rule instance (per unique label set) moves through states:
- INACTIVE: Condition is not met. No alert.
- PENDING: Condition is met, but for_duration has not elapsed. Alertmanager is not notified yet.
- FIRING: Condition has been met continuously for the full for_duration. Notification sent.
- RESOLVED: Condition is no longer met after being FIRING. Resolution notification sent.
The PENDING state is what prevents paging on transient spikes. It also means alerts have inherent detection latency: up to one evaluation interval to notice the condition, plus the full for_duration before the alert fires. Factor this into SLA definitions.
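The state transitions above can be sketched as a small Python class. This is a minimal illustration, not how any particular alerting engine implements it: the names `State`, `AlertInstance`, and `step` are hypothetical, and timestamps are plain seconds supplied by the caller.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class State(Enum):
    INACTIVE = "inactive"
    PENDING = "pending"
    FIRING = "firing"
    RESOLVED = "resolved"


@dataclass
class AlertInstance:
    """One (rule, label_set) instance. All names here are illustrative."""
    for_duration: float                    # seconds the condition must hold
    state: State = State.INACTIVE
    active_since: Optional[float] = None   # when the condition first became true

    def step(self, condition_met: bool, now: float) -> State:
        """Advance the state machine by one evaluation at time `now`."""
        if condition_met:
            if self.state in (State.INACTIVE, State.RESOLVED):
                # Condition just became true: start the for_duration clock.
                self.active_since = now
                self.state = State.PENDING
            if (self.state is State.PENDING
                    and now - self.active_since >= self.for_duration):
                self.state = State.FIRING
        else:
            # A firing alert resolves; anything else drops back to inactive.
            self.state = (State.RESOLVED if self.state is State.FIRING
                          else State.INACTIVE)
            self.active_since = None
        return self.state
```

With `for_duration=300` and evaluations every 60 seconds, a condition that is true at one evaluation and false at the next never leaves PENDING, which is exactly the transient-spike case described above.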
Alert Evaluation Pipeline
The evaluation loop runs on a configurable schedule (typically every 15–60 seconds):
- Query the metrics store with the rule's PromQL expression.
- For each result label set, compare against the threshold.
- Update the state machine for that (rule, label_set) instance.
- Emit FIRING or RESOLVED events to Alertmanager.
Rules are organized into groups, and the groups are evaluated in parallel. Rules within a group share the same evaluation interval and are evaluated sequentially to avoid partial-state reads (e.g., a recording rule's result must be written before a rule that reads it is evaluated).
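One pass of the evaluation loop for a single rule might look like the sketch below. It assumes a hypothetical `query` callable that returns the label sets for which the rule's expression currently holds; the function name `evaluate_rule` and the event tuples are illustrative, and the actual delivery to Alertmanager is left out.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List, Set, Tuple

LabelSet = FrozenSet[Tuple[str, str]]
Event = Tuple[str, LabelSet]


def evaluate_rule(
    query: Callable[[], Iterable[LabelSet]],  # hypothetical metrics-store query
    pending_since: Dict[LabelSet, float],     # label set -> time condition became true
    firing: Set[LabelSet],                    # label sets currently FIRING
    for_duration: float,
    now: float,
) -> List[Event]:
    """Run one evaluation cycle and return the FIRING/RESOLVED events to emit."""
    events: List[Event] = []
    active = set(query())

    # Label sets where the condition holds: start or continue the clock.
    for labels in active:
        pending_since.setdefault(labels, now)
        if labels not in firing and now - pending_since[labels] >= for_duration:
            firing.add(labels)
            events.append(("FIRING", labels))

    # Label sets that disappeared from the result: reset, and resolve if firing.
    for labels in list(pending_since):
        if labels not in active:
            del pending_since[labels]
            if labels in firing:
                firing.discard(labels)
                events.append(("RESOLVED", labels))
    return events
```

A scheduler would call this once per evaluation interval for each rule in a group, in order, so that later rules see the results of earlier ones.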
Deduplication via Fingerprinting
An alert fingerprint is derived from the alert name and its label set: fingerprint = hash(alertname + sorted_labels). Alertmanager uses the fingerprint to deduplicate: if an alert with the same fingerprint is already FIRING, do not send another notification. This prevents notification storms when the evaluation loop fires repeatedly for the same ongoing condition.
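As a rough sketch of the idea, the fingerprint can be computed by hashing the alert name together with the labels in sorted order, so that label ordering never changes the result. The choice of SHA-256, the 16-character truncation, and the `Deduplicator` class are assumptions made for illustration; any stable hash would serve.

```python
import hashlib
from typing import Dict, Set


def fingerprint(alertname: str, labels: Dict[str, str]) -> str:
    """Stable identifier for an alert: same name + same labels -> same value."""
    parts = [f"alertname={alertname}"]
    parts += [f"{key}={value}" for key, value in sorted(labels.items())]
    # A separator byte keeps ("a", "bc") and ("ab", "c") from colliding.
    digest = hashlib.sha256("\x00".join(parts).encode("utf-8")).hexdigest()
    return digest[:16]


class Deduplicator:
    """Tracks fingerprints that are already firing (illustrative only)."""

    def __init__(self) -> None:
        self._firing: Set[str] = set()

    def should_notify(self, fp: str) -> bool:
        """True only the first time a fingerprint is seen while firing."""
        if fp in self._firing:
            return False
        self._firing.add(fp)
        return True

    def resolve(self, fp: str) -> None:
        """Forget a fingerprint so a later recurrence notifies again."""
        self._firing.discard(fp)
```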
Alert Grouping
When many alerts fire simultaneously (e.g., a datacenter goes down), sending one notification per alert floods on-call with noise. Grouping combines related alerts into a single notification.
route:
  group_by: [alertname, datacenter]
  group_wait: 30s       # wait for more alerts before sending
  group_interval: 5m    # send updates every 5 minutes if group changes
  repeat_interval: 4h   # resend if still firing after 4 hours
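With this configuration, alerts that share the same alertname and datacenter land in one group. A datacenter outage might trigger 50 individual service alerts in dc-east, but the on-call receives one message along the lines of “50 alerts firing in dc-east”, from which the datacenter outage is easy to recognize as the root cause.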
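The grouping step itself amounts to bucketing alerts by the values of the group_by labels. The sketch below assumes alerts are plain dictionaries of labels; `group_alerts` is a hypothetical helper, and the timing parameters (group_wait, group_interval, repeat_interval) are not modeled.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Alert = Dict[str, str]


def group_alerts(
    alerts: Sequence[Alert], group_by: Sequence[str]
) -> Dict[Tuple[str, ...], List[Alert]]:
    """Bucket alerts by the values of the group_by labels."""
    groups: Dict[Tuple[str, ...], List[Alert]] = defaultdict(list)
    for alert in alerts:
        # A missing label is treated as an empty value for grouping purposes.
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)
```

Each resulting bucket corresponds to one notification, so 50 alerts with identical alertname and datacenter values collapse into a single message.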
Inhibition Rules
Inhibition suppresses child alerts when a parent alert is firing, preventing redundant pages. Example: if a host is down (parent), suppress all service alerts on that host (children) — the engineer already knows the host is down.
inhibit_rules:
  - source_match:
      alertname: HostDown
    target_match:
      severity: warning
    equal: [datacenter, host]
The equal field ensures inhibition only applies when source and target alerts share the same label values — a host down in dc-east does not inhibit alerts from dc-west.
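The inhibition check can be sketched as a predicate over a candidate alert and the set of currently firing alerts. This is a simplified model that only handles exact-match label comparisons; `is_inhibited` and the dictionary-based rule format are assumptions for illustration.

```python
from typing import Dict, Iterable, List

Alert = Dict[str, str]


def _matches(alert: Alert, matchers: Dict[str, str]) -> bool:
    """True if every matcher label has exactly the expected value."""
    return all(alert.get(key) == value for key, value in matchers.items())


def is_inhibited(target: Alert, firing: Iterable[Alert], rule: dict) -> bool:
    """True if some firing source alert suppresses `target` under `rule`."""
    if not _matches(target, rule["target_match"]):
        return False
    equal: List[str] = rule.get("equal", [])
    for source in firing:
        if source is target:
            continue  # an alert does not inhibit itself
        if not _matches(source, rule["source_match"]):
            continue
        # Source and target must agree on every label listed in `equal`.
        if all(source.get(label) == target.get(label) for label in equal):
            return True
    return False
```

With the rule above, a warning on host web-1 in dc-east is suppressed while HostDown fires for that same host, whereas an otherwise identical warning from dc-west still goes out.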
Routing Tree
Alertmanager routes alerts through a tree of matchers to receivers. Each node matches on labels and forwards to a receiver (PagerDuty, Slack, email, webhook) or a child route for more specific matching.
route:
  receiver: default-slack
  routes:
    - match: {severity: critical}
      receiver: pagerduty-critical
      routes:
        - match: {team: payments}
          receiver: pagerduty-payments
    - match: {severity: warning}
      receiver: slack-warnings
Critical payments alerts go to the payments team's PagerDuty. Critical alerts from other teams go to the general critical PagerDuty. Warnings go to Slack for async review.
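Walking the tree is a depth-first search in which the first matching child wins and the deepest match determines the receiver. The sketch below mirrors the configuration above as nested dictionaries; `find_receiver` is a hypothetical helper and assumes exact-match labels with no "continue to siblings" behavior.

```python
from typing import Dict

ROUTING_TREE = {
    "receiver": "default-slack",
    "routes": [
        {
            "match": {"severity": "critical"},
            "receiver": "pagerduty-critical",
            "routes": [
                {"match": {"team": "payments"}, "receiver": "pagerduty-payments"},
            ],
        },
        {"match": {"severity": "warning"}, "receiver": "slack-warnings"},
    ],
}


def find_receiver(route: dict, labels: Dict[str, str]) -> str:
    """Return the receiver of the deepest route whose matchers all apply."""
    for child in route.get("routes", []):
        matchers = child.get("match", {})
        if all(labels.get(key) == value for key, value in matchers.items()):
            # Descend: a more specific child may override this receiver.
            return find_receiver(child, labels)
    # No child matched, so this node's receiver handles the alert.
    return route["receiver"]
```

For example, `find_receiver(ROUTING_TREE, {"severity": "critical", "team": "search"})` falls through the payments child and returns the general critical receiver, while an alert with no matching severity ends up at the root's default receiver.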
On-Call Rotation and Escalation
On-call schedules define who is primary and secondary for each time window. Escalation policies define what happens if the primary does not acknowledge within N minutes:
- 0 min: Page primary on-call via PagerDuty mobile push + phone call.
- 5 min: No ack → page secondary on-call.
- 15 min: No ack → page engineering manager + team Slack channel.
- 30 min: No ack → page VP Engineering + incident commander.
Acknowledgment stops escalation. Resolution closes the incident and records MTTA (mean time to acknowledge) and MTTR (mean time to resolve) metrics.
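The escalation policy above can be modeled as an ordered list of (delay, target) steps, with acknowledgment cutting off every later step. The constant `ESCALATION_POLICY` and the function `targets_paged` are illustrative names, and times are minutes since the alert fired.

```python
from typing import List, Optional, Tuple

# (minutes after firing, who gets paged), taken from the policy above.
ESCALATION_POLICY: List[Tuple[int, str]] = [
    (0, "primary on-call"),
    (5, "secondary on-call"),
    (15, "engineering manager + team Slack channel"),
    (30, "VP Engineering + incident commander"),
]


def targets_paged(elapsed_min: float,
                  acked_at_min: Optional[float] = None) -> List[str]:
    """List everyone paged so far, given the elapsed time and optional ack time."""
    paged: List[str] = []
    for delay, target in ESCALATION_POLICY:
        if delay > elapsed_min:
            break  # this step has not been reached yet
        if acked_at_min is not None and acked_at_min <= delay:
            break  # acknowledged before this step was due: escalation stops
        paged.append(target)
    return paged
```

With no acknowledgment, 20 minutes in, the first three tiers have been paged; if the primary acknowledges at minute 3, nobody beyond the primary is ever contacted.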
Alert Fatigue Prevention
- Silences: Temporarily suppress alerts for known maintenance or known issues being worked. Time-limited with a required comment and owner.
- Maintenance windows: Schedule recurring silences for planned maintenance (e.g., every Sunday 2–4 AM for DB backups). Automatically expire.
- Dependency-aware grouping: Build a service dependency map and suppress leaf alerts when the root dependency is alerting — similar to inhibition but derived from topology rather than manually configured rules.
- Alert review cadence: Monthly audit of firing rate per alert rule. Rules with high false-positive rates get tuned (raise threshold, increase for_duration) or deleted. Alerts that never fire get reviewed for relevance.
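A time-limited silence, as described in the first bullet above, can be sketched as a record with label matchers, a validity window, and the required owner and comment. The `Silence` class and its field names are hypothetical, and timestamps are plain numbers supplied by the caller.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Silence:
    """A time-boxed mute for alerts whose labels match (illustrative only)."""
    matchers: Dict[str, str]
    starts_at: float
    ends_at: float      # silences always expire; there is no open-ended form
    owner: str          # who created it
    comment: str        # why it exists

    def mutes(self, labels: Dict[str, str], now: float) -> bool:
        """True if the silence is active at `now` and all matchers apply."""
        in_window = self.starts_at <= now < self.ends_at
        matched = all(labels.get(k) == v for k, v in self.matchers.items())
        return in_window and matched
```

A recurring maintenance window can then be expressed as a scheduled job that creates one such silence per occurrence, each with its own expiry.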
Trade-offs and Failure Modes
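- Alertmanager HA: Run multiple Alertmanager instances in a cluster that uses a gossip protocol (Memberlist) to share deduplication state. All instances receive all alerts; gossip lets the instances coordinate so that normally only one sends each notification. During a network partition, duplicate notifications are possible, which is the safer failure than a missed page.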
- Metrics store outage: If Prometheus is down, alert evaluation stops and no alert can fire. The “dead man's switch” pattern covers this: a Watchdog alert always fires, and an external service (e.g., Dead Man's Snitch) expects a regular heartbeat from it. If the heartbeat stops, the external service pages the team.
- Notification channel outage: If PagerDuty is down during an incident, pages are not delivered. Maintain a secondary receiver (direct SMS, backup email) for critical alerts, and test the backup path monthly.
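The external side of the dead man's switch in the list above reduces to a timeout on the last heartbeat received. The sketch below is a minimal model of that check; `HeartbeatMonitor` is a hypothetical class, not the API of any real heartbeat service, and times are seconds supplied by the caller.

```python
class HeartbeatMonitor:
    """Pages when an always-firing alert stops checking in (illustrative)."""

    def __init__(self, timeout_s: float, now: float) -> None:
        self.timeout_s = timeout_s
        self.last_seen = now  # treat creation time as the first heartbeat

    def heartbeat(self, now: float) -> None:
        """Record that the Watchdog alert was delivered at time `now`."""
        self.last_seen = now

    def is_dead(self, now: float) -> bool:
        """True if no heartbeat has arrived within the timeout."""
        return now - self.last_seen > self.timeout_s
```

The timeout should be comfortably longer than the interval at which the Watchdog notification repeats, otherwise normal jitter in delivery would trigger false pages.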