Alerting System: Goals and Challenges
An alerting system monitors metrics and fires notifications when conditions violate defined thresholds. The central challenges are minimizing false positives (alert fatigue kills on-call teams), ensuring no true positive is missed (SLA breach), and routing alerts to the right people with enough context to act immediately.
Alert Rule Schema
CREATE TABLE alert_rules (
  id BIGSERIAL PRIMARY KEY,
  name TEXT,          -- "HighErrorRate"
  expr TEXT,          -- PromQL: error_rate > 0.05
  for_duration INTERVAL, -- must be true for this long before firing
  severity TEXT,      -- critical, warning, info
  labels JSONB,       -- {team: "payments", service: "checkout"}
  annotations JSONB,  -- {summary: "Error rate above 5%", runbook: "https://..."}
  receiver TEXT       -- routing key
)
The for duration is the most important parameter for avoiding flapping. A spike that lasts 30 seconds does not page anyone. A condition sustained for 5 minutes means something is genuinely wrong.
Alert State Machine
Each alert rule instance (per unique label set) moves through states:
- INACTIVE: Condition is not met. No alert.
- PENDING: Condition is met, but the for duration has not elapsed. Alertmanager is not notified yet.
- FIRING: Condition has been met continuously for the full for duration. Notification sent.
- RESOLVED: Condition is no longer met after being FIRING. Resolution notification sent.
The PENDING state is what prevents paging on transient spikes. It also means alerts have inherent detection latency equal to the evaluation interval plus the for duration — factor this into SLA definitions.
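The state machine above can be sketched as a small class. This is a minimal illustration, not Prometheus's actual implementation; the class and method names are invented for the example.

```python
import time
from enum import Enum
from typing import Optional

class AlertState(Enum):
    INACTIVE = "inactive"
    PENDING = "pending"
    FIRING = "firing"

class AlertInstance:
    """Tracks one (rule, label_set) instance through the alert state machine."""

    def __init__(self, for_duration_s: float):
        self.for_duration_s = for_duration_s
        self.state = AlertState.INACTIVE
        self.pending_since: Optional[float] = None

    def evaluate(self, condition_met: bool, now: float) -> Optional[str]:
        """Run one evaluation tick; return 'FIRING' or 'RESOLVED' when a
        notification should be emitted, else None."""
        if condition_met:
            if self.state == AlertState.INACTIVE:
                # Threshold just breached: start the debounce clock.
                self.state = AlertState.PENDING
                self.pending_since = now
            if (self.state == AlertState.PENDING
                    and now - self.pending_since >= self.for_duration_s):
                self.state = AlertState.FIRING
                return "FIRING"
        else:
            # Condition cleared: a PENDING spike resolves silently,
            # a FIRING alert emits a resolution notification.
            was_firing = self.state == AlertState.FIRING
            self.state = AlertState.INACTIVE
            self.pending_since = None
            if was_firing:
                return "RESOLVED"
        return None
```

Note how a transient spike that clears while still PENDING never produces a notification, which is exactly the debounce behavior described above.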
Alert Evaluation Pipeline
The evaluation loop runs on a configurable schedule (typically every 15–60 seconds):
- Query the metrics store with the rule's PromQL expression.
- For each result label set, compare against the threshold.
- Update the state machine for that (rule, label_set) instance.
- Emit FIRING or RESOLVED events to Alertmanager.
Rules are evaluated in parallel groups. Rules within a group share the same evaluation interval and are evaluated sequentially to avoid partial-state reads (e.g., a recording rule result must be written before a rule that reads it).
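One evaluation tick can be sketched as follows. For brevity this version fires immediately on a breached threshold (no PENDING debounce), and `query_metrics` is a hypothetical stand-in for the metrics-store client, assumed to return a mapping from label sets to values.

```python
def evaluate_rule(rule, states, query_metrics):
    """One evaluation tick: query, compare against the threshold, diff the
    per-label-set state, and collect FIRING/RESOLVED events to emit."""
    events = []
    results = query_metrics(rule["expr"])  # {label_set_tuple: value}
    seen = set()
    for labels, value in results.items():
        seen.add(labels)
        breached = value > rule["threshold"]
        prev = states.get(labels, "INACTIVE")
        if breached and prev != "FIRING":
            states[labels] = "FIRING"
            events.append(("FIRING", labels))
        elif not breached and prev == "FIRING":
            states[labels] = "INACTIVE"
            events.append(("RESOLVED", labels))
    # Label sets that vanished from the query result also resolve.
    for labels in [l for l, s in states.items()
                   if s == "FIRING" and l not in seen]:
        states[labels] = "INACTIVE"
        events.append(("RESOLVED", labels))
    return events
```

The caller would run this per rule on the group's evaluation interval and forward the returned events to Alertmanager.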
Deduplication via Fingerprinting
An alert fingerprint is derived from the alert name and its label set: fingerprint = hash(alertname + sorted_labels). Alertmanager uses the fingerprint to deduplicate: if an alert with the same fingerprint is already FIRING, do not send another notification. This prevents notification storms when the evaluation loop fires repeatedly for the same ongoing condition.
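A fingerprint function along these lines (the exact hash and separator are illustrative, not Alertmanager's actual scheme):

```python
import hashlib

def fingerprint(alertname: str, labels: dict) -> str:
    """Deterministic dedup key: hash of the alert name plus sorted label
    pairs, so label ordering never changes the result."""
    parts = [alertname] + [f"{k}={v}" for k, v in sorted(labels.items())]
    return hashlib.sha256("\x1f".join(parts).encode()).hexdigest()[:16]
```

Sorting the labels is the essential detail: two evaluations of the same ongoing condition must hash identically regardless of the order labels were attached.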
Alert Grouping
When many alerts fire simultaneously (e.g., a datacenter goes down), sending one notification per alert floods on-call with noise. Grouping combines related alerts into a single notification.
route:
  group_by: [alertname, datacenter]
  group_wait: 30s      # wait for more alerts before sending
  group_interval: 5m   # send updates every 5 minutes if group changes
  repeat_interval: 4h  # resend if still firing after 4 hours
A “datacenter down” group might contain 50 individual service alerts, but the on-call receives one message: “50 alerts firing in dc-east, root cause: datacenter outage.”
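Grouping reduces to bucketing firing alerts by the values of the group_by labels, one notification per bucket. A minimal sketch:

```python
from collections import defaultdict

def group_alerts(alerts, group_by):
    """Bucket firing alerts by their values for the group_by labels;
    each bucket becomes a single notification."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple((lbl, alert["labels"].get(lbl, "")) for lbl in group_by)
        groups[key].append(alert)
    return dict(groups)
```

(The real Alertmanager additionally applies the group_wait/group_interval timers per bucket before sending; this sketch shows only the key computation.)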
Inhibition Rules
Inhibition suppresses child alerts when a parent alert is firing, preventing redundant pages. Example: if a host is down (parent), suppress all service alerts on that host (children) — the engineer already knows the host is down.
inhibit_rules:
  - source_match:
      alertname: HostDown
    target_match:
      severity: warning
    equal: [datacenter, host]
The equal field ensures inhibition only applies when source and target alerts share the same label values — a host down in dc-east does not inhibit alerts from dc-west.
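The check behind a rule like this can be sketched as a predicate (a simplified illustration of the semantics, not Alertmanager's code):

```python
def is_inhibited(target, firing_sources, rule):
    """True if the target matches target_match and some firing source alert
    matches source_match while sharing the values of every `equal` label."""
    # Target must match the target_match labels at all.
    if any(target["labels"].get(k) != v
           for k, v in rule["target_match"].items()):
        return False
    for src in firing_sources:
        matches_source = all(src["labels"].get(k) == v
                             for k, v in rule["source_match"].items())
        same_scope = all(src["labels"].get(l) == target["labels"].get(l)
                         for l in rule["equal"])
        if matches_source and same_scope:
            return True
    return False
```

The `equal` comparison is what scopes the suppression: the source and target must agree on datacenter and host, so an outage in one datacenter cannot silence another.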
Routing Tree
Alertmanager routes alerts through a tree of matchers to receivers. Each node matches on labels and forwards to a receiver (PagerDuty, Slack, email, webhook) or a child route for more specific matching.
route:
  receiver: default-slack
  routes:
    - match: {severity: critical}
      receiver: pagerduty-critical
      routes:
        - match: {team: payments}
          receiver: pagerduty-payments
    - match: {severity: warning}
      receiver: slack-warnings
Critical payments alerts go to the payments team's PagerDuty. Critical alerts from other teams go to the general critical PagerDuty. Warnings go to Slack for async review.
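The tree walk itself is a short recursion: descend into the first matching child, and fall back to the current node's receiver when no child matches. (This sketch omits Alertmanager's `continue` flag, which lets an alert match multiple sibling routes.)

```python
def route_alert(node, labels):
    """Walk the routing tree: the deepest matching route wins; otherwise
    the current node's receiver applies."""
    for child in node.get("routes", []):
        if all(labels.get(k) == v for k, v in child.get("match", {}).items()):
            return route_alert(child, labels)
    return node["receiver"]
```

Running it against the tree above reproduces the routing described: payments criticals reach the team's PagerDuty, other criticals the general one, warnings go to Slack, and everything else falls through to the root receiver.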
On-Call Rotation and Escalation
On-call schedules define who is primary and secondary for each time window. Escalation policies define what happens if the primary does not acknowledge within N minutes:
- 0 min: Page primary on-call via PagerDuty mobile push + phone call.
- 5 min: No ack → page secondary on-call.
- 15 min: No ack → page engineering manager + team Slack channel.
- 30 min: No ack → page VP Engineering + incident commander.
Acknowledgment stops escalation. Resolution closes the incident and records MTTA (mean time to acknowledge) and MTTR (mean time to resolve) metrics.
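The ladder above reduces to a list of (delay, target) tiers; given the minutes elapsed since the first page, every tier whose delay has passed should have been notified, and an acknowledgment halts further escalation. A minimal sketch with invented names:

```python
def escalation_targets(policy, minutes_since_page, acked):
    """Return every tier that should have been paged by now.
    An acknowledgment stops all further escalation."""
    if acked:
        return []
    return [target for delay, target in policy if minutes_since_page >= delay]
```

In a real pager, this check would run on a timer and diff against tiers already notified, only paging the newly reached ones.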
Alert Fatigue Prevention
- Silences: Temporarily suppress alerts for known maintenance or known issues being worked. Time-limited with a required comment and owner.
- Maintenance windows: Schedule recurring silences for planned maintenance (e.g., every Sunday 2–4 AM for DB backups). Automatically expire.
- Dependency-aware grouping: Build a service dependency map and suppress leaf alerts when the root dependency is alerting — similar to inhibition but derived from topology rather than manually configured rules.
- Alert review cadence: Monthly audit of firing rate per alert rule. Rules with high false-positive rates get tuned (raise the threshold, increase the for duration) or deleted. Alerts that never fire get reviewed for relevance.
Trade-offs and Failure Modes
- Alertmanager HA: Run multiple Alertmanager instances in a cluster using a gossip protocol (Memberlist) for deduplication state. All instances receive all alerts; gossip ensures only one sends the notification.
- Metrics store outage: If Prometheus is down, alert evaluation stops. “Dead man's switch” pattern: a Watchdog alert always fires. An external service (Deadman's Snitch) expects a heartbeat from this alert; if the heartbeat stops, the external service pages the team.
- Notification channel outage: PagerDuty down during an incident. Maintain a secondary receiver (direct SMS, backup email) for critical alerts. Test the backup path monthly.