What Is an Alerting Service?
An alerting service continuously evaluates metric streams against configured rules, fires alert notifications when conditions are met, groups related alerts to reduce noise, and routes notifications through the appropriate channels to the right recipients. It sits between the monitoring system (which collects metrics) and the notification channels (PagerDuty, Slack, email, SMS), applying intelligence around deduplication, grouping, and escalation so on-call engineers receive actionable signals rather than alert floods.
Requirements
Functional Requirements
- Define alert rules with threshold-based conditions (metric exceeds value for duration) or anomaly detection (metric deviates from baseline by a configured percentage).
- Evaluate rules against incoming metric data at configurable intervals.
- Group firing alerts that share configured labels (e.g. service, environment) into a single grouped alert.
- Deduplicate repeated alerts for the same rule within a configurable repeat interval.
- Route grouped alerts to notification channels based on configurable routing policies.
- Apply escalation policies: if an alert is not acknowledged within N minutes, route to the next tier.
Non-Functional Requirements
- Rule evaluation must complete within 10 seconds of the configured evaluation interval.
- Alert state must be persisted so service restarts do not cause spurious firing or missed alerts.
- Support 10,000 active rules evaluating against a shared metric stream.
- Notification delivery must achieve at-least-once semantics.
Data Model
AlertRule
- rule_id (UUID), name, owner_id
- metric_query (string: PromQL expression or equivalent)
- condition (JSONB: operator (gt/lt/eq), threshold, duration_seconds)
- labels (JSONB: key-value map for grouping, e.g. {service: "payments", env: "prod"})
- severity (ENUM: info, warning, critical)
- evaluation_interval_seconds (integer)
- for_duration_seconds (integer: how long condition must hold before firing)
AlertInstance
- instance_id (UUID), rule_id
- fingerprint (SHA-256 of rule_id + label values: used for deduplication)
- state (ENUM: pending, firing, resolved)
- started_at, resolved_at
- last_notified_at
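The fingerprint field above can be computed in a few lines. A minimal sketch (the separator and label canonicalization are assumptions; any deterministic encoding works, as long as label ordering cannot change the result):

```python
import hashlib

def fingerprint(rule_id: str, labels: dict[str, str]) -> str:
    """Deterministic dedup key: SHA-256 of rule_id plus sorted label pairs."""
    # Sort labels so that map iteration order never changes the fingerprint.
    parts = [rule_id] + [f"{k}={labels[k]}" for k in sorted(labels)]
    return hashlib.sha256("\x1f".join(parts).encode()).hexdigest()

fp1 = fingerprint("rule-42", {"service": "payments", "env": "prod"})
fp2 = fingerprint("rule-42", {"env": "prod", "service": "payments"})
assert fp1 == fp2  # label order does not matter
```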
AlertGroup
- group_id (UUID)
- group_key (hash of grouping label values)
- instance_ids (array of AlertInstance IDs in this group)
- route_id (foreign key to the matched notification route)
- acknowledged_at, acknowledged_by
Core Algorithms
Rule Evaluation
Each rule is assigned to an evaluation worker. The worker queries the metrics backend at the configured interval, evaluates the condition, and manages the alert state machine:
- If the condition is true and no AlertInstance exists: create an instance in state PENDING.
- If PENDING and the condition has held for for_duration_seconds: transition to FIRING and trigger grouping.
- If FIRING and the condition becomes false: transition to RESOLVED and notify.
- If PENDING and the condition clears before for_duration_seconds: delete the instance without firing.
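The four transitions above can be sketched as a single evaluation tick. This is an illustrative skeleton, not a full implementation (persistence and the metrics query are omitted; the `evaluate` signature is an assumption):

```python
from enum import Enum

class State(Enum):
    PENDING = "pending"
    FIRING = "firing"
    RESOLVED = "resolved"

class AlertInstance:
    def __init__(self, now: float):
        self.state = State.PENDING
        self.started_at = now

def evaluate(instance, condition_true: bool, for_duration: float, now: float):
    """One tick of the state machine; returns (instance, event)."""
    if instance is None:
        # Condition newly true: enter PENDING; never fire immediately.
        return (AlertInstance(now), None) if condition_true else (None, None)
    if instance.state is State.PENDING:
        if not condition_true:
            return None, None  # cleared before for_duration: drop silently
        if now - instance.started_at >= for_duration:
            instance.state = State.FIRING
            return instance, "fire"  # hand off to the grouping engine
    elif instance.state is State.FIRING and not condition_true:
        instance.state = State.RESOLVED
        return instance, "resolve"  # send resolution notification
    return instance, None
```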
Alert Grouping and Deduplication
When an instance transitions to FIRING, the grouping engine computes a group_key by hashing the values of the grouping labels defined in the route configuration. If an AlertGroup with that group_key exists and is within the active window, the new instance is added to it. Otherwise a new group is created. The group is notified only once per group_wait period (default 30 seconds for initial fire) and once per group_interval period (default 5 minutes for subsequent updates). This batches related alerts into a single notification.
Deduplication is handled via the fingerprint field: if an instance with the same fingerprint is already FIRING, no duplicate notification is sent until repeat_interval (default 4 hours) has elapsed since last_notified_at.
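The group_wait / group_interval throttling described above reduces to a small timing check. A sketch under the stated defaults (the `Group` shape is an assumption; `repeat_interval` dedup follows the same pattern against `last_notified_at` on the instance):

```python
from dataclasses import dataclass
from typing import Optional

GROUP_WAIT = 30        # seconds to batch alerts before the first notification
GROUP_INTERVAL = 300   # minimum seconds between subsequent notifications

@dataclass
class Group:
    created_at: float
    last_notified_at: Optional[float] = None

def should_notify(group: Group, now: float) -> bool:
    """May this group send a notification at time `now`?"""
    if group.last_notified_at is None:
        # Initial fire: hold for group_wait so related alerts batch together.
        return now - group.created_at >= GROUP_WAIT
    # Subsequent updates: at most one notification per group_interval.
    return now - group.last_notified_at >= GROUP_INTERVAL
```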
Anomaly Detection Rules
For anomaly rules, the condition replaces a fixed threshold with a dynamic baseline computed over a lookback window (e.g. same hour in the past 7 days). The evaluation worker computes the baseline mean and standard deviation from historical metric data, then fires if the current value deviates by more than a configured number of standard deviations (typically 3-sigma).
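The 3-sigma check can be sketched as a z-score over the baseline window (how the baseline samples are fetched is left out; the zero-stdev handling is an assumption):

```python
import statistics

def is_anomalous(current: float, baseline: list[float], n_sigma: float = 3.0) -> bool:
    """Fire if `current` deviates from the baseline mean by > n_sigma stddevs."""
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    if stdev == 0:
        # Degenerate (perfectly flat) baseline: any change is anomalous.
        return current != mean
    return abs(current - mean) / stdev > n_sigma
```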
Scalability
Rules are distributed across evaluation workers using consistent hashing on rule_id. Each worker owns a subset of rules and maintains their alert state in memory, persisting state changes to PostgreSQL on each transition. Worker membership is tracked via a service registry (etcd or Consul). When a worker joins or leaves, rules are resharded and new owners reload state from PostgreSQL before beginning evaluation.
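A minimal consistent-hash ring illustrating the assignment above (virtual nodes smooth the distribution; the vnode count and hash truncation are assumptions). The key property is that removing a worker only reassigns the rules that worker owned:

```python
import bisect
import hashlib

class HashRing:
    """Consistent-hash ring; each worker contributes `vnodes` virtual points."""

    def __init__(self, workers: list[str], vnodes: int = 100):
        self._ring = sorted(
            (self._h(f"{w}#{i}"), w) for w in workers for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _h(s: str) -> int:
        return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

    def owner(self, rule_id: str) -> str:
        # Walk clockwise to the first virtual node at or after the rule's hash.
        idx = bisect.bisect(self._keys, self._h(rule_id)) % len(self._keys)
        return self._ring[idx][1]
```

For example, dropping `w3` from a three-worker ring leaves every rule previously owned by `w1` or `w2` on its original worker, so only the crashed worker's rules need state reloaded.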
API Design
- POST /rules — create a new alert rule.
- GET /rules/{rule_id}/state — return current evaluation state and any firing instances.
- GET /alerts — list all currently firing alert groups with pagination and label filters.
- POST /alerts/{group_id}/acknowledge — acknowledge a group, suppressing escalation for a configured period.
- POST /routes — create a routing policy mapping label selectors to notification channels.
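An illustrative `POST /rules` request body, with field names taken from the AlertRule data model above (all values are examples, not a prescribed schema):

```json
{
  "name": "payments-p99-latency-high",
  "owner_id": "team-payments",
  "metric_query": "histogram_quantile(0.99, http_request_duration_seconds)",
  "condition": { "operator": "gt", "threshold": 0.5, "duration_seconds": 300 },
  "labels": { "service": "payments", "env": "prod" },
  "severity": "critical",
  "evaluation_interval_seconds": 60,
  "for_duration_seconds": 300
}
```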
Failure Modes
- Metrics backend unavailable: Rule evaluation is skipped for the affected interval. The alert state machine does not advance (no spurious FIRING or RESOLVED transitions). An internal alert fires when the backend is unreachable for more than 2 consecutive evaluation cycles.
- Notification channel unreachable: Delivery is retried with exponential backoff. If delivery fails for more than 10 minutes, the alert is escalated to a fallback channel (e.g. email when Slack is down).
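The retry behavior for an unreachable channel can be sketched as an exponential backoff schedule with jitter (the cap, base, and full-jitter strategy are assumptions; in practice the 10-minute budget would be measured against wall-clock time, not just accumulated sleep):

```python
import random

def backoff_schedule(base: float = 1.0, cap: float = 60.0,
                     max_elapsed: float = 600.0):
    """Yield retry delays: exponential growth with full jitter, capped at
    `cap`, until roughly `max_elapsed` seconds have been spent waiting."""
    delay, elapsed = base, 0.0
    while elapsed < max_elapsed:
        sleep = random.uniform(0, min(cap, delay))  # full jitter
        yield sleep
        elapsed += sleep
        delay *= 2  # exponential growth of the jitter window
```

Once the schedule is exhausted without a successful delivery, the escalation to the fallback channel described above takes over.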
- Evaluation worker crash: After the heartbeat lease expires, the coordinator reassigns the dead worker's rules to surviving workers. Those workers reload state from PostgreSQL and resume evaluation. The target is at most one missed evaluation per rule: a single cycle may be skipped during reassignment.
Observability
Track active alert count by severity, rule evaluation success rate, evaluation latency, notification delivery rate and failure count, deduplication rate (duplicate firings suppressed), and escalation rate. Alert on the alerting service itself: if rule evaluation latency exceeds 2x the evaluation interval, the service is falling behind and rules may miss firing windows.