What Is an Anomaly Detection Service?
An anomaly detection service monitors time-series metrics and identifies values that deviate significantly from expected behavior, using statistical baselines, Z-score and IQR methods, and seasonal decomposition to distinguish real anomalies from noise.
Anomaly Types
- Point anomaly: A single data point is far from the rest of the distribution.
- Contextual anomaly: A value is normal globally but anomalous for its specific context (e.g., time of day, region).
- Collective anomaly: A sequence of data points is anomalous as a group, even if individual points appear normal.
Baseline Computation
Compute rolling 30-day mean and standard deviation per metric+tags combination, bucketed by hour-of-week (168 buckets, one per hour across Monday-Sunday):
MetricBaseline (
metric_name VARCHAR(255),
tags_hash VARCHAR(64), -- hash of sorted tag key=value pairs
hour_bucket INT, -- 0-167 (day*24 + hour)
mean DOUBLE PRECISION,
stddev DOUBLE PRECISION,
q1 DOUBLE PRECISION,
q3 DOUBLE PRECISION,
sample_count INT,
updated_at TIMESTAMPTZ,
PRIMARY KEY (metric_name, tags_hash, hour_bucket)
)
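The two derived key columns can be sketched as follows. This is a minimal illustration, not the service's actual code: the function names, the choice of UTC for bucketing, the comma separator, and SHA-256 as the hash are all assumptions (SHA-256 was picked only because its hex digest fits VARCHAR(64)).

```python
import hashlib
from datetime import datetime, timezone


def hour_of_week_bucket(ts: datetime) -> int:
    """Return 0-167 (day*24 + hour), where 0 is Monday 00:00.

    Assumes ts is timezone-aware; bucketing in UTC is an assumption.
    """
    ts_utc = ts.astimezone(timezone.utc)
    return ts_utc.weekday() * 24 + ts_utc.hour  # weekday(): Monday == 0


def tags_hash(tags: dict[str, str]) -> str:
    """Hash the sorted key=value pairs so tag order does not matter."""
    canonical = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Sorting before hashing means {"region": "eu", "host": "a"} and {"host": "a", "region": "eu"} map to the same baseline row.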
Z-Score Detection
Compare the current value against the hour-of-week baseline:
z_score = |current_value - mean| / stddev
if z_score > 3.0:
    flag as anomaly
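Under a normal distribution, about 99.7% of values fall within three standard deviations of the mean, so a threshold of 3.0 flags only the most extreme ~0.3% of values. If stddev is zero (a constant metric), the Z-score is undefined and the check should be skipped.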
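The check above can be written as a small function. This is a sketch under stated assumptions: the function names are hypothetical, and returning None for a zero standard deviation is one possible policy, not something the design prescribes.

```python
from typing import Optional


def z_score(value: float, mean: float, stddev: float) -> Optional[float]:
    """Absolute Z-score, or None when the baseline has no spread."""
    if stddev <= 0:
        return None  # assumption: skip detection rather than divide by zero
    return abs(value - mean) / stddev


def is_z_anomaly(value: float, mean: float, stddev: float,
                 threshold: float = 3.0) -> bool:
    """True when the value lies strictly more than `threshold` stddevs away."""
    z = z_score(value, mean, stddev)
    return z is not None and z > threshold
```

With a baseline of mean 100 and stddev 10, a value of 131 is flagged, while 130 (exactly three standard deviations) is not, because the comparison is strict.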
IQR Detection
Use the interquartile range (IQR) for metrics whose values are not normally distributed, such as skewed or heavy-tailed metrics:
IQR = Q3 - Q1
lower_fence = Q1 - 1.5 * IQR
upper_fence = Q3 + 1.5 * IQR
if value < lower_fence or value > upper_fence:
    flag as anomaly
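A minimal sketch of the fence check, using the q1 and q3 columns stored in MetricBaseline. The function names and the configurable multiplier k are illustrative assumptions; 1.5 is the multiplier given above.

```python
def iqr_fences(q1: float, q3: float, k: float = 1.5) -> tuple[float, float]:
    """Return (lower_fence, upper_fence) for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def is_iqr_anomaly(value: float, q1: float, q3: float, k: float = 1.5) -> bool:
    """True when the value falls outside either fence."""
    lower, upper = iqr_fences(q1, q3, k)
    return value < lower or value > upper
```

For q1 = 10 and q3 = 20 the fences are -5 and 35, so 36 is flagged and 35 is not.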
Seasonal Decomposition (STL)
For metrics with strong weekly or daily seasonality, apply STL (Seasonal and Trend decomposition using Loess):
observed = trend + seasonality + residual
-- Detect anomalies only in the residual component
residual_z = |residual| / stddev(residual)
if residual_z > 3.0:
    flag as anomaly
This prevents seasonal spikes (e.g., Monday morning traffic) from being misclassified as anomalies.
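To illustrate residual-based detection without external dependencies, the sketch below uses a classical additive decomposition (centered moving-average trend, per-phase mean seasonality). This is a simpler stand-in, not STL itself, which fits the components with Loess; a real deployment would more likely use a library implementation such as statsmodels' STL. The function names are hypothetical.

```python
from statistics import mean, pstdev


def decompose_additive(series: list[float], period: int):
    """Split series into trend, seasonal, and residual components.

    Trend is a centered moving average over one period, so it is undefined
    (None) for the first and last period // 2 points.
    """
    n = len(series)
    if n < 2 * period:
        raise ValueError("need at least two full periods of data")
    half = period // 2
    trend = [None] * n
    for i in range(half, n - half):
        window = series[i - half : i + half + 1]
        total = sum(window)
        if period % 2 == 0:
            # Even period: half-weight the two end points (2 x period MA).
            total -= 0.5 * (window[0] + window[-1])
        trend[i] = total / period
    detrended = {i: series[i] - trend[i] for i in range(n) if trend[i] is not None}
    # Seasonal value for each phase = mean detrended value at that phase.
    phase_means = [
        mean([v for i, v in detrended.items() if i % period == p])
        for p in range(period)
    ]
    offset = mean(phase_means)  # center so the seasonal component sums to ~0
    seasonal = [phase_means[i % period] - offset for i in range(n)]
    residual = {i: v - seasonal[i] for i, v in detrended.items()}
    return trend, seasonal, residual


def residual_anomalies(series: list[float], period: int,
                       threshold: float = 3.0) -> list[int]:
    """Indices whose residual exceeds `threshold` residual stddevs."""
    _, _, residual = decompose_additive(series, period)
    sd = pstdev(list(residual.values()))
    if sd == 0:
        return []  # perfectly regular series: nothing to flag
    return sorted(i for i, r in residual.items() if abs(r) / sd > threshold)
```

Because the recurring pattern is absorbed into the seasonal component, a peak that repeats every period leaves a near-zero residual, while a one-off spike of similar size does not.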
Anomaly Table
Anomaly (
id UUID PRIMARY KEY,
metric_name VARCHAR(255),
tags JSONB,
value DOUBLE PRECISION,
expected_value DOUBLE PRECISION,
z_score DOUBLE PRECISION,
detected_at TIMESTAMPTZ,
status VARCHAR(20) -- open | acknowledged | resolved
)
Alert Routing
When an anomaly is recorded, the alert routing layer:
- Matches the anomaly against configured alert rules (metric name + tag filters).
- Checks for active suppression windows (maintenance, deployments).
- Sends a notification via PagerDuty or Slack including metric name, observed value, expected value, and a link to a chart URL showing the surrounding window.
AlertRule (
id UUID PRIMARY KEY,
metric_name VARCHAR(255),
tag_filters JSONB,
channel VARCHAR(50), -- pagerduty | slack
destination VARCHAR(255),
enabled BOOLEAN
)
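The rule-matching step might look like the sketch below. The source does not define how tag_filters are evaluated; treating them as exact-match key/value pairs that must all be present in the anomaly's tags is an assumption, as are the class and function names.

```python
from dataclasses import dataclass, field


@dataclass
class AlertRule:
    metric_name: str
    tag_filters: dict = field(default_factory=dict)
    channel: str = "slack"  # pagerduty | slack
    destination: str = ""
    enabled: bool = True


def rule_matches(rule: AlertRule, metric_name: str, tags: dict) -> bool:
    """A rule matches when it is enabled, the metric name is equal, and
    every filter key/value pair appears in the anomaly's tags (assumed
    semantics)."""
    if not rule.enabled or rule.metric_name != metric_name:
        return False
    return all(tags.get(k) == v for k, v in rule.tag_filters.items())


def matching_rules(rules: list[AlertRule], metric_name: str,
                   tags: dict) -> list[AlertRule]:
    """Return every rule that should receive this anomaly."""
    return [r for r in rules if rule_matches(r, metric_name, tags)]
```

Under these semantics a rule with empty tag_filters matches every tag combination for its metric, and extra tags on the anomaly are ignored.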
Suppression Windows
Maintenance windows mute alerts for specified metrics and time ranges, preventing alert storms during planned outages or deployments.
SuppressionWindow (
id UUID PRIMARY KEY,
metric_name VARCHAR(255), -- null = all metrics
starts_at TIMESTAMPTZ,
ends_at TIMESTAMPTZ,
reason TEXT
)
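A suppression check against these rows could be sketched as follows. Treating the window as half-open (starts_at inclusive, ends_at exclusive) is an assumption, and the names are illustrative.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class SuppressionWindow:
    metric_name: Optional[str]  # None = all metrics
    starts_at: datetime
    ends_at: datetime
    reason: str = ""


def is_suppressed(windows: list[SuppressionWindow], metric_name: str,
                  at: datetime) -> bool:
    """True if any window covering `at` applies to this metric."""
    return any(
        (w.metric_name is None or w.metric_name == metric_name)
        and w.starts_at <= at < w.ends_at  # half-open interval (assumed)
        for w in windows
    )
```

The router would call this with the anomaly's detected_at before sending any notification, and skip delivery when it returns True.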
Key Design Considerations
- Baselines are updated incrementally using exponential moving averages rather than recalculated from scratch.
- New metrics with fewer than 7 days of data are excluded from alerting to avoid cold-start false positives.
- Anomaly status transitions (open → acknowledged → resolved) are append-only for audit trail purposes.
- Chart URLs embed the metric name, tags, and time window so on-call engineers see immediate context without additional queries.
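The first two considerations can be sketched together. The update below is the standard exponentially weighted mean/variance recurrence; the smoothing factor alpha, the first_seen timestamp (not a column in the schema above), and the function names are assumptions made for illustration.

```python
import math
from datetime import datetime, timedelta, timezone


def ema_update(mean: float, variance: float, value: float,
               alpha: float = 0.1) -> tuple[float, float]:
    """Fold one new observation into an exponentially weighted baseline.

    Returns the updated (mean, variance); stddev is sqrt(variance).
    """
    delta = value - mean
    new_mean = mean + alpha * delta
    new_variance = (1 - alpha) * (variance + alpha * delta * delta)
    return new_mean, new_variance


def eligible_for_alerting(first_seen: datetime, now: datetime,
                          min_days: int = 7) -> bool:
    """Cold-start guard: only alert once min_days of history exist."""
    return now - first_seen >= timedelta(days=min_days)
```

For example, starting from mean 10 and variance 0, folding in a value of 20 with alpha = 0.5 yields mean 15 and variance 25 (stddev 5). A larger alpha makes the baseline react faster to recent values at the cost of more noise.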