What Is an Anomaly Detection Service?
An anomaly detection service monitors time-series metrics and identifies values that deviate significantly from expected behavior. It combines statistical baselines, Z-score and IQR methods, and seasonal decomposition to distinguish real anomalies from noise.
Anomaly Types
- Point anomaly: A single data point is far from the rest of the distribution.
- Contextual anomaly: A value is normal globally but anomalous for its specific context (e.g., time of day, region).
- Collective anomaly: A sequence of data points is anomalous as a group, even if individual points appear normal.
Baseline Computation
Compute rolling 30-day mean and standard deviation per metric+tags combination, bucketed by hour-of-week (168 buckets, one per hour across Monday-Sunday):
MetricBaseline (
metric_name VARCHAR(255),
tags_hash VARCHAR(64), -- hash of sorted tag key=value pairs
hour_bucket INT, -- 0-167 (day*24 + hour)
mean DOUBLE PRECISION,
stddev DOUBLE PRECISION,
q1 DOUBLE PRECISION,
q3 DOUBLE PRECISION,
sample_count INT,
updated_at TIMESTAMPTZ,
PRIMARY KEY (metric_name, tags_hash, hour_bucket)
)
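The hour_bucket and tags_hash columns above can be derived as follows. This is a minimal sketch: the SHA-256 choice and the `key=value` canonical form are assumptions, since the schema only specifies "hash of sorted tag key=value pairs".

```python
import hashlib
from datetime import datetime, timezone

def hour_bucket(ts: datetime) -> int:
    """Map a timestamp to one of 168 hour-of-week buckets (Monday 00:00 = bucket 0)."""
    return ts.weekday() * 24 + ts.hour

def tags_hash(tags: dict) -> str:
    """Hash sorted tag key=value pairs so equivalent tag sets share one baseline row."""
    canonical = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return hashlib.sha256(canonical.encode()).hexdigest()

ts = datetime(2024, 1, 1, 9, 30, tzinfo=timezone.utc)  # 2024-01-01 is a Monday
hour_bucket(ts)  # → 9
```

Sorting before hashing matters: `{"host": "a", "region": "us"}` and `{"region": "us", "host": "a"}` must map to the same baseline row.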
Z-Score Detection
Compare the current value against the hour-of-week baseline:
z_score = |current_value - mean| / stddev
if z_score > 3.0:
flag as anomaly
A threshold of 3.0 flags values more than three standard deviations from the baseline mean; under a normal distribution, roughly 99.7% of values fall within that range, so only about 0.3% are flagged by chance.
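The check above can be sketched directly; the `stddev == 0` guard for flat baselines is an added assumption, not specified in the pseudocode, but some guard is needed to avoid division by zero.

```python
def z_score_anomaly(value: float, mean: float, stddev: float,
                    threshold: float = 3.0) -> bool:
    """Flag a value whose z-score against the hour-of-week baseline exceeds threshold."""
    if stddev == 0:
        # Constant baseline: any deviation from the mean is suspect,
        # but we cannot compute a z-score. Assumed handling.
        return value != mean
    return abs(value - mean) / stddev > threshold

z_score_anomaly(510.0, mean=200.0, stddev=50.0)  # → True  (z = 6.2)
z_score_anomaly(240.0, mean=200.0, stddev=50.0)  # → False (z = 0.8)
```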
IQR Detection
Use interquartile range for distributions that are not normally distributed:
IQR = Q3 - Q1
lower_fence = Q1 - 1.5 * IQR
upper_fence = Q3 + 1.5 * IQR
if value < lower_fence or value > upper_fence:
flag as anomaly
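The fence check maps straightforwardly to code; this is the standard Tukey 1.5×IQR rule using the q1/q3 columns stored in MetricBaseline.

```python
def iqr_anomaly(value: float, q1: float, q3: float, k: float = 1.5) -> bool:
    """Flag values outside Tukey's fences: [Q1 - k*IQR, Q3 + k*IQR]."""
    iqr = q3 - q1
    return value < q1 - k * iqr or value > q3 + k * iqr

iqr_anomaly(200.0, q1=40.0, q3=60.0)  # → True  (upper fence = 90)
iqr_anomaly(55.0, q1=40.0, q3=60.0)   # → False
```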
Seasonal Decomposition (STL)
For metrics with strong weekly or daily seasonality, apply STL (Seasonal and Trend decomposition using Loess):
observed = trend + seasonality + residual
-- Detect anomalies only in the residual component
residual_z = |residual| / stddev(residual)
if residual_z > 3.0:
flag as anomaly
This prevents seasonal spikes (e.g., Monday morning traffic) from being misclassified as anomalies.
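A production system would call a library STL implementation (for example, statsmodels' STL class) rather than hand-rolling Loess. As an illustrative stand-in under that caveat, the sketch below removes a per-bucket seasonal mean and z-scores the residuals; the residual check matches the pseudocode above, but the decomposition itself is deliberately simplified (no trend component).

```python
import statistics
from collections import defaultdict

def residual_anomalies(points: list[tuple[int, float]],
                       threshold: float = 3.0) -> list[int]:
    """Simplified seasonal adjustment: subtract each hour-of-week bucket's mean,
    then flag points whose residual z-score exceeds threshold.
    points: list of (hour_bucket, value) pairs. Returns flagged indices."""
    by_bucket: dict[int, list[float]] = defaultdict(list)
    for bucket, value in points:
        by_bucket[bucket].append(value)
    seasonal = {b: statistics.mean(vs) for b, vs in by_bucket.items()}
    residuals = [value - seasonal[bucket] for bucket, value in points]
    sd = statistics.pstdev(residuals)
    if sd == 0:
        return []
    return [i for i, r in enumerate(residuals) if abs(r) / sd > threshold]
```

With alternating bucket values like 100/10 (a strong "seasonal" pattern), only a point that breaks the pattern for its own bucket is flagged, not the regular swings between buckets.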
Anomaly Table
Anomaly (
id UUID PRIMARY KEY,
metric_name VARCHAR(255),
tags JSONB,
value DOUBLE PRECISION,
expected_value DOUBLE PRECISION,
z_score DOUBLE PRECISION,
detected_at TIMESTAMPTZ,
status VARCHAR(20) -- open | acknowledged | resolved
)
Alert Routing
When an anomaly is recorded, the alert routing layer:
- Matches the anomaly against configured alert rules (metric name + tag filters).
- Checks for active suppression windows (maintenance, deployments).
- Sends a notification via PagerDuty or Slack including metric name, observed value, expected value, and a link to a chart URL showing the surrounding window.
AlertRule (
id UUID PRIMARY KEY,
metric_name VARCHAR(255),
tag_filters JSONB,
channel VARCHAR(50), -- pagerduty | slack
destination VARCHAR(255),
enabled BOOLEAN
)
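The rule-matching step can be sketched against the AlertRule schema above. The field names follow that schema; treating tag_filters as a subset match (every filter key must equal the anomaly's tag value) is an assumption about semantics the text does not pin down.

```python
def matching_rules(anomaly: dict, rules: list[dict]) -> list[dict]:
    """Return enabled rules whose metric_name matches the anomaly and whose
    tag_filters are all satisfied by the anomaly's tags (subset match)."""
    matches = []
    for rule in rules:
        if not rule["enabled"]:
            continue
        if rule["metric_name"] != anomaly["metric_name"]:
            continue
        if all(anomaly["tags"].get(k) == v
               for k, v in rule["tag_filters"].items()):
            matches.append(rule)
    return matches
```

A rule with empty tag_filters matches every anomaly on its metric, which gives a natural catch-all behavior.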
Suppression Windows
Maintenance windows mute alerts for specified metrics and time ranges, preventing alert storms during planned outages or deployments.
SuppressionWindow (
id UUID PRIMARY KEY,
metric_name VARCHAR(255), -- null = all metrics
starts_at TIMESTAMPTZ,
ends_at TIMESTAMPTZ,
reason TEXT
)
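The suppression check before routing reduces to a time-overlap test against the SuppressionWindow rows; per the schema comment, a null metric_name means the window applies to all metrics.

```python
from datetime import datetime

def is_suppressed(metric_name: str, detected_at: datetime,
                  windows: list[dict]) -> bool:
    """True if any suppression window covers detected_at and applies
    to this metric (metric_name of None means all metrics)."""
    return any(
        w["starts_at"] <= detected_at <= w["ends_at"]
        and w["metric_name"] in (None, metric_name)
        for w in windows
    )
```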
Key Design Considerations
- Baselines are recomputed incrementally using exponential moving averages rather than full recalculation.
- New metrics with fewer than 7 days of data are excluded from alerting to avoid cold-start false positives.
- Anomaly status transitions (open → acknowledged → resolved) are append-only for audit trail purposes.
- Chart URLs embed the metric name, tags, and time window so on-call engineers see immediate context without additional queries.
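The incremental baseline update mentioned in the first bullet can be sketched as an exponentially weighted mean and variance, updated per observation instead of recomputing over 30 days of raw points. The specific update form and the alpha value here are assumptions; alpha controls how quickly old data decays.

```python
def ewma_update(mean: float, var: float, value: float,
                alpha: float = 0.05) -> tuple[float, float]:
    """Fold one observation into a baseline bucket incrementally:
    exponentially weighted mean and variance (stddev = sqrt(var))."""
    delta = value - mean
    new_mean = mean + alpha * delta
    new_var = (1 - alpha) * (var + alpha * delta * delta)
    return new_mean, new_var
```

Feeding a stream of near-identical values drives the mean toward that value and the variance toward zero, which is why the cold-start exclusion in the second bullet matters: early estimates are dominated by the first few points.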