Low Level Design: Anomaly Detection Service

What Is an Anomaly Detection Service?

An anomaly detection service monitors time-series metrics and identifies values that deviate significantly from expected behavior, using statistical baselines, Z-score and IQR methods, and seasonal decomposition to distinguish real anomalies from noise.

Anomaly Types

  • Point anomaly: A single data point is far from the rest of the distribution.
  • Contextual anomaly: A value is normal globally but anomalous for its specific context (e.g., time of day, region).
  • Collective anomaly: A sequence of data points is anomalous as a group, even if individual points appear normal.

Baseline Computation

Compute rolling 30-day mean and standard deviation per metric+tags combination, bucketed by hour-of-week (168 buckets, one per hour across Monday-Sunday):

MetricBaseline (
  metric_name   VARCHAR(255),
  tags_hash     VARCHAR(64),    -- hash of sorted tag key=value pairs
  hour_bucket   INT,            -- 0-167 (day*24 + hour)
  mean          DOUBLE PRECISION,
  stddev        DOUBLE PRECISION,
  q1            DOUBLE PRECISION,
  q3            DOUBLE PRECISION,
  sample_count  INT,
  updated_at    TIMESTAMPTZ,
  PRIMARY KEY (metric_name, tags_hash, hour_bucket)
)
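The two derived key columns can be computed as follows. This is a minimal sketch with hypothetical helper names; it assumes tags arrive as a dict and timestamps are timezone-aware UTC, and uses SHA-256 so the hex digest fits the `VARCHAR(64)` column:

```python
import hashlib
from datetime import datetime, timezone

def tags_hash(tags: dict) -> str:
    """Hash of sorted tag key=value pairs, matching the tags_hash column.
    Sorting first makes the hash independent of tag ordering."""
    canonical = ",".join(f"{k}={tags[k]}" for k in sorted(tags))
    return hashlib.sha256(canonical.encode()).hexdigest()  # 64 hex chars

def hour_bucket(ts: datetime) -> int:
    """0-167 bucket: weekday (Monday=0) * 24 + hour of day."""
    return ts.weekday() * 24 + ts.hour
```

Because the tag pairs are sorted before hashing, `{"region": "us", "host": "a"}` and `{"host": "a", "region": "us"}` map to the same baseline row.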

Z-Score Detection

Compare the current value against the hour-of-week baseline:

z_score = |current_value - mean| / stddev

if z_score > 3.0:
    flag as anomaly

A threshold of 3.0 flags values beyond three standard deviations from the baseline mean; under a normal distribution, 99.7% of values fall within that range, so roughly 0.3% of normal points would be flagged by chance.
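The check above can be sketched as a small predicate (a hypothetical helper, not part of any library). The zero-stddev guard is an assumption: a constant baseline is treated as non-anomalous here, though production code may want a separate rule for flat series:

```python
def is_zscore_anomaly(value: float, mean: float, stddev: float,
                      threshold: float = 3.0) -> bool:
    """Flag values more than `threshold` standard deviations from the
    hour-of-week baseline mean."""
    if stddev == 0:
        # Constant baseline: any deviation would divide by zero.
        return False
    return abs(value - mean) / stddev > threshold
```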

IQR Detection

Use interquartile range for distributions that are not normally distributed:

IQR = Q3 - Q1

lower_fence = Q1 - 1.5 * IQR
upper_fence = Q3 + 1.5 * IQR

if value < lower_fence or value > upper_fence:
    flag as anomaly
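The fence check (Tukey's fences) maps directly to a few lines of code. This is a sketch with a hypothetical function name, taking Q1 and Q3 from the baseline table:

```python
def is_iqr_anomaly(value: float, q1: float, q3: float, k: float = 1.5) -> bool:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    iqr = q3 - q1
    return value < q1 - k * iqr or value > q3 + k * iqr
```

With Q1=10 and Q3=20, the fences are [-5, 35], so 40 is flagged while 15 is not. Because the fences depend on quartiles rather than the mean, the method tolerates skewed and heavy-tailed distributions.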

Seasonal Decomposition (STL)

For metrics with strong weekly or daily seasonality, apply STL (Seasonal and Trend decomposition using Loess):

observed = trend + seasonality + residual

-- Detect anomalies only in the residual component
residual_z = |residual| / stddev(residual)
if residual_z > 3.0:
    flag as anomaly

This prevents seasonal spikes (e.g., Monday morning traffic) from being misclassified as anomalies.
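A simplified additive decomposition illustrates the idea. This is a stand-in for STL, not STL itself (real deployments would use a library implementation such as statsmodels' `STL`): trend via a centered moving average over one period, seasonality as the mean detrended value per phase, and anomalies flagged only on the residual z-score:

```python
import statistics

def residual_anomalies(values: list, period: int, threshold: float = 3.0) -> list:
    """Return indices whose residual z-score exceeds `threshold` after a
    naive additive trend/seasonal decomposition (simplified stand-in for STL)."""
    n = len(values)
    half = period // 2
    # Trend: centered moving average of ~one period (edges shrink the window).
    trend = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        trend.append(sum(values[lo:hi]) / (hi - lo))
    detrended = [v - t for v, t in zip(values, trend)]
    # Seasonality: mean detrended value for each phase of the period.
    seasonal = [statistics.mean(detrended[p::period]) for p in range(period)]
    residual = [detrended[i] - seasonal[i % period] for i in range(n)]
    sd = statistics.pstdev(residual)
    if sd == 0:
        return []
    return [i for i, r in enumerate(residual) if abs(r) / sd > threshold]
```

On a repeating series with one injected spike, only the spike survives decomposition: the periodic pattern is absorbed by the seasonal component, so it never reaches the residual check.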

Anomaly Table

Anomaly (
  id              UUID PRIMARY KEY,
  metric_name     VARCHAR(255),
  tags            JSONB,
  value           DOUBLE PRECISION,
  expected_value  DOUBLE PRECISION,
  z_score         DOUBLE PRECISION,
  detected_at     TIMESTAMPTZ,
  status          VARCHAR(20)   -- open | acknowledged | resolved
)

Alert Routing

When an anomaly is recorded, the alert routing layer:

  1. Matches the anomaly against configured alert rules (metric name + tag filters).
  2. Checks for active suppression windows (maintenance, deployments).
  3. Sends a notification via PagerDuty or Slack including the metric name, observed value, expected value, and a chart URL showing the surrounding time window.

AlertRule (
  id            UUID PRIMARY KEY,
  metric_name   VARCHAR(255),
  tag_filters   JSONB,
  channel       VARCHAR(50),   -- pagerduty | slack
  destination   VARCHAR(255),
  enabled       BOOLEAN
)
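Step 1, matching an anomaly against a rule, can be sketched as a subset match on tags. This assumes rows are represented as dicts with the field names from the schemas above (a simplification of whatever ORM or row type the service actually uses):

```python
def matches_rule(anomaly: dict, rule: dict) -> bool:
    """A rule matches when it is enabled, metric names are equal, and every
    tag_filters key/value pair is present in the anomaly's tags."""
    if not rule["enabled"] or rule["metric_name"] != anomaly["metric_name"]:
        return False
    # Subset match: the anomaly may carry extra tags the rule doesn't mention.
    return all(anomaly["tags"].get(k) == v for k, v in rule["tag_filters"].items())
```

Subset matching lets one rule cover many tag combinations: a rule filtering only on `region=us` matches anomalies that also carry `host` or `service` tags.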

Suppression Windows

Maintenance windows mute alerts for specified metrics and time ranges, preventing alert storms during planned outages or deployments.

SuppressionWindow (
  id          UUID PRIMARY KEY,
  metric_name VARCHAR(255),  -- null = all metrics
  starts_at   TIMESTAMPTZ,
  ends_at     TIMESTAMPTZ,
  reason      TEXT
)
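The suppression check reduces to a containment test over active windows. A sketch, again assuming dict-shaped rows with the column names above:

```python
from datetime import datetime

def is_suppressed(metric_name: str, at: datetime, windows: list) -> bool:
    """A window mutes the alert if its time range contains `at` and its
    metric_name is None (all metrics) or equals the anomaly's metric."""
    return any(
        (w["metric_name"] is None or w["metric_name"] == metric_name)
        and w["starts_at"] <= at < w["ends_at"]
        for w in windows
    )
```

The half-open interval (`starts_at <= at < ends_at`) keeps back-to-back windows from double-covering their shared boundary instant.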

Key Design Considerations

  • Baselines are recomputed incrementally using exponential moving averages rather than full recalculation.
  • New metrics with fewer than 7 days of data are excluded from alerting to avoid cold-start false positives.
  • Anomaly status transitions (open → acknowledged → resolved) are append-only for audit trail purposes.
  • Chart URLs embed the metric name, tags, and time window so on-call engineers see immediate context without additional queries.
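The first point, incremental baseline maintenance, can be sketched with an exponentially weighted update of mean and variance (a standard EWMA recurrence; the smoothing factor `alpha` is an assumed tuning parameter, not a value from this design):

```python
def ema_update(mean: float, variance: float, value: float,
               alpha: float = 0.05) -> tuple:
    """Fold one new observation into the running baseline mean and variance
    without recomputing over the full 30-day window."""
    delta = value - mean
    mean += alpha * delta
    # Exponentially weighted variance update using the pre-update delta.
    variance = (1 - alpha) * (variance + alpha * delta * delta)
    return mean, variance
```

Each new data point costs O(1), and the baseline row for the matching hour bucket can be updated in place; older observations decay geometrically instead of being dropped at a hard 30-day cutoff.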


