Low-Level Design: Anomaly Detection Service

What Is an Anomaly Detection Service?

An anomaly detection service monitors time-series metrics and identifies values that deviate significantly from expected behavior, using statistical baselines, Z-score and IQR methods, and seasonal decomposition to distinguish real anomalies from noise.

Anomaly Types

  • Point anomaly: A single data point is far from the rest of the distribution.
  • Contextual anomaly: A value is normal globally but anomalous for its specific context (e.g., time of day, region).
  • Collective anomaly: A sequence of data points is anomalous as a group, even if individual points appear normal.

Baseline Computation

Compute rolling 30-day mean and standard deviation per metric+tags combination, bucketed by hour-of-week (168 buckets, one per hour across Monday-Sunday):

MetricBaseline (
  metric_name   VARCHAR(255),
  tags_hash     VARCHAR(64),    -- hash of sorted tag key=value pairs
  hour_bucket   INT,            -- 0-167 (day*24 + hour)
  mean          DOUBLE PRECISION,
  stddev        DOUBLE PRECISION,
  q1            DOUBLE PRECISION,
  q3            DOUBLE PRECISION,
  sample_count  INT,
  updated_at    TIMESTAMPTZ,
  PRIMARY KEY (metric_name, tags_hash, hour_bucket)
)
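The bucketing and per-bucket statistics above can be sketched as follows. This is a minimal illustration, not the service's actual code: the function names `hour_bucket` and `compute_baselines` are hypothetical, timestamps are assumed to already be in the timezone used for bucketing, and the sample standard deviation is an assumed choice.

```python
from datetime import datetime
from statistics import mean, stdev, quantiles

def hour_bucket(ts: datetime) -> int:
    """Map a timestamp to 0-167, with Monday 00:00 as bucket 0."""
    # datetime.weekday() returns 0 for Monday through 6 for Sunday.
    return ts.weekday() * 24 + ts.hour

def compute_baselines(samples):
    """Summarise (timestamp, value) samples per hour-of-week bucket.

    Assumption: `samples` already covers only the trailing 30 days.
    """
    by_bucket = {}
    for ts, value in samples:
        by_bucket.setdefault(hour_bucket(ts), []).append(value)

    baselines = {}
    for bucket, values in by_bucket.items():
        if len(values) < 2:
            continue  # stddev and quartiles need at least two points
        q1, _median, q3 = quantiles(values, n=4)
        baselines[bucket] = {
            "mean": mean(values),
            "stddev": stdev(values),  # sample stddev (assumed choice)
            "q1": q1,
            "q3": q3,
            "sample_count": len(values),
        }
    return baselines
```

Each resulting dictionary maps directly onto one MetricBaseline row for a given metric_name and tags_hash.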

Z-Score Detection

Compare the current value against the hour-of-week baseline:

z_score = |current_value - mean| / stddev

if z_score > 3.0:
    flag as anomaly

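A threshold of 3.0 flags values more than three standard deviations from the mean. For normally distributed data, about 99.7% of values fall within that range, so only about 0.3% of normal points are flagged by chance.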
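A minimal sketch of the Z-score check, assuming the baseline mean and stddev have already been loaded for the current hour bucket. The function names are hypothetical, and returning `None` for a zero standard deviation is an assumed way to handle constant metrics, which the formula itself leaves undefined.

```python
def z_score(value, mean, stddev):
    """Return |value - mean| / stddev, or None when stddev is not positive."""
    if stddev <= 0:
        return None  # constant baseline: the ratio is undefined
    return abs(value - mean) / stddev

def is_zscore_anomaly(value, mean, stddev, threshold=3.0):
    """Flag the value when its Z-score exceeds the threshold."""
    z = z_score(value, mean, stddev)
    return z is not None and z > threshold
```

With a baseline of mean 100 and stddev 15, a value of 160 has a Z-score of 4.0 and is flagged, while 130 (Z-score 2.0) is not.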

IQR Detection

Use interquartile range for distributions that are not normally distributed:

IQR = Q3 - Q1

lower_fence = Q1 - 1.5 * IQR
upper_fence = Q3 + 1.5 * IQR

if value < lower_fence or value > upper_fence:
    flag as anomaly
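The fence calculation can be sketched as below, assuming q1 and q3 come from the stored baseline. The function names and the configurable multiplier `k` are illustrative, not part of the original design.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return (lower_fence, upper_fence) for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def is_iqr_anomaly(value, q1, q3, k=1.5):
    """Flag values that fall outside either fence."""
    lower, upper = iqr_fences(q1, q3, k)
    return value < lower or value > upper
```

For example, with Q1 = 10 and Q3 = 20 the IQR is 10, so the fences are -5 and 35, and a value of 40 is flagged.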

Seasonal Decomposition (STL)

For metrics with strong weekly or daily seasonality, apply STL (Seasonal and Trend decomposition using Loess):

observed = trend + seasonality + residual

-- Detect anomalies only in the residual component
residual_z = |residual| / stddev(residual)
if residual_z > 3.0:
    flag as anomaly

This prevents seasonal spikes (e.g., Monday morning traffic) from being misclassified as anomalies.
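The idea of scoring only the residual can be illustrated with a much simpler decomposition than STL. The sketch below is not STL: it assumes a flat trend (the overall mean) and estimates seasonality as the per-phase mean instead of using Loess smoothing. The function names are hypothetical. A production system would more likely use a library implementation of STL.

```python
from statistics import fmean, pstdev

def seasonal_residuals(observed, period):
    """Split observed = trend + seasonality + residual; return the residuals.

    Simplifying assumptions: the trend is flat (overall mean) and the
    seasonal component is the mean of each phase within the period.
    """
    if len(observed) < period:
        raise ValueError("need at least one full period of data")
    trend = fmean(observed)
    detrended = [x - trend for x in observed]
    seasonal = [fmean(detrended[p::period]) for p in range(period)]
    return [d - seasonal[i % period] for i, d in enumerate(detrended)]

def residual_anomalies(observed, period, threshold=3.0):
    """Return indices whose residual Z-score exceeds the threshold."""
    residuals = seasonal_residuals(observed, period)
    spread = pstdev(residuals)
    if spread == 0:
        return []  # perfectly periodic series: nothing to flag
    return [i for i, r in enumerate(residuals) if abs(r) / spread > threshold]
```

In a series that repeats a fixed pattern, the regular peaks produce near-zero residuals, so only a point that breaks the pattern is flagged.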

Anomaly Table

Anomaly (
  id              UUID PRIMARY KEY,
  metric_name     VARCHAR(255),
  tags            JSONB,
  value           DOUBLE PRECISION,
  expected_value  DOUBLE PRECISION,
  z_score         DOUBLE PRECISION,
  detected_at     TIMESTAMPTZ,
  status          VARCHAR(20)   -- open | acknowledged | resolved
)

Alert Routing

When an anomaly is recorded, the alert routing layer:

  1. Matches the anomaly against configured alert rules (metric name + tag filters).
  2. Checks for active suppression windows (maintenance, deployments).
  3. Sends a notification via PagerDuty or Slack including metric name, observed value, expected value, and a link to a chart URL showing the surrounding window.

AlertRule (
  id            UUID PRIMARY KEY,
  metric_name   VARCHAR(255),
  tag_filters   JSONB,
  channel       VARCHAR(50),   -- pagerduty | slack
  destination   VARCHAR(255),
  enabled       BOOLEAN
)
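Step 1 of the routing flow, rule matching, might look like the sketch below. The source does not define how tag_filters are evaluated, so this assumes the simplest semantics: every key in the filter must be present in the anomaly's tags with an equal value. The `AlertRule` dataclass mirrors the table above (without the id column), and `matching_rules` is a hypothetical helper.

```python
from dataclasses import dataclass, field

@dataclass
class AlertRule:
    metric_name: str
    channel: str                 # "pagerduty" or "slack"
    destination: str
    tag_filters: dict = field(default_factory=dict)
    enabled: bool = True

def matching_rules(rules, metric_name, tags):
    """Return enabled rules whose metric and tag filters match the anomaly.

    Assumed semantics: a rule matches when each filter key=value pair
    also appears in the anomaly's tags; an empty filter matches any tags.
    """
    return [
        rule for rule in rules
        if rule.enabled
        and rule.metric_name == metric_name
        and all(tags.get(k) == v for k, v in rule.tag_filters.items())
    ]
```

The returned rules would then be checked against suppression windows before any notification is sent.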

Suppression Windows

Maintenance windows mute alerts for specified metrics and time ranges, preventing alert storms during planned outages or deployments.

SuppressionWindow (
  id          UUID PRIMARY KEY,
  metric_name VARCHAR(255),  -- null = all metrics
  starts_at   TIMESTAMPTZ,
  ends_at     TIMESTAMPTZ,
  reason      TEXT
)
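A suppression check against these windows could be sketched as follows. The `SuppressionWindow` dataclass mirrors the table above (without the id column), `is_suppressed` is a hypothetical helper, and treating the window as half-open (start inclusive, end exclusive) is an assumption the source does not specify.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class SuppressionWindow:
    metric_name: Optional[str]   # None = applies to all metrics
    starts_at: datetime
    ends_at: datetime
    reason: str = ""

def is_suppressed(windows, metric_name, at):
    """Return True if any window covers this metric at the given time.

    Assumption: windows are half-open, starts_at <= at < ends_at.
    """
    return any(
        (w.metric_name is None or w.metric_name == metric_name)
        and w.starts_at <= at < w.ends_at
        for w in windows
    )
```

An anomaly detected inside an active window can still be recorded in the Anomaly table; only the notification is skipped.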

Key Design Considerations

  • Baselines are updated incrementally with exponential moving averages, which approximate the 30-day rolling window without a full recalculation.
  • New metrics with fewer than 7 days of data are excluded from alerting to avoid cold-start false positives.
  • Anomaly status transitions (open → acknowledged → resolved) are append-only for audit trail purposes.
  • Chart URLs embed the metric name, tags, and time window so on-call engineers see immediate context without additional queries.
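The first two considerations can be sketched in a few lines. The update below is the standard exponentially weighted mean and variance recurrence; the smoothing factor `alpha` and the function names are assumptions, since the source does not give a value or an interface.

```python
from datetime import datetime, timedelta

def ema_update(mean, variance, value, alpha=0.05):
    """Fold one new observation into an exponentially weighted mean/variance.

    alpha is a hypothetical smoothing factor; smaller values give the
    baseline a longer effective memory.
    """
    delta = value - mean
    new_mean = mean + alpha * delta
    new_variance = (1 - alpha) * (variance + alpha * delta * delta)
    return new_mean, new_variance

def can_alert(first_seen, now, min_days=7):
    """Cold-start guard: alert only once a metric has min_days of history."""
    return now - first_seen >= timedelta(days=min_days)
```

The stored stddev is the square root of the updated variance, so each hour bucket needs only its previous mean and variance to absorb a new sample.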
