Design a Monitoring and Alerting System

Design a monitoring and alerting system like Datadog, Prometheus + Grafana, or New Relic. This is a system design problem that tests your understanding of observability — the three pillars of metrics, logs, and traces — and the infrastructure required to collect, store, and act on telemetry data at scale.

Requirements Clarification

  • Data types: Metrics (numeric time series: CPU %, request latency p99, error rate), logs (structured event records), distributed traces (request spans across services).
  • Scale: 10,000 services, each emitting 1,000 metrics every 10 seconds → 1M metric writes/sec. 100TB of logs per day. 1B trace spans per day.
  • Alerting: Evaluate alert conditions continuously. Page on-call engineers via PagerDuty/OpsGenie within 30 seconds of a threshold breach.
  • Retention: High-resolution data (10s intervals) for 15 days. Downsampled data (1m, 5m, 1h) for 1 year. Logs: 30 days searchable, 1 year archived.
  • Dashboards: Query and visualize metrics with sub-second response for interactive exploration.
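The scale numbers above are worth sanity-checking. A back-of-envelope sketch in Python, assuming ~1.5 bytes per sample after TSDB compression (Gorilla-style delta-delta + XOR encoding averages roughly 1.37 bytes/sample in published results; 1.5 is a round, conservative figure):

```python
# Back-of-envelope capacity math for the stated requirements.
services = 10_000
metrics_per_service = 1_000
scrape_interval_s = 10

writes_per_sec = services * metrics_per_service / scrape_interval_s
print(f"metric writes/sec: {writes_per_sec:,.0f}")   # 1,000,000

bytes_per_sample = 1.5        # assumed post-compression average
seconds_per_day = 86_400
tb_per_day = writes_per_sec * seconds_per_day * bytes_per_sample / 1e12
print(f"compressed metric storage/day: {tb_per_day:.2f} TB")
print(f"15-day hot storage: {tb_per_day * 15:.1f} TB")
```

The takeaway: metrics are cheap relative to logs. Even at 1M writes/sec, compressed metric storage is on the order of ~2 TB for the 15-day hot window, versus 100 TB of logs per day.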

The Three Pillars of Observability

Metrics: Numeric values measured at regular intervals. Cheap to store (just a float + timestamp), fast to query aggregates, but low cardinality — you can’t answer “which specific user_id had high latency?”

Logs: Event records with arbitrary structure. High cardinality — every event is unique. Expensive to store and search at scale. Answers “what happened exactly” but requires searching unstructured data.

Traces: A tree of spans representing a request flowing through multiple services. Shows which service added latency, which database call was slow. Critical for microservice debugging. The OpenTelemetry project standardizes all three formats.
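To make the span tree concrete, here is a toy span model (illustrative only, not the OpenTelemetry SDK's API). Each span records its parent, which lets us rebuild the request tree and attribute latency to the service that actually spent it — "self time" is a span's duration minus the time covered by its direct children:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    service: str
    start_ms: float
    end_ms: float

# A toy request: gateway -> orders service -> database call.
spans = [
    Span("a", None, "api-gateway", 0, 120),
    Span("b", "a",  "orders",      5, 110),
    Span("c", "b",  "postgres",   20,  95),
]

def self_time(span, all_spans):
    """Span duration minus time spent in its direct children."""
    children = [s for s in all_spans if s.parent_id == span.span_id]
    child_time = sum(c.end_ms - c.start_ms for c in children)
    return (span.end_ms - span.start_ms) - child_time

for s in spans:
    print(s.service, self_time(s, spans))
# api-gateway spent 15ms itself, orders 30ms, postgres 75ms.
```

This is exactly the question tracing answers that metrics cannot: of the 120ms total, which service was responsible for which share.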

Architecture

Services
  ├─ [Metrics] StatsD/Prometheus client → Collectors → TSDB (Prometheus/VictoriaMetrics)
  ├─ [Logs]    Log agent (Fluent Bit) → Kafka → Log indexer (Elasticsearch/Loki)
  └─ [Traces]  OpenTelemetry SDK → Collector → Jaeger/Tempo

TSDB ─┬─→ Alert Manager → Alert Router → PagerDuty / Slack / OpsGenie
      └─→ Query Engine  → Grafana dashboards

Metrics Collection: Push vs Pull

Pull model (Prometheus): The monitoring server periodically scrapes an HTTP endpoint (/metrics) exposed by each service. Simple, no agent required. The monitoring server knows exactly which targets exist and their health. Scales to tens of thousands of targets with federation (regional Prometheus instances → global aggregator).

Push model (StatsD, InfluxDB line protocol): Services send metrics to a collection endpoint. Services behind NAT or with ephemeral IPs (batch jobs, serverless functions) can’t be scraped — push is the only option here. Also useful when services can’t expose HTTP endpoints.

Production systems use both: pull for long-running services, push for batch jobs and short-lived functions. Prometheus supports the push path via the Pushgateway.
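A push-model client is small enough to sketch in full. This one speaks the StatsD line protocol over UDP — fire-and-forget, lossy by design, which is acceptable for metrics. The host and port are assumptions; point them at whatever StatsD-compatible agent you run:

```python
import socket

class StatsdClient:
    """Minimal push-model metrics client (StatsD line protocol)."""

    def __init__(self, host="127.0.0.1", port=8125):
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def incr(self, name, value=1):
        self._send(f"{name}:{value}|c")    # counter

    def timing(self, name, ms):
        self._send(f"{name}:{ms}|ms")      # timer, in milliseconds

    def gauge(self, name, value):
        self._send(f"{name}:{value}|g")    # point-in-time gauge

    def _send(self, line):
        # UDP never blocks the application on a slow or dead collector.
        self.sock.sendto(line.encode("ascii"), self.addr)

client = StatsdClient()
client.incr("http.requests")
client.timing("http.request_duration", 42)
```

The UDP choice is the design decision worth calling out in an interview: the application must never stall or crash because monitoring is down.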

# Prometheus /metrics endpoint format (OpenMetrics)
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 10234
http_requests_total{method="POST",status="500"} 12
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.1"} 8932
http_request_duration_seconds_bucket{le="0.5"} 10100
http_request_duration_seconds_bucket{le="1.0"} 10230
http_request_duration_seconds_bucket{le="+Inf"} 10234
http_request_duration_seconds_sum 1547.3
http_request_duration_seconds_count 10234
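Histogram buckets are cumulative, and that is what makes server-side percentile estimation possible. A simplified sketch of the interpolation PromQL's `histogram_quantile()` performs (the real function operates on rates and handles edge cases this sketch skips), using the example buckets above plus the `le="+Inf"` bucket, which always equals `_count`:

```python
def histogram_quantile(q, buckets):
    """Estimate quantile q from sorted (upper_bound, cumulative_count) pairs."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Linear interpolation inside the bucket containing the rank.
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

buckets = [(0.1, 8932), (0.5, 10100), (1.0, 10230), (float("inf"), 10234)]
print(histogram_quantile(0.99, buckets))   # ~0.62 seconds
```

Note the accuracy limit: the answer is only as precise as the bucket boundaries, which is why bucket layout is a configuration decision, not an afterthought.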

Time-Series Database (TSDB)

Metrics are stored in a purpose-built TSDB. Requirements: high write throughput (1M writes/sec), efficient range queries (give me CPU % for all hosts from 2h ago to now), and compression (time series data compresses 10–100× with delta-delta + XOR encoding).
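The compression claim comes from the structure of the data: scrape timestamps arrive at nearly fixed intervals, so the delta-of-delta is almost always zero and can be stored in about one bit. A simplified sketch (real TSDBs, following Facebook's Gorilla paper, use variable-width bit encodings, and apply XOR encoding to the values the same way):

```python
def delta_of_delta(timestamps):
    """Second-order differences of a timestamp sequence."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]

# Scrapes every 10s, with one sample arriving 1s late:
ts = [1000, 1010, 1020, 1031, 1041, 1051]
print(delta_of_delta(ts))   # [0, 1, -1, 0] — tiny values, near-free to store
```

Raw 64-bit timestamps would cost 8 bytes each; the delta-of-delta stream is dominated by zeros, which is where the 10–100× compression comes from.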

Prometheus TSDB: Local, single-node, append-only block storage. Handles tens of millions of series. Not horizontally scalable — use Thanos or Cortex for distributed scale.

VictoriaMetrics: Drop-in Prometheus replacement, 3–10× more storage efficient, natively distributed. Used at large scale (Cloudflare, Adidas).

InfluxDB / TimescaleDB: Good for mixed workloads (metrics + events), SQL-like query language. TimescaleDB is PostgreSQL with time-series extensions — useful when you want ACID for other data alongside metrics.

Downsampling: high-resolution data is rolled up to 1-minute averages after 15 days, to 5-minute averages after 90 days, and to 1-hour averages for long-term retention. Recording rules in Prometheus precompute expensive aggregations on a schedule so dashboard queries are fast.
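A downsampling pass is conceptually just a windowed group-by. A minimal sketch rolling 10-second samples up to 1-minute averages (real downsamplers such as the Thanos compactor also retain count/sum/min/max per window so that rate() and min/max queries remain answerable after rollup):

```python
from collections import defaultdict

def downsample(samples, window_s=60):
    """samples: list of (unix_ts, value) -> sorted list of (window_start, avg)."""
    windows = defaultdict(list)
    for ts, value in samples:
        windows[ts - ts % window_s].append(value)
    return sorted((w, sum(v) / len(v)) for w, v in windows.items())

# Two minutes of 10-second samples with a linearly rising value:
samples = [(t, float(t)) for t in range(0, 120, 10)]
print(downsample(samples))   # [(0, 25.0), (60, 85.0)]
```

Storing only the average is lossy — spikes shorter than the window disappear — which is why the retention policy keeps full resolution for the recent window where incident debugging happens.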

Log Pipeline

Application → Log Agent (Fluent Bit / Filebeat)
                → Kafka (buffer against indexer backpressure)
                   → Log Indexer (Elasticsearch / OpenSearch / Grafana Loki)
                      → Object Storage (S3 for long-term archive)

Kafka decouples collection from indexing. If Elasticsearch is overwhelmed, logs buffer in Kafka (with retention set to 24h) rather than being dropped. This is the same pattern used in Message Queue architectures for backpressure handling.

Elasticsearch vs Loki: Elasticsearch indexes every field — flexible search, high storage cost (2–5× raw log size). Grafana Loki only indexes labels (service, environment, level) and stores raw log lines compressed — 10× cheaper but you can only filter by label, not arbitrary field values. Loki is the right default for high-volume logs when you don’t need full-text search.

Alert Design

Alerts are rules evaluated continuously against incoming metrics; a rule fires when its condition holds for the configured duration. Three common types:

  • Threshold alerts: error_rate > 1% for 5 minutes. Simple, predictable. Suffers from alert fatigue if threshold isn’t tuned.
  • Anomaly detection: Alert when value deviates from predicted range (based on historical seasonality). Catches problems that have no fixed threshold — a 5% error rate might be fine at 3am but critical at noon. Requires ML model (SARIMA, Prophet, or simpler rolling z-score).
  • Composite alerts: error_rate > 0.5% AND latency_p99 > 2s. Reduces false positives by requiring multiple signals.
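The "simpler rolling z-score" approach mentioned above fits in a few lines: flag a point as anomalous when it sits more than k standard deviations from the trailing-window mean. A sketch (no seasonality handling — SARIMA/Prophet exist precisely to model the daily and weekly cycles this ignores):

```python
from collections import deque
from statistics import mean, stdev

class RollingZScore:
    def __init__(self, window=30, threshold=3.0):
        self.values = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, x):
        """Return True if x is anomalous relative to the trailing window."""
        anomalous = False
        if len(self.values) >= 10:          # need some history before judging
            mu, sigma = mean(self.values), stdev(self.values)
            if sigma > 0 and abs(x - mu) / sigma > self.threshold:
                anomalous = True
        self.values.append(x)
        return anomalous

detector = RollingZScore()
stream = [0.01] * 20 + [0.012, 0.45]   # error rate spikes on the last sample
flags = [detector.observe(v) for v in stream]
print(flags[-1])   # True — 0.45 sits far outside the trailing distribution
```

Note the failure mode baked in here: a slowly degrading metric drags the window mean along with it and never trips the detector, which is one reason threshold alerts still exist alongside anomaly detection.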

# Prometheus alert rule (YAML)
groups:
  - name: api_alerts
    rules:
      - alert: HighErrorRate
        expr: |
          rate(http_requests_total{status=~"5.."}[5m])
          / rate(http_requests_total[5m]) > 0.01
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate {{ $value | humanizePercentage }} on {{ $labels.service }}"
          runbook_url: "https://wiki.internal/runbooks/high-error-rate"

Alert Routing and On-Call

Alerts route through a notification pipeline:

  1. Deduplication: Multiple Prometheus instances may fire the same alert. Deduplicate before paging.
  2. Grouping: Cluster related alerts into one notification (20 hosts with high CPU → one page, not 20).
  3. Routing: Route by team (payments team owns payment service alerts), severity (warning → Slack, critical → PagerDuty), and time of day (business hours → email, night → page).
  4. Silencing: During planned maintenance, silence alerts to prevent false pages.

Prometheus Alertmanager handles dedup, grouping, and routing. PagerDuty/OpsGenie handle on-call schedules, escalation policies, and acknowledgment workflows.
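Steps 1 and 2 of the pipeline can be sketched in a few lines. This mimics Alertmanager's behavior — identical alerts (same full label set) from replica Prometheus instances collapse to one, and survivors are grouped by a key into a single notification — but the label names and grouping key here are illustrative, not Alertmanager's config schema:

```python
from collections import defaultdict

def dedupe_and_group(alerts, group_by=("alertname", "team")):
    # Dedup: the full, order-independent label set identifies an alert.
    unique = {tuple(sorted(a.items())): a for a in alerts}
    # Group: one notification per (alertname, team) combination.
    groups = defaultdict(list)
    for alert in unique.values():
        key = tuple(alert.get(k, "") for k in group_by)
        groups[key].append(alert)
    return groups

alerts = [
    {"alertname": "HighCPU", "team": "infra", "host": "web-1"},
    {"alertname": "HighCPU", "team": "infra", "host": "web-1"},  # replica dup
    {"alertname": "HighCPU", "team": "infra", "host": "web-2"},
]
groups = dedupe_and_group(alerts)
print({k: len(v) for k, v in groups.items()})
# One page for the infra team covering both hosts, not three pages.
```
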

Scale Considerations

  • Cardinality explosion: Each unique combination of labels creates a separate series. A label like user_id with 10M values creates 10M series — the TSDB’s worst nightmare. Never use high-cardinality values (user IDs, request IDs, URLs) as metric labels. Use logs for high-cardinality data.
  • Query federation: For global dashboards, a “meta-Prometheus” federates from regional instances. Thanos and Cortex provide a globally available query layer backed by object storage for long-term retention.
  • Hot path latency: Alert evaluation must be fast. Prometheus evaluates rules every 15–60 seconds. Recording rules precompute expensive aggregations to keep evaluation cheap.
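The cardinality arithmetic is worth showing explicitly: for labels that vary independently, active series count is the product of per-label value counts. The numbers below are illustrative:

```python
from math import prod

def series_count(label_cardinalities):
    """Worst-case active series for independently varying labels."""
    return prod(label_cardinalities.values())

safe = {"method": 5, "status": 10, "service": 200}
print(series_count(safe))               # 10,000 series — fine

dangerous = {**safe, "user_id": 10_000_000}
print(f"{series_count(dangerous):,}")   # 100 trillion-scale blowup territory
```

One high-cardinality label multiplies everything else, which is why the rule is absolute: user IDs, request IDs, and raw URLs go in logs or traces, never in metric labels.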

Interview Follow-ups

  • How do you monitor the monitoring system itself? (Recursive: a separate minimal monitoring stack watches the primary one.)
  • How do you handle alert fatigue — too many alerts causing engineers to ignore them?
  • Design the runbook automation layer: alert fires, triggers a Lambda that attempts auto-remediation before paging a human.
  • How does distributed tracing work for a request that spans 15 microservices? What’s the overhead per span?
  • How would you build a cost attribution system — which team owns the infrastructure costs driving each metric spike?

Related System Design Topics

  • Message Queues — Kafka decouples log collection from indexing; the same backpressure patterns apply to metrics pipelines
  • Database Sharding — sharding TSDB storage by metric name + time range for parallel ingest and query
  • Design a Distributed Key-Value Store — Redis as the short-term alert state store and deduplication cache
  • Load Balancing — distributing Prometheus scrape load and dashboard query traffic
  • Caching Strategies — recording rules precompute expensive aggregations; identical to materialized view caching

See also: How to Detect Model Drift in Production — apply the same Prometheus/TSDB monitoring infrastructure to ML model health: PSI, KS tests, and prediction distribution alerts.

See also: MLOps Interview Questions — ML model monitoring is built on the same infrastructure; model metrics (AUC, prediction PSI) are time-series data pushed to Prometheus.
