System Design Interview: Design a Metrics and Monitoring System (Prometheus)

What Is a Metrics and Monitoring System?

A metrics system collects numerical measurements from services (request rate, error rate, latency, CPU usage), stores them as time series, and enables querying, visualization, and alerting. Prometheus (pull-based) and Datadog (push-based) are the dominant systems. At scale: collecting millions of metrics per second from thousands of services, retaining months of history.

    Metrics Types

    • Counter: monotonically increasing value (total requests, total errors). Rate = derivative over time.
    • Gauge: current value, can go up or down (memory usage, active connections, queue depth)
    • Histogram: distribution of values in buckets (request latency in [<10ms, <50ms, <100ms, <500ms, +Inf] buckets)
    • Summary: pre-computed quantiles (p50, p99) on the client side
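
    The exposition format behind these types is plain text. A minimal sketch, hand-rolling the format for illustration (real services use an official client library such as prometheus_client, which also emits `_sum` and `_count` series for histograms):

```python
# Sketch of the Prometheus text exposition format for counters and
# histograms. Metric names and values here are illustrative.

def render_counter(name, labels, value):
    """Counters only ever increase; rate() turns them into per-second rates."""
    label_str = ",".join(f'{k}="{v}"' for k, v in labels.items())
    return f"{name}{{{label_str}}} {value}"

def render_histogram(name, bounds, counts):
    """Histogram buckets are cumulative: each 'le' bound counts all
    observations at or below it, ending with +Inf (= total count)."""
    lines = []
    cumulative = 0
    for bound, count in zip(bounds, counts):
        cumulative += count
        lines.append(f'{name}_bucket{{le="{bound}"}} {cumulative}')
    return lines

print(render_counter("http_requests_total",
                     {"method": "GET", "status": "200"}, 1027))
for line in render_histogram("http_request_duration_seconds",
                             ["0.01", "0.05", "0.1", "0.5", "+Inf"],
                             [820, 150, 40, 8, 2]):
    print(line)
```

    Note how the bucket counts are cumulative: the +Inf bucket always equals the total observation count, which is what makes server-side aggregation possible.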

    Pull vs. Push Architecture

    Pull (Prometheus): scraping model. The Prometheus server periodically fetches the /metrics endpoint of each service. Advantages: Prometheus controls the scrape rate, down services are easy to detect (missed scrapes), and debugging is simple (just curl /metrics). Disadvantage: it doesn't work for short-lived jobs (batch jobs die before Prometheus scrapes them); use the Pushgateway for those.

    Push (Datadog, StatsD): services send metrics to an agent or collector. Advantages: works for short-lived processes, and there is no need to configure the server with every service endpoint. Disadvantages: metric storms (services pushing too fast), and backpressure is harder to apply.
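
    The push model can be sketched with the StatsD wire format, where services fire-and-forget small UDP datagrams to a local agent (the host/port and metric names below are illustrative):

```python
# Minimal sketch of the StatsD push wire format.
import socket

def statsd_packet(name, value, metric_type, sample_rate=None):
    """Format one StatsD datagram: 'name:value|type[|@rate]'.
    Types: c = counter, g = gauge, ms = timing."""
    pkt = f"{name}:{value}|{metric_type}"
    if sample_rate is not None:
        pkt += f"|@{sample_rate}"
    return pkt

def push(sock, addr, name, value, metric_type):
    # UDP: no connection, no acknowledgement, no backpressure -- this is
    # exactly why push systems are vulnerable to metric storms.
    sock.sendto(statsd_packet(name, value, metric_type).encode(), addr)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
push(sock, ("127.0.0.1", 8125), "api.requests", 1, "c")      # count one request
push(sock, ("127.0.0.1", 8125), "api.queue_depth", 42, "g")  # gauge snapshot
```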

    Time Series Storage

    Each metric is identified by a metric_name plus a label set (key-value pairs). Example: http_requests_total{method="GET", path="/api/users", status="200"}. Time series = sequence of (timestamp, value) pairs for one such combination.

    Prometheus TSDB (time series database): stores data in 2-hour blocks; within a block, chunks of compressed time series. Uses delta-of-delta encoding for timestamps and XOR encoding for float values (Gorilla compression, from Facebook's 2015 paper), roughly 10x smaller than raw storage: ~1.37 bytes per sample on average vs. 16 bytes raw. Block compaction: a background process merges small blocks into larger ones and drops deleted data; downsampling for long-term retention is handled by external systems such as Thanos, not by Prometheus itself.
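
    A simplified cost model of the two Gorilla ideas, assuming fixed bit widths for the variable-length cases (the real encoder emits an actual bitstream with several width classes and reaches ~1.37 bytes/sample on production data):

```python
import struct

def timestamp_bits(timestamps):
    """Delta-of-delta cost model: regular 15s scrapes make most
    deltas-of-deltas zero, which Gorilla encodes in a single bit."""
    bits = 64                      # first timestamp stored raw
    prev_delta = None
    for prev, cur in zip(timestamps, timestamps[1:]):
        delta = cur - prev
        if prev_delta is None:
            bits += 14             # first delta: small fixed-width field
        elif delta == prev_delta:
            bits += 1              # delta-of-delta == 0: a single '0' bit
        else:
            bits += 9              # variable-length case (simplified width)
        prev_delta = delta
    return bits

def float_bits(f):
    """Reinterpret a float64 as its raw 64-bit pattern."""
    return struct.unpack(">Q", struct.pack(">d", f))[0]

def value_bits(values):
    """XOR cost model: near-identical float64 bit patterns XOR to mostly
    zeros, so only the short 'meaningful' middle needs storing."""
    bits = 64                      # first value stored raw
    prev = float_bits(values[0])
    for v in values[1:]:
        cur = float_bits(v)
        xor = cur ^ prev
        if xor == 0:
            bits += 1              # unchanged value: a single bit
        else:
            lead = 64 - xor.bit_length()            # leading zero bits
            trail = (xor & -xor).bit_length() - 1   # trailing zero bits
            bits += 13 + (64 - lead - trail)        # control + lengths + payload
        prev = cur
    return bits

# A regular 15s scrape of a slowly-moving gauge over one hour:
ts = list(range(0, 3600, 15))
vals = [0.45 + 0.001 * (i % 5) for i in range(len(ts))]
total_bits = timestamp_bits(ts) + value_bits(vals)
print(f"{total_bits / 8 / len(ts):.1f} bytes/sample vs 16 raw")
```

    Even this crude model shows why regular scrape intervals matter: almost every timestamp costs one bit instead of 64.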

    Querying: PromQL

    # Request rate (per second) over 5-minute window:
    rate(http_requests_total{status="200"}[5m])
    
    # Error ratio (sum first, so the differing status labels
    # on the two sides don't break vector matching):
    sum(rate(http_requests_total{status=~"5.."}[5m]))
    / sum(rate(http_requests_total[5m]))
    
    # 99th percentile latency:
    histogram_quantile(0.99,
      rate(http_request_duration_seconds_bucket[5m])
    )
    
    # CPU usage per pod:
    sum by (pod) (rate(container_cpu_usage_seconds_total[5m]))
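
    The interpolation behind histogram_quantile can be sketched as follows (simplified: the real function also aggregates across the le label and handles NaN and empty buckets):

```python
def histogram_quantile(q, buckets):
    """buckets: (upper_bound, cumulative_count) pairs, sorted by bound,
    ending with +Inf whose count is the total observation count."""
    total = buckets[-1][1]
    rank = q * total                    # target observation rank
    lower_bound, lower_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return lower_bound      # cannot interpolate into +Inf
            # assume observations are spread evenly within the bucket
            return lower_bound + (bound - lower_bound) * \
                (rank - lower_count) / (count - lower_count)
        lower_bound, lower_count = bound, count

# Latency buckets (illustrative): 99% of requests land below ~0.1s
buckets = [(0.01, 820), (0.05, 970), (0.1, 1010),
           (0.5, 1018), (float("inf"), 1020)]
print(histogram_quantile(0.99, buckets))  # ≈ 0.09975
```

    This is also why bucket boundaries matter: the linear interpolation can only be as precise as the bucket containing the target rank.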
    

    Alerting

    Alert rules are defined in Prometheus and evaluate PromQL expressions on a schedule. When the condition stays true for longer than the `for` duration, the alert fires. Alertmanager receives alert notifications, deduplicates them (the same alert from multiple Prometheus instances), groups related alerts, routes them to the appropriate receiver (PagerDuty, Slack, email), and silences them during maintenance windows.

    - alert: HighErrorRate
      expr: sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
      for: 5m
      annotations:
        summary: "Error rate above 1% for 5 minutes"
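
    The Alertmanager side can be sketched as a routing tree (receiver names and label values below are illustrative):

```yaml
# Sketch of an Alertmanager config showing grouping, routing, and inhibition.
route:
  group_by: [alertname, service]     # 100 pods alerting -> one notification
  group_wait: 30s                    # wait to batch alerts arriving together
  repeat_interval: 4h
  receiver: slack-default
  routes:
    - matchers: [severity="page"]
      receiver: pagerduty-oncall     # only high-severity alerts page
receivers:
  - name: pagerduty-oncall
  - name: slack-default
inhibit_rules:
  - source_matchers: [alertname="DatacenterDown"]
    target_matchers: [severity="warning"]
    equal: [datacenter]              # suppress warnings in a dead datacenter
```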
    

    Long-Term Storage

    Prometheus retains 15 days by default (limited by local disk). For months or years of history, use Thanos (a sidecar that ships Prometheus blocks to S3/GCS) or Cortex (which ingests samples via remote write); both give effectively unlimited retention in object storage. A global query layer allows querying across multiple Prometheus instances (multi-cluster view). Downsampling: store 5-minute aggregates for 1 month and 1-hour aggregates for 1 year; this dramatically reduces storage for historical data.
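
    Back-of-envelope arithmetic for why downsampling matters, assuming 1M active series and the ~1.37 bytes/sample figure from the TSDB section:

```python
# Storage cost of long-term retention at different resolutions.
SERIES = 1_000_000
BYTES_PER_SAMPLE = 1.37   # Gorilla-compressed average

def storage_gb(resolution_seconds, retention_days):
    samples_per_series = retention_days * 86_400 / resolution_seconds
    return SERIES * samples_per_series * BYTES_PER_SAMPLE / 1e9

raw_1y  = storage_gb(15, 365)     # raw 15s resolution for a year
down_1y = storage_gb(3600, 365)   # 1-hour aggregates for a year
print(f"raw: {raw_1y:.0f} GB, downsampled: {down_1y:.1f} GB")
```

    Going from 15-second to 1-hour resolution keeps 240x fewer samples, which is what makes multi-year retention in object storage affordable.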

    Interview Tips

    • Four golden signals (Google SRE): Latency, Traffic (requests/sec), Errors, Saturation (resource utilization).
    • Rate over range vector: rate() computes per-second average over the window. increase() computes total increase.
    • Cardinality explosion: each unique label combination is a separate time series. Avoid high-cardinality labels (user_id, request_id). At one sample per 15s, 1M series produce 86,400/15 × 1M ≈ 5.8 billion samples per day; even at ~1.37 bytes/sample that is roughly 8 GB/day, and index memory grows with the series count itself.
    • Histogram vs. Summary: histogram allows server-side aggregation (sum histograms across replicas); summary quantiles are client-side and cannot be aggregated.
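
    The cardinality point in concrete numbers: the series count is the product of each label's distinct-value count (the cardinalities below are illustrative):

```python
# Series count = product of distinct values per label.
def series_count(label_cardinalities):
    n = 1
    for c in label_cardinalities.values():
        n *= c
    return n

safe = series_count({"method": 5, "path": 50, "status": 10})
bad  = series_count({"method": 5, "path": 50, "status": 10,
                     "user_id": 100_000})
print(safe)  # 2500 series: fine
print(bad)   # 250000000 series: one label turned it into a quarter billion
```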

    Frequently Asked Questions

    What are the four golden signals and how do you monitor them?

    The four golden signals (Google SRE Book) are the most important metrics for any service. (1) Latency: time to serve a request. Measure percentiles (p50, p95, p99), not averages; a high p99 with a normal p50 means a subset of users has a very bad experience. PromQL: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])). (2) Traffic: demand on your system, in requests per second, messages per second, or transactions per second. PromQL: rate(http_requests_total[5m]). (3) Errors: rate of failed requests: 5xx responses, exceptions, failed Kafka consumer messages. PromQL: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])). (4) Saturation: how "full" the service is: CPU, memory, disk, queue depth; how close you are to capacity. PromQL: sum(container_cpu_usage_seconds_total) / sum(kube_node_status_capacity_cpu_cores). Alert on error rate > 1%, p99 latency > 500ms, saturation > 80%. The signals are ordered: latency and errors are user-facing (impact now), traffic tells you why, and saturation predicts future problems.

    How does Prometheus TSDB compress time series data efficiently?

    Prometheus TSDB uses Gorilla-style compression (from Facebook's 2015 paper), achieving ~1.37 bytes per sample vs. 16 bytes raw (timestamp + float64). Timestamp compression: samples arrive at regular intervals (e.g., every 15 seconds), so store the first timestamp explicitly and, for each subsequent sample, the delta from the previous timestamp (likely 15). The delta-of-delta is usually 0 for regular scrapes and is then encoded as a single bit; otherwise a small variable-length encoding is used. Value compression: XOR the current and previous float64. Consecutive measurements of a slowly changing gauge (CPU usage: 0.453, 0.451, 0.454) have nearly identical bit patterns, so the XOR is mostly zeros with a small significant portion; encode a leading-zero count prefix plus only the changed bits. Storage structure: samples are grouped into chunks of ~120 samples (30 minutes at a 15s interval); chunks are immutable, and multiple chunks form a 2-hour block, which compaction later merges into larger blocks.

    How do you design alerting to minimize alert fatigue?

    Alert fatigue occurs when on-call engineers receive too many alerts, many of them low-severity, flapping, or duplicates; engineers start ignoring or silencing alerts, and real incidents get missed. Principles for good alerting: (1) Alert on symptoms, not causes. "User-facing error rate > 1%" is actionable; "MySQL replication lag" is a cause, so alert on it only if it leads to user impact. (2) Use a `for` duration. A momentary spike shouldn't wake someone at 3am; `for: 5m` means the condition must hold continuously for 5 minutes before firing. (3) Severity levels: P1 (service down, pages immediately), P2 (high error rate, pages), P3 (warning, Slack notification); only P1 and P2 should page. (4) Deduplication and grouping: Alertmanager groups related alerts (same service, same time window) into one notification, so 100 pods all alerting on high memory become a single grouped alert. (5) Inhibition rules: when a datacenter is down (P1), suppress lower-severity alerts for services in that datacenter, since they are expected. (6) Dead man's switch: alert if no data arrives at all, because the monitoring system failing is as dangerous as the monitored system failing.
