Low Level Design: Metrics Monitoring System

Metric Types

Prometheus defines four core metric types. Counter: a monotonically increasing value that never decreases except when it resets to zero on process restart. Use rate() to compute per-second throughput from a counter. Suitable for request counts, error counts, bytes sent. Gauge: a value that can go up or down arbitrarily. Use for current CPU utilization, memory usage, queue depth, active connections. Histogram: records observations into pre-defined buckets, also tracking count and sum. Use histogram_quantile() to compute percentiles (p50, p95, p99) from histogram data—this is the correct way to measure latency distributions in distributed systems where averaging is misleading. Summary: computes quantiles client-side; less flexible than histograms because quantiles cannot be aggregated across instances.
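
A minimal sketch of the histogram mechanics described above: observations fall into cumulative buckets, and a quantile is estimated by linear interpolation inside the target bucket, the same idea histogram_quantile() applies to the _bucket series. The bucket bounds here are illustrative, not a recommendation.

```python
import bisect

# Illustrative latency bucket bounds in seconds (an assumption for this sketch;
# real services pick bounds to match their expected latency range).
BOUNDS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5]

class Histogram:
    def __init__(self, bounds):
        self.bounds = bounds
        self.counts = [0] * (len(bounds) + 1)  # last slot acts as the +Inf bucket
        self.total = 0
        self.sum = 0.0

    def observe(self, value):
        # First bucket whose upper bound is >= value ("le" semantics).
        i = bisect.bisect_left(self.bounds, value)
        self.counts[i] += 1
        self.total += 1
        self.sum += value

    def quantile(self, q):
        # Walk the cumulative counts; interpolate linearly within the bucket
        # that contains the target rank, as histogram_quantile() does.
        rank = q * self.total
        cumulative = 0
        for i, count in enumerate(self.counts):
            prev = cumulative
            cumulative += count
            if cumulative >= rank:
                if i == len(self.bounds):   # landed in the +Inf bucket
                    return self.bounds[-1]
                lo = self.bounds[i - 1] if i > 0 else 0.0
                hi = self.bounds[i]
                if count == 0:
                    return hi
                return lo + (hi - lo) * (rank - prev) / count
        return self.bounds[-1]
```

Because the buckets are cumulative counters, they can be summed across instances before the quantile is computed—this is exactly the aggregation that summaries cannot do.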

Pull vs Push Model

Prometheus uses a pull model: it scrapes an HTTP /metrics endpoint exposed by each service at a configurable interval (15 seconds is a common choice). The endpoint returns metrics in Prometheus text format. This model has a key operational advantage: if Prometheus cannot scrape a target, the target is definitively down—there’s no ambiguity about whether silence means health or network failure. Service discovery (Kubernetes API, Consul, or file-based SD) provides Prometheus with the list of targets to scrape dynamically. The alternative is a push model used by StatsD and Graphite: services emit metrics to a central aggregator. Push works well for short-lived jobs (use Prometheus Pushgateway) and environments where services can’t expose HTTP endpoints, but introduces the problem of distinguishing "service is healthy and silent" from "service is down."
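
The text format a /metrics endpoint returns is line-oriented and simple. A toy renderer, as a sketch only: the metric names are hypothetical, and the real exposition format also includes # HELP and # TYPE comment lines, which are omitted here.

```python
# Render (name, labels, value) tuples in the shape of the Prometheus text
# exposition format. Simplified: no HELP/TYPE metadata, no escaping of
# special characters in label values, no optional timestamps.
def render_metrics(metrics):
    """metrics: list of (name, labels_dict, value) tuples."""
    lines = []
    for name, labels, value in metrics:
        if labels:
            # Label order is not significant; sorting keeps output stable.
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"
```

A scrape is then just an HTTP GET that parses lines of this shape back into samples, which is why pull-based targets are so cheap to implement and debug with curl.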

Service Discovery

Static scrape configs don’t scale past a few dozen services. Prometheus integrates with dynamic service discovery backends. With Kubernetes SD, Prometheus queries the Kubernetes API server for pods, services, and endpoints matching a selector, and automatically adds or removes scrape targets as pods scale up and down. Each discovered target carries labels from Kubernetes metadata: job, instance, namespace, pod, container. Relabeling rules (in scrape_configs) transform and filter these labels before metrics are stored: drop targets in certain namespaces, rewrite instance labels to human-readable service names, add environment labels from pod annotations. Consul SD works similarly for non-Kubernetes environments. File-based SD provides a middle ground: a separate process writes JSON target files that Prometheus watches for changes.
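
The relabeling step can be sketched as a pipeline of rules applied to each discovered target's label set before scraping. This toy mimics two common relabel_configs actions (drop and replace) under simplified rule shapes; the label values used are hypothetical.

```python
import re

# Apply simplified relabel rules to a target's labels.
# Each rule joins its source_labels with ";" (as Prometheus does), matches
# the regex against the joined value, and either drops the target or
# rewrites target_label from the regex's capture groups.
def relabel(labels, rules):
    labels = dict(labels)
    for rule in rules:
        value = ";".join(labels.get(l, "") for l in rule["source_labels"])
        match = re.fullmatch(rule["regex"], value)
        if rule["action"] == "drop" and match:
            return None  # target filtered out entirely, never scraped
        if rule["action"] == "replace" and match:
            labels[rule["target_label"]] = match.expand(rule["replacement"])
    return labels

rules = [
    # Drop anything discovered in the kube-system namespace.
    {"action": "drop", "source_labels": ["namespace"], "regex": "kube-system"},
    # Rewrite instance to a readable "<namespace>/<pod>" name.
    {"action": "replace", "source_labels": ["namespace", "pod"],
     "regex": r"(.+);(.+)", "target_label": "instance",
     "replacement": r"\1/\2"},
]
```

The same mechanism filters and rewrites targets from Consul or file-based SD, since relabeling operates on label sets regardless of which backend discovered them.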

Time Series Storage

Prometheus stores time series in its embedded TSDB. Data is written to a write-ahead log (WAL) first for durability, then accumulated in memory for 2 hours before being flushed to an immutable on-disk block. Each block is a directory containing chunks (compressed samples), an index mapping label sets to series IDs, and metadata. The compaction process merges small blocks into larger ones over time, reducing disk I/O for range queries. Local TSDB retention is limited (default 15 days) and single-node; for multi-cluster federation and long-term retention, Thanos or Cortex extend Prometheus, as described under Long-Term Storage below.
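
The write path above can be modeled in a few lines: every sample hits the WAL first, accumulates in an in-memory head, and is flushed to an immutable block once the 2-hour window closes. This is a conceptual sketch only; the real TSDB stores compressed chunks plus an index, not raw tuples.

```python
TWO_HOURS = 2 * 60 * 60  # block window in seconds

class TSDB:
    def __init__(self):
        self.wal = []          # append-only log for crash recovery
        self.head = []         # in-memory samples awaiting flush
        self.blocks = []       # immutable flushed blocks
        self.head_start = None # timestamp opening the current window

    def append(self, series, ts, value):
        if self.head and ts - self.head_start >= TWO_HOURS:
            self.flush()                       # close out the full window
        if not self.head:
            self.head_start = ts
        self.wal.append((series, ts, value))   # durability first
        self.head.append((series, ts, value))

    def flush(self):
        # An immutable block; real blocks are directories of chunks + index.
        self.blocks.append(tuple(self.head))
        self.head = []
        self.wal = []                          # WAL truncated once data is in a block
```

Compaction would then merge several of these small blocks into one larger block with a combined index, which is why long range queries touch fewer files as data ages.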

PromQL Query Language

PromQL is a functional query language for time series data. Key patterns every engineer should know: rate(http_requests_total[5m]) computes the per-second request rate over a 5-minute sliding window from a counter—always use rate() on counters, never subtract raw values. histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) computes the p99 latency from a histogram. sum by (service) (rate(errors_total[5m])) aggregates error rate across all instances of each service. sum(rate(errors_total[5m])) / sum(rate(requests_total[5m])) computes overall error ratio for SLO calculation. Avoid high-cardinality label values in queries (e.g., user_id as a label)—they create millions of unique time series and degrade Prometheus performance significantly.
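
The reason to use rate() rather than subtracting raw counter values is reset handling. A sketch of the underlying arithmetic, simplified (real rate() also extrapolates to the edges of the range):

```python
# Compute a per-second rate from counter samples inside a window.
# A decrease between successive samples means the counter reset to zero
# (process restart), so the new value itself is the increase since the reset.
def rate(samples, window_seconds):
    """samples: list of (timestamp, counter_value) tuples in time order."""
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        if cur >= prev:
            increase += cur - prev
        else:
            increase += cur  # counter reset: counting restarted from zero
    return increase / window_seconds
```

Naively subtracting last minus first across a restart would produce a negative rate; reset detection is what makes counters safe to use on restarting processes.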

Alerting with AlertManager

Alert rules are defined in Prometheus configuration as PromQL expressions with a duration threshold (the for clause): the condition must hold continuously for the full duration before the alert fires, which avoids flapping on transient spikes. Prometheus evaluates all alert rules at the configured evaluation interval (commonly 15 seconds). When an alert fires, Prometheus sends it to AlertManager. AlertManager handles deduplication (same alert from multiple Prometheus instances fires once), grouping (related alerts batched into a single notification), inhibition (suppress downstream alerts when upstream service is down), and routing (different alert labels route to different receivers). Critical alerts go to PagerDuty and page the on-call engineer. Warning-level alerts go to a Slack channel for async review. Silence rules in AlertManager suppress known-good alert conditions during maintenance windows.
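
The pending-to-firing transition can be sketched as a small state machine over successive rule evaluations, assuming the inactive/pending/firing states Prometheus uses:

```python
# Evaluate an alert's "for" clause: the condition must be continuously true
# across evaluations for at least for_seconds before the alert fires.
# Any false evaluation resets the pending timer.
def alert_state(evaluations, for_seconds):
    """evaluations: list of (timestamp, condition_true) in time order."""
    pending_since = None
    state = "inactive"
    for ts, true_now in evaluations:
        if not true_now:
            pending_since = None
            state = "inactive"
        else:
            if pending_since is None:
                pending_since = ts  # condition just became true
            state = "firing" if ts - pending_since >= for_seconds else "pending"
    return state
```

A transient spike that clears before the duration elapses never leaves the pending state, so no notification is sent—this is exactly the anti-flapping behavior described above.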

Dashboard Design

Grafana is the standard dashboard layer. Two systematic approaches to dashboard layout: the USE method (Utilization, Saturation, Errors) applies to resource metrics—for each infrastructure resource (CPU, memory, disk, network) show current utilization percentage, saturation (queue depth or wait time indicating resource is overloaded), and error rate. The RED method (Rate, Errors, Duration) applies to service metrics—for each service show request rate, error rate, and latency distribution (p50/p95/p99). Every service should have a RED dashboard as a minimum. SLI/SLO tracking panels show the error budget: remaining budget as a percentage, burn rate over the last hour and 6 hours (multi-window alerting), and a projection of when the budget will be exhausted at the current burn rate. Dashboards should link to runbooks and relevant log queries for each panel.
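
The error-budget arithmetic behind those panels is simple enough to state exactly. A sketch assuming a 99.9% availability SLO over a 30-day window (both numbers are illustrative):

```python
# Burn rate: how fast the error budget is being consumed relative to a
# steady spend that exhausts it exactly at the end of the SLO window.
# Rate 1.0 = budget lasts the full window; higher rates exhaust it sooner.
def burn_rate(error_ratio, slo_target):
    budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return error_ratio / budget

# At the current burn rate, hours until the budget is fully spent.
def hours_to_exhaustion(rate, window_days=30):
    if rate <= 0:
        return float("inf")
    return window_days * 24 / rate
```

A burn rate of 14.4 on a 30-day window exhausts the budget in about 50 hours, which is why multi-window alerting commonly pages at that threshold while slower burns only warn.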

Long-Term Storage

Prometheus local storage is unsuitable for long-term retention and cross-cluster queries. Thanos solves both: the Thanos sidecar runs in the same pod as Prometheus and uploads completed 2-hour TSDB blocks to object storage as they’re created. The Thanos Store Gateway exposes blocks from object storage via the gRPC Store API. The Thanos Querier accepts PromQL queries and fans them out to all Store API endpoints (live Prometheus instances + Store Gateways), merges results, and deduplicates data from replicated Prometheus instances using the replica label. This gives you a single query endpoint with a global view of all clusters and effectively unlimited retention at object storage cost. Thanos Compactor runs against object storage to downsample old data (5-minute resolution after 30 days, 1-hour resolution after 1 year) to keep long-range queries fast.
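
The replica-label deduplication step can be sketched as grouping series that differ only in the replica label and keeping one series per group. This toy keeps the series with the most samples; real Thanos performs penalty-based merging at the sample level, so treat this as a simplification.

```python
# Deduplicate series from HA Prometheus replicas.
# Series whose label sets differ only in replica_label are the same logical
# series scraped twice; strip the label, group, and keep one per group.
def deduplicate(series_list, replica_label="replica"):
    groups = {}
    for labels, samples in series_list:
        key = tuple(sorted((k, v) for k, v in labels.items()
                           if k != replica_label))
        best = groups.get(key)
        if best is None or len(samples) > len(best[1]):
            stripped = {k: v for k, v in labels.items() if k != replica_label}
            groups[key] = (stripped, samples)
    return list(groups.values())
```

Preferring the more complete replica means a gap on one Prometheus instance (e.g. during a restart) is papered over by its pair, which is the operational point of running replicas at all.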
