Prometheus is the standard open-source monitoring system for cloud-native applications, used by thousands of organizations to collect and query metrics. Designing Prometheus itself — the internal TSDB, scraping architecture, PromQL query engine, and Alertmanager — tests your understanding of time-series storage, pull-based collection, and the federation/long-term storage patterns that make Prometheus scale. This guide covers the architecture for infrastructure engineering interviews.
Pull-Based Scraping Architecture
Prometheus pulls metrics from targets (applications, infrastructure) by scraping their /metrics HTTP endpoint at a configured interval (typically 15-30 seconds). Why pull over push: (1) The Prometheus server controls the scrape rate — targets cannot overwhelm it by pushing too fast. (2) If a target is down, the scrape fails immediately — Prometheus knows the target is unhealthy. With push, silence is ambiguous (is the target down, or does it just have nothing to report?). (3) Targets do not need to know where to push — they just expose an endpoint, and service discovery finds them automatically. Service discovery: Prometheus discovers scrape targets dynamically from the Kubernetes API (pods with specific annotations), the Consul service registry, DNS SRV records, EC2/GCE instance lists, and static configuration files. Relabeling: before scraping, relabeling rules transform target labels. Example: add a "team" label based on the Kubernetes namespace, drop targets matching a specific annotation, or rename metric labels. Scrape internals: for each target, Prometheus sends an HTTP GET to /metrics. The target responds with metrics in the Prometheus text exposition format: metric_name{label1="value1", label2="value2"} 42.5 1618000000000. Each line carries a metric name, labels, a value, and an optional millisecond timestamp. The response is parsed, samples are appended to the TSDB, and staleness markers are written for series that were present in the previous scrape but are missing from this one.
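The text exposition format is simple enough to illustrate with a short parser. This is a toy sketch, not Prometheus's actual parser: it ignores # HELP / # TYPE comments, escaping inside label values, and histogram/summary conventions.

```python
import re

# Toy parser for the Prometheus text exposition format.
# Assumes no escaped characters inside label values; the real parser
# also handles # HELP / # TYPE comments and escape sequences.
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'  # metric name
    r'(?:\{(?P<labels>[^}]*)\})?'           # optional {label="value",...}
    r'\s+(?P<value>[^ ]+)'                  # sample value
    r'(?:\s+(?P<ts>\d+))?$'                 # optional millisecond timestamp
)

def parse_line(line):
    line = line.strip()
    m = LINE_RE.match(line)
    if not m or line.startswith('#'):
        return None  # comment or unparseable line
    labels = {}
    if m.group('labels'):
        for pair in m.group('labels').split(','):
            k, _, v = pair.strip().partition('=')
            labels[k] = v.strip('"')
    ts = int(m.group('ts')) if m.group('ts') else None
    return m.group('name'), labels, float(m.group('value')), ts

print(parse_line('http_requests_total{method="GET",status="200"} 42.5 1618000000000'))
```

A production scraper runs this over the whole response body, one line at a time, appending each resulting sample to the TSDB.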
TSDB: Time-Series Database Internals
Prometheus TSDB is a custom time-series database optimized for: high write throughput (millions of samples per second), efficient time-range queries, and label-based filtering. Storage layout: data is organized into 2-hour blocks. Each block contains: an index (mapping from label sets to series, and from series to chunk file positions), chunk files (compressed time-series data), and a tombstones file (marking deleted data). Within each block, each time series (unique combination of metric name + labels) is stored as a series of chunks. Each chunk holds ~120 samples compressed with: delta-of-delta encoding for timestamps (regular 15-second intervals compress to ~1 bit per sample) and XOR encoding for values (consecutive similar values compress to a few bits). Result: ~1.3 bytes per sample on average. Write path: (1) Incoming samples are written to the in-memory “head block” (the current 2-hour window). (2) The head block is also written to a WAL (Write-Ahead Log) for crash recovery. (3) Every 2 hours, the head block is compacted into an immutable on-disk block. (4) Background compaction merges multiple blocks into larger blocks (reducing total blocks and improving query performance). Retention: configured by time (e.g., 15 days) or size (e.g., 50 GB). Old blocks are deleted when the retention limit is exceeded.
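The compression intuition can be shown with a toy cost model. This sketch only counts approximate bits for the delta-of-delta and XOR schemes; the bit widths are simplifications, not the exact Gorilla-style bucket layout the real TSDB uses.

```python
import struct

def ts_bits(timestamps):
    """Toy cost model for delta-of-delta timestamp encoding: a zero
    delta-of-delta costs 1 bit, anything else a flat 16 bits.
    (The real TSDB uses variable-width buckets per the Gorilla paper.)"""
    bits = 64 + 64  # first timestamp raw, first delta (simplified)
    prev_delta = timestamps[1] - timestamps[0]
    for a, b in zip(timestamps[1:], timestamps[2:]):
        delta = b - a
        bits += 1 if delta == prev_delta else 16
        prev_delta = delta
    return bits

def val_bits(values):
    """Toy cost model for XOR value encoding: an identical consecutive
    float costs 1 bit, otherwise control bits plus the significant XOR bits."""
    bits = 64  # first value stored raw
    prev = struct.unpack('>Q', struct.pack('>d', values[0]))[0]
    for v in values[1:]:
        cur = struct.unpack('>Q', struct.pack('>d', v))[0]
        x = prev ^ cur
        bits += 1 if x == 0 else 2 + x.bit_length()
        prev = cur
    return bits

# A full chunk: 120 samples at a regular 15s interval, stable value.
ts = [1618000000 + 15 * i for i in range(120)]
vals = [42.0] * 120
per_sample = (ts_bits(ts) + val_bits(vals)) / 8 / 120
print(round(per_sample, 2), "bytes/sample")  # far below the 16 raw bytes
```

With perfectly regular timestamps and a constant value, the toy model lands well under 1 byte per sample; real workloads with jitter and changing values average out around the ~1.3 bytes/sample figure.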
PromQL Query Engine
PromQL (Prometheus Query Language) enables powerful time-series queries: (1) Instant queries — evaluate at a single point in time. http_requests_total{method="GET", status="200"} returns the current value for all matching series. (2) Range queries — evaluate over a time range. rate(http_requests_total[5m]) computes the per-second rate over the last 5 minutes. (3) Aggregations — sum, avg, min, max, count, quantile across label dimensions. sum(rate(http_requests_total[5m])) by (service) computes the total request rate per service. (4) Functions — rate (per-second rate from counters), histogram_quantile (compute percentiles from histogram buckets), increase (total increase over a window), and many more. Query execution: (1) The query engine parses the PromQL expression into an AST. (2) Series selection: find all time series matching the label selectors (using the inverted index in each block). (3) Data retrieval: read the required time range from the chunks. (4) Evaluation: apply functions, aggregations, and operators. (5) Return the result as a vector (instant) or matrix (range). Performance: queries that scan many series (high cardinality) or long time ranges are expensive. The inverted index enables fast label filtering. But a query like {job="api"} matching 100,000 series requires reading data for all 100,000 — this can be slow. Cardinality management is the most important Prometheus operational concern.
Alertmanager
Prometheus evaluates alert rules (PromQL expressions with thresholds) every evaluation interval (typically 1 minute). When an alert fires, it is sent to Alertmanager. Alertmanager responsibilities: (1) Grouping — group related alerts into a single notification. All alerts with alertname="HighErrorRate" for the same service are grouped. One notification instead of 50 individual alerts. (2) Deduplication — if multiple Prometheus instances (HA pair) fire the same alert, Alertmanager deduplicates (sends one notification, not two). (3) Silencing — suppress alerts during planned maintenance. Create a silence matching specific labels and a time window. (4) Inhibition — suppress lower-severity alerts when a higher-severity alert is firing. If "ServiceDown" is firing, inhibit "HighLatency" for the same service (latency is a symptom; the service being down is the root cause). (5) Routing — route alerts to the correct notification channel based on labels. severity="critical" -> PagerDuty (pages on-call). severity="warning" -> Slack channel. team="platform" -> platform-alerts channel. Alertmanager HA: run 2-3 Alertmanager instances in a cluster. They communicate via a gossip protocol to share silence and notification state. All instances receive all alerts; they deduplicate among themselves to ensure exactly one notification per alert group.
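Routing and inhibition both reduce to label matching. A minimal sketch of the logic — the rule structures here are hypothetical stand-ins, not Alertmanager's actual YAML config schema (which uses a routing tree with group_by, group_wait, and so on):

```python
# Hypothetical rule structures illustrating Alertmanager-style
# routing (first match wins) and inhibition.
ROUTES = [
    ({"severity": "critical"}, "pagerduty"),
    ({"team": "platform"}, "slack:#platform-alerts"),
    ({"severity": "warning"}, "slack:#alerts"),
]

INHIBIT_RULES = [
    # (source matcher, target matcher, labels that must be equal)
    ({"alertname": "ServiceDown"}, {"alertname": "HighLatency"}, ["service"]),
]

def matches(matcher, labels):
    return all(labels.get(k) == v for k, v in matcher.items())

def route(alert):
    for matcher, receiver in ROUTES:
        if matches(matcher, alert):
            return receiver
    return "default"

def inhibited(alert, firing):
    """True if some firing source alert suppresses this target alert."""
    for src, tgt, equal in INHIBIT_RULES:
        if matches(tgt, alert):
            for other in firing:
                if matches(src, other) and all(
                    alert.get(k) == other.get(k) for k in equal
                ):
                    return True
    return False

firing = [{"alertname": "ServiceDown", "service": "api", "severity": "critical"}]
latency = {"alertname": "HighLatency", "service": "api", "severity": "warning"}
print(route(firing[0]))            # pagerduty
print(inhibited(latency, firing))  # True: symptom suppressed by root cause
```

The "equal labels" list is what scopes inhibition: HighLatency for a different service than the one that is down still pages normally.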
Long-Term Storage: Thanos and Cortex
Prometheus local TSDB has a practical retention limit of 15-30 days (limited by local disk size and query performance). For long-term storage and a global view across multiple Prometheus instances: Thanos architecture: (1) Thanos Sidecar — runs alongside each Prometheus instance. Uploads the 2-hour TSDB blocks to object storage (S3, GCS) as they are compacted. Prometheus retains 2 hours locally; all historical data is in object storage. (2) Thanos Store — a gateway that serves data from object storage. Query requests for historical data are served from S3/GCS. (3) Thanos Query — a PromQL-compatible query frontend. It fans out queries to Prometheus instances (for recent data) and Thanos Store (for historical data). The user sees a seamless view from minutes ago to years ago. (4) Thanos Compactor — background compaction and downsampling of data in object storage. It merges overlapping blocks and downsamples older data to 5-minute and 1-hour resolutions, reducing storage and query costs for historical queries. Cortex / Mimir (alternative): a multi-tenant, horizontally scalable Prometheus-compatible system. Prometheus remote-writes samples to Cortex, which stores them in a distributed TSDB. Better for: multi-tenant SaaS (each customer gets their own metrics namespace), very high cardinality (distributes across many nodes), and when the Thanos sidecar model does not fit the architecture. Choice: Thanos for extending existing Prometheus deployments with long-term storage. Cortex/Mimir for multi-tenant or very large-scale deployments requiring horizontal scaling of ingestion and querying.
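Downsampling replaces raw samples with per-window aggregates. A sketch of the idea: Thanos actually stores five aggregates per window (count, sum, min, max, and a counter/last value so functions like rate() still work); this toy version keeps four of them.

```python
def downsample(samples, resolution):
    """Aggregate raw (timestamp_seconds, value) samples into fixed
    windows of `resolution` seconds, keeping count/sum/min/max per
    window (the counter/last-value aggregate is omitted here)."""
    windows = {}
    for ts, v in samples:
        bucket = ts - ts % resolution  # window start timestamp
        agg = windows.setdefault(bucket, {"count": 0, "sum": 0.0,
                                          "min": v, "max": v})
        agg["count"] += 1
        agg["sum"] += v
        agg["min"] = min(agg["min"], v)
        agg["max"] = max(agg["max"], v)
    return sorted(windows.items())

# One hour of 15-second samples downsampled to 5-minute resolution:
raw = [(t, float(t % 600)) for t in range(0, 3600, 15)]
ds = downsample(raw, 300)
print(len(raw), "raw samples ->", len(ds), "windows")  # 240 -> 12
```

A year-long query over 5-minute downsampled data reads 20x fewer samples than raw 15-second data, which is what keeps historical queries over object storage affordable.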