Low Level Design: Metrics Monitoring System

Metric Types

Prometheus defines four core metric types. Counter: a monotonically increasing value that never decreases except when it resets to zero on process restart. Use rate() to compute per-second throughput from a counter. Suitable for request counts, error counts, bytes sent. Gauge: a value that can go up or down arbitrarily. Use for current CPU utilization, memory usage, queue depth, active connections. Histogram: records observations into pre-defined buckets, also tracking count and sum. Use histogram_quantile() to compute percentiles (p50, p95, p99) from histogram data—this is the correct way to measure latency distributions in distributed systems where averaging is misleading. Summary: computes quantiles client-side; less flexible than histograms because quantiles cannot be aggregated across instances.
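
A minimal sketch of the histogram mechanics described above: observations fall into cumulative buckets, and a quantile is estimated by linear interpolation inside the target bucket, the same idea histogram_quantile() applies to the _bucket series. The bucket bounds here are illustrative, not a recommendation.

```python
import bisect

# Illustrative latency bucket bounds in seconds (an assumption for this sketch;
# real services pick bounds to match their expected latency range).
BOUNDS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5]

class Histogram:
    def __init__(self, bounds):
        self.bounds = bounds
        self.counts = [0] * (len(bounds) + 1)  # last slot acts as the +Inf bucket
        self.total = 0
        self.sum = 0.0

    def observe(self, value):
        # First bucket whose upper bound is >= value ("le" semantics).
        i = bisect.bisect_left(self.bounds, value)
        self.counts[i] += 1
        self.total += 1
        self.sum += value

    def quantile(self, q):
        # Walk the cumulative counts; interpolate linearly within the bucket
        # that contains the target rank, as histogram_quantile() does.
        rank = q * self.total
        cumulative = 0
        for i, count in enumerate(self.counts):
            prev = cumulative
            cumulative += count
            if cumulative >= rank:
                if i == len(self.bounds):   # landed in the +Inf bucket
                    return self.bounds[-1]
                lo = self.bounds[i - 1] if i > 0 else 0.0
                hi = self.bounds[i]
                if count == 0:
                    return hi
                return lo + (hi - lo) * (rank - prev) / count
        return self.bounds[-1]
```

Because the buckets are cumulative counters, they can be summed across instances before the quantile is computed—this is exactly the aggregation that summaries cannot do.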

Pull vs Push Model

Prometheus uses a pull model: it scrapes an HTTP /metrics endpoint exposed by each service at a configurable interval (15 seconds is a common choice). The endpoint returns metrics in Prometheus text format. This model has a key operational advantage: if Prometheus cannot scrape a target, the target is definitively down—there’s no ambiguity about whether silence means health or network failure. Service discovery (Kubernetes API, Consul, or file-based SD) provides Prometheus with the list of targets to scrape dynamically. The alternative is a push model used by StatsD and Graphite: services emit metrics to a central aggregator. Push works well for short-lived jobs (use Prometheus Pushgateway) and environments where services can’t expose HTTP endpoints, but introduces the problem of distinguishing "service is healthy and silent" from "service is down."
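
The text format a /metrics endpoint returns is line-oriented and simple. A toy renderer, as a sketch only: the metric names are hypothetical, and the real exposition format also includes # HELP and # TYPE comment lines, which are omitted here.

```python
# Render (name, labels, value) tuples in the shape of the Prometheus text
# exposition format. Simplified: no HELP/TYPE metadata, no escaping of
# special characters in label values, no optional timestamps.
def render_metrics(metrics):
    """metrics: list of (name, labels_dict, value) tuples."""
    lines = []
    for name, labels, value in metrics:
        if labels:
            # Label order is not significant; sorting keeps output stable.
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"
```

A scrape is then just an HTTP GET that parses lines of this shape back into samples, which is why pull-based targets are so cheap to implement and debug with curl.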

Service Discovery

Static scrape configs don’t scale past a few dozen services. Prometheus integrates with dynamic service discovery backends. With Kubernetes SD, Prometheus queries the Kubernetes API server for pods, services, and endpoints matching a selector, and automatically adds or removes scrape targets as pods scale up and down. Each discovered target carries labels from Kubernetes metadata: job, instance, namespace, pod, container. Relabeling rules (in scrape_configs) transform and filter these labels before metrics are stored: drop targets in certain namespaces, rewrite instance labels to human-readable service names, add environment labels from pod annotations. Consul SD works similarly for non-Kubernetes environments. File-based SD provides a middle ground: a separate process writes JSON target files that Prometheus watches for changes.
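
The relabeling step can be sketched as a pipeline of rules applied to each discovered target's label set before scraping. This toy mimics two common relabel_configs actions (drop and replace) under simplified rule shapes; the label values used are hypothetical.

```python
import re

# Apply simplified relabel rules to a target's labels.
# Each rule joins its source_labels with ";" (as Prometheus does), matches
# the regex against the joined value, and either drops the target or
# rewrites target_label from the regex's capture groups.
def relabel(labels, rules):
    labels = dict(labels)
    for rule in rules:
        value = ";".join(labels.get(l, "") for l in rule["source_labels"])
        match = re.fullmatch(rule["regex"], value)
        if rule["action"] == "drop" and match:
            return None  # target filtered out entirely, never scraped
        if rule["action"] == "replace" and match:
            labels[rule["target_label"]] = match.expand(rule["replacement"])
    return labels

rules = [
    # Drop anything discovered in the kube-system namespace.
    {"action": "drop", "source_labels": ["namespace"], "regex": "kube-system"},
    # Rewrite instance to a readable "<namespace>/<pod>" name.
    {"action": "replace", "source_labels": ["namespace", "pod"],
     "regex": r"(.+);(.+)", "target_label": "instance",
     "replacement": r"\1/\2"},
]
```

The same mechanism filters and rewrites targets from Consul or file-based SD, since relabeling operates on label sets regardless of which backend discovered them.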

Time Series Storage

Prometheus stores time series in its embedded TSDB. Data is written to a write-ahead log (WAL) first for durability, then accumulated in memory for 2 hours before being flushed to an immutable on-disk block. Each block is a directory containing chunks (compressed samples), an index mapping label sets to series IDs, and metadata. The compaction process merges small blocks into larger ones over time, reducing disk I/O for range queries. Local TSDB retention is limited (default 15 days) and single-node; for multi-cluster federation and long-term retention, Thanos or Cortex extend Prometheus, as described under Long-Term Storage below.
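
The write path above can be modeled in a few lines: every sample hits the WAL first, accumulates in an in-memory head, and is flushed to an immutable block once the 2-hour window closes. This is a conceptual sketch only; the real TSDB stores compressed chunks plus an index, not raw tuples.

```python
TWO_HOURS = 2 * 60 * 60  # block window in seconds

class TSDB:
    def __init__(self):
        self.wal = []          # append-only log for crash recovery
        self.head = []         # in-memory samples awaiting flush
        self.blocks = []       # immutable flushed blocks
        self.head_start = None # timestamp opening the current window

    def append(self, series, ts, value):
        if self.head and ts - self.head_start >= TWO_HOURS:
            self.flush()                       # close out the full window
        if not self.head:
            self.head_start = ts
        self.wal.append((series, ts, value))   # durability first
        self.head.append((series, ts, value))

    def flush(self):
        # An immutable block; real blocks are directories of chunks + index.
        self.blocks.append(tuple(self.head))
        self.head = []
        self.wal = []                          # WAL truncated once data is in a block
```

Compaction would then merge several of these small blocks into one larger block with a combined index, which is why long range queries touch fewer files as data ages.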

PromQL Query Language

PromQL is a functional query language for time series data. Key patterns every engineer should know: rate(http_requests_total[5m]) computes the per-second request rate over a 5-minute sliding window from a counter—always use rate() on counters, never subtract raw values. histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) computes the p99 latency from a histogram. sum by (service) (rate(errors_total[5m])) aggregates error rate across all instances of each service. sum(rate(errors_total[5m])) / sum(rate(requests_total[5m])) computes overall error ratio for SLO calculation. Avoid high-cardinality label values in queries (e.g., user_id as a label)—they create millions of unique time series and degrade Prometheus performance significantly.
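
The reason to use rate() rather than subtracting raw counter values is reset handling. A sketch of the underlying arithmetic, simplified (real rate() also extrapolates to the edges of the range):

```python
# Compute a per-second rate from counter samples inside a window.
# A decrease between successive samples means the counter reset to zero
# (process restart), so the new value itself is the increase since the reset.
def rate(samples, window_seconds):
    """samples: list of (timestamp, counter_value) tuples in time order."""
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        if cur >= prev:
            increase += cur - prev
        else:
            increase += cur  # counter reset: counting restarted from zero
    return increase / window_seconds
```

Naively subtracting last minus first across a restart would produce a negative rate; reset detection is what makes counters safe to use on restarting processes.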

Alerting with AlertManager

Alert rules are defined in Prometheus configuration as PromQL expressions with a duration threshold (the for clause): the condition must hold continuously for the full duration before the alert fires, which avoids flapping on transient spikes. Prometheus evaluates all alert rules at the configured evaluation interval (commonly 15 seconds). When an alert fires, Prometheus sends it to AlertManager. AlertManager handles deduplication (same alert from multiple Prometheus instances fires once), grouping (related alerts batched into a single notification), inhibition (suppress downstream alerts when upstream service is down), and routing (different alert labels route to different receivers). Critical alerts go to PagerDuty and page the on-call engineer. Warning-level alerts go to a Slack channel for async review. Silence rules in AlertManager suppress known-good alert conditions during maintenance windows.
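
The pending-to-firing transition can be sketched as a small state machine over successive rule evaluations, assuming the inactive/pending/firing states Prometheus uses:

```python
# Evaluate an alert's "for" clause: the condition must be continuously true
# across evaluations for at least for_seconds before the alert fires.
# Any false evaluation resets the pending timer.
def alert_state(evaluations, for_seconds):
    """evaluations: list of (timestamp, condition_true) in time order."""
    pending_since = None
    state = "inactive"
    for ts, true_now in evaluations:
        if not true_now:
            pending_since = None
            state = "inactive"
        else:
            if pending_since is None:
                pending_since = ts  # condition just became true
            state = "firing" if ts - pending_since >= for_seconds else "pending"
    return state
```

A transient spike that clears before the duration elapses never leaves the pending state, so no notification is sent—this is exactly the anti-flapping behavior described above.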

Dashboard Design

Grafana is the standard dashboard layer. Two systematic approaches to dashboard layout: the USE method (Utilization, Saturation, Errors) applies to resource metrics—for each infrastructure resource (CPU, memory, disk, network) show current utilization percentage, saturation (queue depth or wait time indicating resource is overloaded), and error rate. The RED method (Rate, Errors, Duration) applies to service metrics—for each service show request rate, error rate, and latency distribution (p50/p95/p99). Every service should have a RED dashboard as a minimum. SLI/SLO tracking panels show the error budget: remaining budget as a percentage, burn rate over the last hour and 6 hours (multi-window alerting), and a projection of when the budget will be exhausted at the current burn rate. Dashboards should link to runbooks and relevant log queries for each panel.
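
The error-budget arithmetic behind those panels is simple enough to state exactly. A sketch assuming a 99.9% availability SLO over a 30-day window (both numbers are illustrative):

```python
# Burn rate: how fast the error budget is being consumed relative to a
# steady spend that exhausts it exactly at the end of the SLO window.
# Rate 1.0 = budget lasts the full window; higher rates exhaust it sooner.
def burn_rate(error_ratio, slo_target):
    budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return error_ratio / budget

# At the current burn rate, hours until the budget is fully spent.
def hours_to_exhaustion(rate, window_days=30):
    if rate <= 0:
        return float("inf")
    return window_days * 24 / rate
```

A burn rate of 14.4 on a 30-day window exhausts the budget in about 50 hours, which is why multi-window alerting commonly pages at that threshold while slower burns only warn.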

Long-Term Storage

Prometheus local storage is unsuitable for long-term retention and cross-cluster queries. Thanos solves both: the Thanos sidecar runs in the same pod as Prometheus and uploads completed 2-hour TSDB blocks to object storage as they’re created. The Thanos Store Gateway exposes blocks from object storage via the gRPC Store API. The Thanos Querier accepts PromQL queries and fans them out to all Store API endpoints (live Prometheus instances + Store Gateways), merges results, and deduplicates data from replicated Prometheus instances using the replica label. This gives you a single query endpoint with a global view of all clusters and effectively unlimited retention at object storage cost. Thanos Compactor runs against object storage to downsample old data (5-minute resolution after 30 days, 1-hour resolution after 1 year) to keep long-range queries fast.
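
The replica-label deduplication step can be sketched as grouping series that differ only in the replica label and keeping one series per group. This toy keeps the series with the most samples; real Thanos performs penalty-based merging at the sample level, so treat this as a simplification.

```python
# Deduplicate series from HA Prometheus replicas.
# Series whose label sets differ only in replica_label are the same logical
# series scraped twice; strip the label, group, and keep one per group.
def deduplicate(series_list, replica_label="replica"):
    groups = {}
    for labels, samples in series_list:
        key = tuple(sorted((k, v) for k, v in labels.items()
                           if k != replica_label))
        best = groups.get(key)
        if best is None or len(samples) > len(best[1]):
            stripped = {k: v for k, v in labels.items() if k != replica_label}
            groups[key] = (stripped, samples)
    return list(groups.values())
```

Preferring the more complete replica means a gap on one Prometheus instance (e.g. during a restart) is papered over by its pair, which is the operational point of running replicas at all.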
