Time Series Metrics System Low-Level Design

A time series metrics system stores and queries measurements that change over time: server CPU, API latency, business KPIs, and IoT sensor readings. It must handle high write throughput, efficient range queries, and downsampling for long-term storage. This design question comes up at Datadog, Netflix, and any company building observability infrastructure.

Core Data Model Options

Option 1: TimescaleDB (PostgreSQL extension for time series)

CREATE TABLE Metric (
    time        TIMESTAMPTZ NOT NULL,
    metric_name TEXT NOT NULL,
    tags        JSONB,              -- {'host': 'web-01', 'region': 'us-east'}
    value       DOUBLE PRECISION NOT NULL
);

-- Partition by time automatically (hypertable)
SELECT create_hypertable('Metric', 'time', chunk_time_interval => INTERVAL '1 day');

-- Index for fast name + time range queries
CREATE INDEX ON Metric(metric_name, time DESC);
-- Compress old chunks automatically
SELECT add_compression_policy('Metric', INTERVAL '7 days');

Option 2: Purpose-built TSDB (InfluxDB, Prometheus, VictoriaMetrics)
  Better compression, faster queries, but less SQL flexibility.
  Use for pure metrics workloads at scale.

Option 3: Wide-column (Cassandra/Bigtable)
  Row key: (metric_name, time_bucket). Columns: timestamps.
  Excellent write throughput, good for streaming reads.
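The (metric_name, time_bucket) row-key scheme in Option 3 can be sketched as follows. This is a minimal illustration, not a client for any specific store; the one-hour bucket width and the helper names are assumptions:

```python
BUCKET_SECONDS = 3600  # one partition per metric per hour (assumed width)

def row_key(metric_name, epoch_seconds):
    # Partition key (metric_name, bucket_start): bounds each partition's
    # size and keeps a time-range read inside a few adjacent partitions.
    bucket_start = int(epoch_seconds // BUCKET_SECONDS) * BUCKET_SECONDS
    return (metric_name, bucket_start)

def column_key(epoch_seconds):
    # Within a partition, columns sort by offset into the bucket,
    # so a time-range read becomes a contiguous column slice.
    return int(epoch_seconds) % BUCKET_SECONDS
```

Bounding each partition to one bucket of one metric is what keeps writes append-only and reads sequential in a wide-column store.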

Write Path: Batching and Buffering

import json
import threading
import time
from datetime import datetime, timezone

class MetricsBuffer:
    """Buffer metrics in memory; flush on a timer or when the buffer fills."""

    def __init__(self, flush_interval=10, max_buffer=10000):
        self._buffer = []
        self._lock = threading.Lock()
        self._flush_interval = flush_interval
        self._max_buffer = max_buffer
        self._start_flush_thread()

    def _start_flush_thread(self):
        def loop():
            while True:
                time.sleep(self._flush_interval)
                self._flush()
        threading.Thread(target=loop, daemon=True).start()

    def record(self, metric_name, value, tags=None, timestamp=None):
        with self._lock:
            self._buffer.append({
                'time': timestamp or datetime.now(timezone.utc),
                'metric_name': metric_name,
                'tags': tags or {},
                'value': value,
            })
            full = len(self._buffer) >= self._max_buffer
        if full:
            self._flush()  # call outside the lock: _flush re-acquires it

    def _flush(self):
        with self._lock:
            if not self._buffer:
                return
            batch, self._buffer = self._buffer, []

        # Bulk insert to TimescaleDB: one statement for the whole batch
        db.execute("""
            INSERT INTO Metric (time, metric_name, tags, value)
            SELECT * FROM UNNEST(%(times)s::timestamptz[], %(names)s::text[],
                                 %(tags)s::jsonb[], %(values)s::float8[])
        """, {
            'times': [r['time'] for r in batch],
            'names': [r['metric_name'] for r in batch],
            'tags': [json.dumps(r['tags']) for r in batch],
            'values': [r['value'] for r in batch],
        })

Query: Aggregated Time Range

def query_metrics(metric_name, start, end, interval='5m',
                  aggregation='avg', tag_filters=None):
    """
    Return aggregated metric values bucketed by time interval.
    interval: '1m', '5m', '1h', '1d'
    aggregation: 'avg', 'max', 'min', 'sum', 'count', 'p99'
    """
    agg_func = {
        'avg': 'AVG(value)',
        'max': 'MAX(value)',
        'min': 'MIN(value)',
        'sum': 'SUM(value)',
        'count': 'COUNT(*)',
        'p99': 'percentile_cont(0.99) WITHIN GROUP (ORDER BY value)',
    }[aggregation]

    params = {'name': metric_name, 'start': start, 'end': end,
              'interval': interval}
    tag_clause = ''
    if tag_filters:
        # Parameterize tag keys and values -- interpolating them into the
        # SQL string directly would be a SQL injection risk
        conditions = []
        for i, (k, v) in enumerate(tag_filters.items()):
            conditions.append(f"tags->>%(tag_k{i})s = %(tag_v{i})s")
            params[f'tag_k{i}'] = k
            params[f'tag_v{i}'] = v
        tag_clause = 'AND ' + ' AND '.join(conditions)

    return db.execute(f"""
        SELECT
            time_bucket(%(interval)s::interval, time) AS bucket,
            {agg_func} AS value
        FROM Metric
        WHERE metric_name=%(name)s
          AND time BETWEEN %(start)s AND %(end)s
          {tag_clause}
        GROUP BY bucket
        ORDER BY bucket ASC
    """, params)

Downsampling for Long-Term Storage

-- Continuous aggregates: pre-compute hourly rollups
-- (TimescaleDB feature; equivalent: a scheduled materialized view job)

CREATE MATERIALIZED VIEW metric_hourly
WITH (timescaledb.continuous) AS
    SELECT
        time_bucket('1 hour', time) AS hour,
        metric_name,
        tags,
        AVG(value) AS avg_value,
        MAX(value) AS max_value,
        MIN(value) AS min_value,
        COUNT(*) AS sample_count
    FROM Metric
    GROUP BY hour, metric_name, tags;

-- Retention policy: keep raw data 7 days, hourly rollups 90 days, daily 2 years
SELECT add_retention_policy('Metric', INTERVAL '7 days');
SELECT add_retention_policy('metric_hourly', INTERVAL '90 days');
-- (a metric_daily view and its 2-year retention policy are defined analogously)

# Query routing: pick raw data or a rollup based on the time range
def smart_query(metric_name, start, end):
    range_hours = (end - start).total_seconds() / 3600
    if range_hours <= 24:       # last day: raw data, 1-minute buckets
        return query_raw(metric_name, start, end, '1m')
    elif range_hours <= 168:    # last week: hourly rollups
        return query_hourly(metric_name, start, end, '1h')
    else:                       # longer: daily rollups
        return query_daily(metric_name, start, end, '1d')

Alerting on Threshold Breach

def check_alerts():
    """Run every 60 seconds via cron."""
    active_alerts = db.query("SELECT * FROM AlertRule WHERE enabled=true")

    for alert in active_alerts:
        # Get recent metric value
        recent = db.execute("""
            SELECT AVG(value) AS val
            FROM Metric
            WHERE metric_name=%(name)s
              -- a %(window)s placeholder inside a quoted INTERVAL literal
              -- is never substituted; use make_interval instead
              AND time > NOW() - make_interval(secs => %(window)s)
        """, {'name': alert.metric_name, 'window': alert.window_seconds}).first()

        if recent.val is None:
            continue

        breached = (
            (alert.condition == 'gt' and recent.val > alert.threshold) or
            (alert.condition == 'lt' and recent.val < alert.threshold)
        )

        if breached and not is_alert_firing(alert.id):
            fire_alert(alert, recent.val)
        elif not breached and is_alert_firing(alert.id):
            resolve_alert(alert.id)
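A check like the one above flaps if the metric oscillates around the threshold: the alert fires and resolves on alternating runs. A common refinement is hysteresis, requiring several consecutive breached checks before firing and several clear checks before resolving. The sketch below shows the idea; the class name and the 3-check default are assumptions:

```python
class AlertState:
    """Hysteresis around an alert threshold to suppress flapping."""

    def __init__(self, required_consecutive=3):
        self.required = required_consecutive
        self.breach_streak = 0
        self.clear_streak = 0
        self.firing = False

    def observe(self, breached):
        # Feed one check result; returns 'fire', 'resolve', or None.
        if breached:
            self.breach_streak += 1
            self.clear_streak = 0
            if not self.firing and self.breach_streak >= self.required:
                self.firing = True
                return 'fire'
        else:
            self.clear_streak += 1
            self.breach_streak = 0
            if self.firing and self.clear_streak >= self.required:
                self.firing = False
                return 'resolve'
        return None
```

With the 60-second check cadence above, a 3-check requirement means a breach must persist for about 3 minutes before paging anyone, and a brief dip below threshold does not resolve a firing alert.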

Key Interview Points

  • Write batching is non-negotiable: A metrics agent on every server may emit hundreds of data points per second. One INSERT per point saturates the DB. Buffer in memory and flush every 5-10 seconds in batches of thousands.
  • time_bucket is the core query primitive: TimescaleDB’s time_bucket (or date_trunc in vanilla Postgres) is the equivalent of GROUP BY for time series. Master this function for interval aggregation queries.
  • Store raw data with short retention: Raw data (1-second resolution) for 7 days is expensive. Downsample to hourly aggregates for 90-day history and daily for 2 years. Design rollup jobs and retention policies upfront.
  • Tag cardinality matters: High-cardinality tags (e.g., user_id as a tag) create millions of unique time series, exploding storage. Tags should be low-cardinality dimensions (host, region, service), not unique identifiers.
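The retention tiers above can be sanity-checked with back-of-envelope arithmetic. Assuming roughly 50 bytes per row and 1,000 active series (both figures illustrative), compare what a full year of raw 1-second data would cost against the rollups:

```python
ROW_BYTES = 50   # rough per-row cost including index overhead (assumed)
SERIES = 1_000

raw_rows = SERIES * 365 * 24 * 3600   # 1-second resolution for a year
hourly_rows = SERIES * 365 * 24       # hourly rollups for a year
daily_rows = SERIES * 730             # daily rollups for two years

print(f"raw:    {raw_rows:>14,} rows")
print(f"hourly: {hourly_rows:>14,} rows")
print(f"daily:  {daily_rows:>14,} rows")
```

That is about 31.5 billion raw rows (~1.6 TB) versus 8.76 million hourly rows (~440 MB) and 730 thousand daily rows (~37 MB): a year of raw data costs terabytes while the rollups fit in well under a gigabyte, which is why keeping raw data for only 7 days and rollups for years is the affordable design.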


