Low Level Design: Time-Series Database

Introduction

Time-series databases (TSDBs) are optimized for append-heavy workloads of (metric, timestamp, value) triples. Core use cases include metrics monitoring, IoT telemetry, and financial tick data. Unlike general-purpose databases, TSDBs exploit the temporal ordering of writes and the compressibility of sequential numeric data to achieve far higher write throughput and storage efficiency.

Write Path

Incoming data points arrive in roughly time order. The write path: (1) a write buffer accumulates points in memory; (2) the WAL (write-ahead log) records each point for durability before the write is acknowledged; (3) buffered points are appended to the active chunk — the current 2-hour block in Prometheus TSDB; (4) when full, the chunk is written out as a compressed immutable file; (5) sealed chunks are compacted into larger blocks for long-term storage.

Full pipeline: memory → WAL → active chunk → sealed chunk → compacted block → long-term storage. Out-of-order writes go to a separate out-of-order buffer and are merged on compaction. The WAL is truncated once the corresponding chunk is durably written.
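The WAL-then-buffer ordering can be sketched as follows. This is a minimal illustration, not any real TSDB's API; the class name, the `flush_threshold` parameter, and the use of a generic file-like object for the WAL are all assumptions made for the example.

```python
import json

class WritePath:
    """Sketch of the buffer-then-seal write path; `wal_file` is any
    writable file-like object standing in for an on-disk WAL segment."""

    def __init__(self, wal_file, flush_threshold=3):
        self.wal = wal_file          # write-ahead log: durability before ack
        self.buffer = []             # in-memory active chunk
        self.flush_threshold = flush_threshold
        self.sealed_chunks = []      # immutable chunks awaiting compaction

    def append(self, metric, ts, value):
        # Log to the WAL first, so the point survives a crash.
        self.wal.write(json.dumps([metric, ts, value]) + "\n")
        self.wal.flush()             # a real TSDB would fsync here
        # Append to the in-memory active chunk.
        self.buffer.append((metric, ts, value))
        # Seal the chunk when full; the WAL segment covering it
        # can then be truncated.
        if len(self.buffer) >= self.flush_threshold:
            self.sealed_chunks.append(tuple(self.buffer))
            self.buffer = []
```

Sealing copies the buffer into an immutable tuple before clearing it, mirroring the rule that a sealed chunk is never modified.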

Chunk and Block Architecture

Chunk: a time-ordered array of (timestamp, value) pairs for a single metric series within a time window. Block: all chunks for all series within a time window (2 hours in Prometheus). Blocks are compacted into larger blocks (12h, 24h) to reduce the number of open file handles and improve query performance by reading fewer, larger files.

Query reads only the blocks whose time range overlaps the query range. Immutable blocks allow simple caching — a block written once is never modified, so cached reads are always valid. Block metadata (series labels, time range, chunk offsets) is stored in an index file within each block directory.
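The overlap test that selects blocks for a query is a simple interval check. A sketch, assuming blocks are described by (min_t, max_t, path) tuples — an illustrative layout, not a real index format:

```python
def overlapping_blocks(blocks, q_start, q_end):
    """Return blocks whose [min_t, max_t] range overlaps [q_start, q_end]."""
    return [b for b in blocks if b[0] <= q_end and b[1] >= q_start]

# Three consecutive 2-hour blocks; a query over 7000-8000s touches two:
blocks = [(0, 7200, "b0"), (7200, 14400, "b1"), (14400, 21600, "b2")]
overlapping_blocks(blocks, 7000, 8000)  # → [(0, 7200, "b0"), (7200, 14400, "b1")]
```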

Compression Encoding

Timestamps: delta-of-delta encoding. Store the first timestamp raw, then the first delta, then the deltas of deltas as variable-length integers. For regular-interval data (every 15s), most delta-of-deltas are 0 and encode to 1 bit. Typical cost: 1–2 bytes per timestamp vs 8 bytes raw.
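The delta-of-delta computation itself is straightforward; a sketch that stops short of the bit-level packing (a real encoder would then write each zero as 1 bit and each non-zero value with a variable-length bucket):

```python
def delta_of_delta(timestamps):
    """Compute the delta-of-delta stream for a list of timestamps.
    Returns (first_ts, first_delta, dods)."""
    first = timestamps[0]
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dods = [b - a for a, b in zip(deltas, deltas[1:])]
    return first, deltas[0], dods

# Regular 15s scrape interval: every delta-of-delta is 0 (1 bit each).
delta_of_delta([1000, 1015, 1030, 1045, 1060])  # → (1000, 15, [0, 0, 0])
```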

Values: Gorilla XOR encoding (Facebook, 2015). XOR consecutive float64 values; the result has many leading and trailing zero bits because consecutive metric values change slowly. Encode only the meaningful bits with a prefix describing leading/trailing zero counts. Typical cost: 1–2 bytes per value for slowly changing metrics. Combined compression ratio: 10–40x vs raw storage.
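The XOR step can be demonstrated directly on float64 bit patterns. This sketch shows only why the encoding works (identical values XOR to zero, nearby values leave long runs of zero bits); it omits the leading/trailing-zero bookkeeping a real encoder performs:

```python
import struct

def xor_stream(values):
    """XOR consecutive float64 bit patterns, as in Gorilla value encoding."""
    bits = [struct.unpack(">Q", struct.pack(">d", v))[0] for v in values]
    return [b ^ a for a, b in zip(bits, bits[1:])]

# An unchanged value XORs to 0 (a single stored bit); a small change
# leaves many leading and trailing zero bits around a short payload.
xor_stream([42.0, 42.0, 42.5])  # → [0, 0x0000400000000000]
```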

Downsampling and Rollup

Raw data is retained for 15 days. Hourly rollup (min, max, sum, count, avg per hour) is retained for 90 days. Daily rollup is retained for 2 years. The rollup job runs continuously: it reads raw chunks, computes aggregates over each rollup window, and writes results to rollup tables with the same compression encoding.
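The per-window aggregation the rollup job performs can be sketched as below; the dictionary-of-tuples return shape is illustrative, not a real rollup table schema:

```python
from collections import defaultdict

def hourly_rollup(points):
    """Aggregate raw (ts, value) points into per-hour rollups.
    Returns {hour_start: (min, max, sum, count, avg)}."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % 3600].append(value)   # align to hour boundary
    return {
        hour: (min(vs), max(vs), sum(vs), len(vs), sum(vs) / len(vs))
        for hour, vs in buckets.items()
    }

# Two points in hour 0, one in hour 1:
hourly_rollup([(0, 1.0), (1800, 3.0), (3600, 2.0)])
# → {0: (1.0, 3.0, 4.0, 2, 2.0), 3600: (2.0, 2.0, 2.0, 1, 2.0)}
```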

Queries are automatically routed to the appropriate resolution based on the requested time range: raw data for the last 24 hours, hourly rollup for up to 30 days, daily rollup for longer ranges. This keeps query latency and data scanned roughly constant regardless of the time range queried.
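The routing rule reduces to a threshold check on the requested range, using the cutoffs stated above:

```python
def pick_resolution(range_seconds):
    """Route a query to a resolution tier by requested time range."""
    if range_seconds <= 24 * 3600:          # up to 24 hours → raw
        return "raw"
    if range_seconds <= 30 * 24 * 3600:     # up to 30 days → hourly rollup
        return "hourly"
    return "daily"                          # longer ranges → daily rollup

# pick_resolution(3600) == "raw"; pick_resolution(7 * 24 * 3600) == "hourly"
```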

Retention Policy

Per-metric retention rules allow different metrics to have different lifespans. Critical system metrics: 1 year raw. Application metrics: 90 days raw. Debug metrics: 7 days raw. Automated deletion is triggered by age — blocks older than the retention threshold are removed. ILM (index lifecycle management) handles transitions between hot, warm, and cold storage tiers.

Cold storage: before deletion, blocks are copied to object storage (S3, GCS) for audit and compliance. Cold blocks can be rehydrated on demand for incident retrospectives. Storage cost drops 10–20x on object storage vs SSD, making long-term retention economical for compliance requirements.
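An age-driven lifecycle decision can be sketched as a chain of thresholds. The 7-day hot window and the half-retention warm/cold split here are illustrative assumptions, not taken from any particular system:

```python
def storage_tier(age_days, retention_days):
    """Decide a block's lifecycle stage from its age (thresholds illustrative)."""
    if age_days >= retention_days:
        return "archive-then-delete"   # copy to object storage, then remove
    if age_days >= retention_days // 2:
        return "cold"                  # object storage (S3, GCS)
    if age_days >= 7:
        return "warm"
    return "hot"                       # fast SSD
```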

Query Optimization

Label filtering reduces the set of series to scan before reading any chunk data. The query planner intersects posting lists (sorted lists of series IDs) for each label matcher using set intersection — analogous to an inverted index in full-text search. Time range filtering then selects relevant blocks. Chunk reads within selected blocks are parallelized.
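The intersection of posting lists is the classic two-pointer merge on sorted integer arrays:

```python
def intersect(a, b):
    """Two-pointer intersection of two sorted posting lists of series IDs."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# Series matching both label matchers:
intersect([1, 4, 7, 9], [2, 4, 9, 12])  # → [4, 9]
```

Each list is scanned at most once, so the cost is linear in the posting-list lengths rather than in the total number of series.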

Result cache: identical queries within a TTL window return cached results without re-scanning chunks. Recording rules pre-compute expensive aggregations (e.g., 99th percentile across 10,000 series) on ingestion, storing the result as a new synthetic metric. Queries against the recording rule series are orders of magnitude faster than computing the aggregation at query time.
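A recording rule's effect can be sketched as collapsing many input series into one synthetic series at ingestion time. The rule name and input shape here are hypothetical, chosen only to illustrate the idea:

```python
def record(series_values, rule_name="job:requests:sum"):
    """Pre-aggregate many series into one synthetic series per interval.
    `rule_name` is a hypothetical synthetic metric name."""
    return {rule_name: sum(series_values.values())}

# Thousands of per-instance series become one pre-computed point per
# interval; dashboards query the synthetic series instead of raw data.
record({"api-1": 5.0, "api-2": 7.0})  # → {"job:requests:sum": 12.0}
```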

Frequently Asked Questions: Time-Series Database Design

Q: How does Gorilla use delta-of-delta timestamp encoding and XOR float compression?
A: Facebook’s Gorilla TSDB paper (2015) introduced two key compression techniques. For timestamps, it stores the first timestamp as a full 64-bit value, then encodes successive deltas, and finally encodes the delta-of-delta (the change in the gap between timestamps). Because metrics are scraped at regular intervals, most delta-of-deltas are zero and can be encoded in 1 bit. Non-zero values use a variable-length bucket encoding (7, 9, 12, or 32 bits). For floating-point values, Gorilla XORs consecutive doubles. If the XOR is zero (value unchanged), it stores a single 0 bit. Otherwise it stores the XOR using the meaningful bits between the leading and trailing zeros, sharing the leading/trailing zero counts from the previous block when they match, saving further bits. Together these schemes achieve roughly 1.37 bytes per data point on average, down from 16 bytes raw.

Q: What is the 2-hour block architecture in Gorilla and how does WAL flushing work?
A: Gorilla segments time into fixed 2-hour blocks. Each block is a compressed, append-only chunk for all time series in that window. The active (current) block lives entirely in memory for fast writes and reads. Completed blocks are sealed and persisted to disk or object storage. A write-ahead log (WAL) records every incoming data point before it is compressed into the in-memory block. On crash recovery, the WAL is replayed to reconstruct the in-memory block. The 2-hour granularity balances memory footprint against flush frequency: shorter blocks increase I/O overhead; longer blocks increase memory usage and recovery time. Prometheus adopted the same 2-hour chunk architecture with a WAL that is checkpointed periodically to bound replay time.

Q: How does downsampling and tiered rollup design work in a time-series database?
A: Raw metrics are stored at full resolution (e.g., 15-second scrape interval) in hot storage (fast SSD or memory). As data ages, a background compaction job aggregates it into coarser resolutions: 1-minute rollups after 1 day, 5-minute rollups after 1 week, 1-hour rollups after 1 month. Each rollup stores configurable aggregates per window — min, max, sum, count, and p99. Queries targeting long time ranges automatically route to the appropriate resolution tier to avoid scanning billions of raw points. This tiered design, used by Cortex, Thanos, and M3DB, reduces query latency and storage cost by orders of magnitude for dashboards spanning months or years.

Q: How does a label index posting list enable efficient series selection in a TSDB?
A: In Prometheus-style TSDBs, each time series is identified by a unique set of label key-value pairs (e.g., {job="api", region="us-east", status="500"}). The label index maps each label value to an inverted posting list — a sorted array of series IDs that carry that label. To evaluate a multi-label selector, the database fetches the posting lists for each matcher and intersects them (AND) or unions them (OR) using merge algorithms on sorted integer arrays. This approach is identical to full-text search inverted indexes. Prometheus stores posting lists in its TSDB block index; Cortex and Thanos distribute them across object storage with bloom filters to skip irrelevant blocks.

Q: What is the difference between recording rules and query-time PromQL aggregation?
A: PromQL aggregation executed at query time scans all raw series matching a selector, applies functions (sum, rate, histogram_quantile), and returns results. For high-cardinality metrics or long time ranges this is expensive and slow. Recording rules pre-compute expensive PromQL expressions on a fixed interval (e.g., every 60 seconds) and write the result as a new, lower-cardinality time series stored in the TSDB. Dashboards and alerts then query the pre-computed series instead of the raw data. Recording rules trade write amplification (more series to store) for dramatically faster read performance. They are essential for SLO burn-rate alerts and any dashboard that aggregates across thousands of instances.
