A time-series database (TSDB) is purpose-built for append-heavy, time-ordered metric data. Its key design decisions — chunk storage, timestamp and value compression, continuous aggregation, and tiered retention — are all driven by the access pattern: writes are always at the current time, and reads scan backward over recent data at varying resolutions.
Chunk Storage Architecture
Data is partitioned into time-based chunks per metric. A chunk represents a fixed time window (e.g., 8 hours) for a single metric or metric+label combination. Typical chunk lifecycle:
- Active (in-memory): newly arriving data points are buffered in the current chunk in RAM.
- Sealed: when the time window closes or the chunk hits a size limit, it is compressed and flushed to disk as an immutable block.
- Tiered: sealed chunks age through hot → warm → cold storage tiers based on retention policy.
Chunk granularity is a key tuning parameter. Smaller chunks (1–2 hours) give finer-grained retention and faster deletion but increase metadata overhead. Larger chunks (12–24 hours) reduce metadata and improve scan efficiency but hold memory longer.
Timestamp Compression: Delta-of-Delta Encoding
Raw timestamps are 64-bit integers (nanoseconds since epoch). Storing them raw is wasteful for regular-interval series.
Delta encoding: store the difference between consecutive timestamps instead of absolute values. For a 15-second interval, every delta is 15,000,000,000 nanoseconds — a repeating constant.
Delta-of-delta encoding: store the difference between consecutive deltas. For a perfectly regular series, all delta-of-delta values are 0. These zeros are encoded with a small number of bits using a variable-length scheme:
- Delta-of-delta = 0: encode as a single 0 bit.
- Small delta-of-delta (within ±63): encode with 2-bit prefix + 7-bit value = 9 bits.
- Larger values: progressively wider encodings.
In practice, this compresses timestamps for regular-interval series to ~1.4 bits per sample.
Value Compression: XOR (Gorilla Algorithm)
Consecutive float64 metric values (e.g., CPU utilisation over time) tend to be similar, meaning they share high-order bits.
Algorithm per value:
- XOR the current float64 with the previous float64.
- Compute leading zero count (lzc) and trailing zero count (tzc) of the XOR result.
- If lzc ≥ previous lzc and tzc ≥ previous tzc, reuse the previous block length (1-bit flag).
- Otherwise, encode lzc (5 bits), block length (6 bits), and the meaningful bits of the XOR result.
For slowly changing metrics, meaningful blocks are 10–20 bits, versus 64 bits raw — a 3–6x compression ratio on values before applying a general-purpose compressor (e.g., Snappy or Zstd) on top.
Write Path
- Incoming data point (metric_id, timestamp, value) is routed to the active in-memory chunk for that metric.
- The data point is appended to the chunk's write buffer (uncompressed for fast append).
- WAL entry written for crash recovery.
- When the chunk's time window closes, the write buffer is compressed (delta-of-delta + XOR) and the sealed chunk is written to disk.
- WAL segment for that chunk is cleared.
Read Path
- Chunk pruning: the query planner looks up the MetricChunk catalog for metric_id to identify which chunks overlap the requested time range. Chunks outside the range are skipped entirely.
- Decompression: selected chunks are decompressed — timestamps via delta-of-delta reversal, values via XOR reconstruction.
- Scan and filter: decompressed samples are scanned within the requested time range; aggregations (AVG, SUM, MAX) are applied.
- Tier routing: if the range spans multiple tiers, the query engine reads full-resolution data from hot, rollup data from warm/cold, and merges results.
Continuous Aggregation and Downsampling
Continuous aggregation maintains pre-computed rollup tables that are incrementally updated as new data arrives. For example, a 1-minute rollup stores (MIN, MAX, SUM, COUNT) per metric per minute, computed from raw samples as they are ingested.
Downsampling for tiered retention: when a chunk ages out of the hot tier, a background job reads the raw chunk, computes rollup aggregates at the target resolution (e.g., 1 minute for warm, 1 hour for cold), and writes the rollup chunk to the lower tier. The original full-resolution chunk is then deleted.
SQL DDL: Metadata Catalog
-- Chunk metadata catalog
CREATE TABLE MetricChunk (
id BIGSERIAL PRIMARY KEY,
metric_id BIGINT NOT NULL,
chunk_start TIMESTAMPTZ NOT NULL,
chunk_end TIMESTAMPTZ NOT NULL,
compressed_data BYTEA, -- NULL for remote/tiered chunks
row_count INTEGER NOT NULL DEFAULT 0,
compressed_at TIMESTAMPTZ,
storage_tier VARCHAR(16) NOT NULL DEFAULT 'hot', -- hot, warm, cold
storage_path TEXT, -- S3 path for cold tier
UNIQUE (metric_id, chunk_start)
);
CREATE INDEX idx_chunk_metric_time ON MetricChunk (metric_id, chunk_start, chunk_end);
-- Retention tier configuration
CREATE TABLE RetentionTier (
tier_name VARCHAR(16) PRIMARY KEY,
resolution_seconds INTEGER NOT NULL, -- 1 = raw, 60 = 1min, 3600 = 1hr
max_age_days INTEGER NOT NULL,
storage_class VARCHAR(32) NOT NULL -- ssd, hdd, s3_standard, s3_glacier
);
INSERT INTO RetentionTier VALUES
('hot', 1, 7, 'ssd'),
('warm', 60, 30, 'hdd'),
('cold', 3600, 365, 's3_standard');
Python: Core Operations
import struct
import time
from typing import Iterator
CHUNK_DURATION_SECONDS = 8 * 3600 # 8-hour chunks
# In-memory chunk buffer: metric_id -> list of (timestamp_ns, value)
_active_chunks: dict[int, list[tuple[int, float]]] = {}
def write_point(metric_id: int, timestamp: int, value: float) -> None:
"""Buffer a data point into the active in-memory chunk."""
if metric_id not in _active_chunks:
_active_chunks[metric_id] = []
_active_chunks[metric_id].append((timestamp, value))
chunk_start = (timestamp // (CHUNK_DURATION_SECONDS * 10**9)) * CHUNK_DURATION_SECONDS * 10**9
if timestamp >= chunk_start + CHUNK_DURATION_SECONDS * 10**9:
seal_chunk(metric_id)
def read_range(metric_id: int, start_ts: int, end_ts: int) -> list[tuple[int, float]]:
"""Read all data points for a metric within a nanosecond timestamp range."""
# 1. Query MetricChunk catalog for overlapping chunks (DB call omitted for brevity)
# 2. Decompress and scan each chunk
results = []
buffered = _active_chunks.get(metric_id, [])
for ts, val in buffered:
if start_ts <= ts bytes:
"""Compress the active chunk using delta-of-delta + XOR encoding."""
points = _active_chunks.pop(metric_id, [])
if not points:
return b''
timestamps = [p[0] for p in points]
values = [p[1] for p in points]
compressed_ts = _delta_of_delta_encode(timestamps)
compressed_vals = _xor_encode(values)
payload = compressed_ts + compressed_vals
# Persist to MetricChunk table (DB call omitted)
return payload
def downsample_chunk(chunk_id: int, resolution_seconds: int) -> list[dict]:
"""Aggregate a sealed chunk into rollup buckets of the given resolution."""
# Load and decompress chunk (simplified)
raw_points = [] # would be loaded from MetricChunk.compressed_data
buckets: dict[int, list[float]] = {}
for ts, val in raw_points:
bucket = (ts // (resolution_seconds * 10**9)) * resolution_seconds * 10**9
buckets.setdefault(bucket, []).append(val)
return [
{'bucket': b, 'min': min(vals), 'max': max(vals),
'avg': sum(vals)/len(vals), 'count': len(vals)}
for b, vals in sorted(buckets.items())
]
def _delta_of_delta_encode(timestamps: list[int]) -> bytes:
"""Simplified delta-of-delta encoding (returns raw bytes for illustration)."""
if not timestamps:
return b''
deltas = [timestamps[0]] + [timestamps[i]-timestamps[i-1] for i in range(1,len(timestamps))]
dod = [deltas[0]] + [deltas[i]-deltas[i-1] for i in range(1,len(deltas))]
return struct.pack(f'>{len(dod)}q', *dod)
def _xor_encode(values: list[float]) -> bytes:
"""Simplified XOR encoding for float64 values."""
if not values:
return b''
raw = struct.pack(f'>{len(values)}d', *values)
return raw # production: apply bit-packing on XOR residuals
Design Considerations Summary
- Chunk size: balance between metadata overhead (small chunks) and memory usage (large chunks); 2–8 hours is typical.
- Compression: delta-of-delta + XOR achieves 10–40x compression on regular time-series before general-purpose compression.
- Continuous aggregation: pre-compute rollups at ingest time to serve dashboard queries without scanning raw data.
- Tiered retention: full resolution for operational queries (last 7 days), minute rollups for trend analysis (30 days), hourly rollups for long-term analytics (1 year).
- Cardinality: high label cardinality (many unique metric_id combinations) explodes chunk count and metadata size; enforce label cardinality limits at ingest.
See also: Databricks Interview Guide 2026: Spark Internals, Delta Lake, and Lakehouse Architecture
See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering