Time-Series Database Low-Level Design: Chunk Storage, Compression, Downsampling, and Retention

A time-series database (TSDB) is purpose-built for append-heavy, time-ordered metric data. Its key design decisions — chunk storage, timestamp and value compression, continuous aggregation, and tiered retention — are all driven by the access pattern: writes are always at the current time, and reads scan backward over recent data at varying resolutions.

Chunk Storage Architecture

Data is partitioned into time-based chunks per metric. A chunk represents a fixed time window (e.g., 8 hours) for a single metric or metric+label combination. Typical chunk lifecycle:

  1. Active (in-memory): newly arriving data points are buffered in the current chunk in RAM.
  2. Sealed: when the time window closes or the chunk hits a size limit, it is compressed and flushed to disk as an immutable block.
  3. Tiered: sealed chunks age through hot → warm → cold storage tiers based on retention policy.

Chunk granularity is a key tuning parameter. Smaller chunks (1–2 hours) give finer-grained retention and faster deletion but increase metadata overhead. Larger chunks (12–24 hours) reduce metadata and improve scan efficiency but hold memory longer.

Timestamp Compression: Delta-of-Delta Encoding

Raw timestamps are 64-bit integers (nanoseconds since epoch). Storing them raw is wasteful for regular-interval series.

Delta encoding: store the difference between consecutive timestamps instead of absolute values. For a 15-second interval, every delta is 15,000,000,000 nanoseconds — a repeating constant.

Delta-of-delta encoding: store the difference between consecutive deltas. For a perfectly regular series, all delta-of-delta values are 0. These zeros are encoded with a small number of bits using a variable-length scheme:

  • Delta-of-delta = 0: encode as a single 0 bit.
  • Small delta-of-delta (within ±63): encode with 2-bit prefix + 7-bit value = 9 bits.
  • Larger values: progressively wider encodings.

In practice, this compresses timestamps for regular-interval series to ~1.4 bits per sample.

Value Compression: XOR (Gorilla Algorithm)

Consecutive float64 metric values (e.g., CPU utilisation over time) tend to be similar, meaning they share high-order bits.

Algorithm per value:

  1. XOR the current float64 with the previous float64.
  2. Compute leading zero count (lzc) and trailing zero count (tzc) of the XOR result.
  3. If lzc ≥ previous lzc and tzc ≥ previous tzc, reuse the previous block length (1-bit flag).
  4. Otherwise, encode lzc (5 bits), block length (6 bits), and the meaningful bits of the XOR result.

For slowly changing metrics, meaningful blocks are 10–20 bits, versus 64 bits raw — a 3–6x compression ratio on values before applying a general-purpose compressor (e.g., Snappy or Zstd) on top.

Write Path

  1. Incoming data point (metric_id, timestamp, value) is routed to the active in-memory chunk for that metric.
  2. The data point is appended to the chunk's write buffer (uncompressed for fast append).
  3. WAL entry written for crash recovery.
  4. When the chunk's time window closes, the write buffer is compressed (delta-of-delta + XOR) and the sealed chunk is written to disk.
  5. WAL segment for that chunk is cleared.

Read Path

  1. Chunk pruning: the query planner looks up the MetricChunk catalog for metric_id to identify which chunks overlap the requested time range. Chunks outside the range are skipped entirely.
  2. Decompression: selected chunks are decompressed — timestamps via delta-of-delta reversal, values via XOR reconstruction.
  3. Scan and filter: decompressed samples are scanned within the requested time range; aggregations (AVG, SUM, MAX) are applied.
  4. Tier routing: if the range spans multiple tiers, the query engine reads full-resolution data from hot, rollup data from warm/cold, and merges results.

Continuous Aggregation and Downsampling

Continuous aggregation maintains pre-computed rollup tables that are incrementally updated as new data arrives. For example, a 1-minute rollup stores (MIN, MAX, SUM, COUNT) per metric per minute, computed from raw samples as they are ingested.

Downsampling for tiered retention: when a chunk ages out of the hot tier, a background job reads the raw chunk, computes rollup aggregates at the target resolution (e.g., 1 minute for warm, 1 hour for cold), and writes the rollup chunk to the lower tier. The original full-resolution chunk is then deleted.

SQL DDL: Metadata Catalog

-- Chunk metadata catalog
CREATE TABLE MetricChunk (
    id              BIGSERIAL PRIMARY KEY,
    metric_id       BIGINT        NOT NULL,
    chunk_start     TIMESTAMPTZ   NOT NULL,
    chunk_end       TIMESTAMPTZ   NOT NULL,
    compressed_data BYTEA,            -- NULL for remote/tiered chunks
    row_count       INTEGER       NOT NULL DEFAULT 0,
    compressed_at   TIMESTAMPTZ,
    storage_tier    VARCHAR(16)   NOT NULL DEFAULT 'hot', -- hot, warm, cold
    storage_path    TEXT,             -- S3 path for cold tier
    UNIQUE (metric_id, chunk_start)
);

CREATE INDEX idx_chunk_metric_time ON MetricChunk (metric_id, chunk_start, chunk_end);

-- Retention tier configuration
CREATE TABLE RetentionTier (
    tier_name          VARCHAR(16)  PRIMARY KEY,
    resolution_seconds INTEGER      NOT NULL,  -- 1 = raw, 60 = 1min, 3600 = 1hr
    max_age_days       INTEGER      NOT NULL,
    storage_class      VARCHAR(32)  NOT NULL   -- ssd, hdd, s3_standard, s3_glacier
);

INSERT INTO RetentionTier VALUES
    ('hot',  1,    7,   'ssd'),
    ('warm', 60,   30,  'hdd'),
    ('cold', 3600, 365, 's3_standard');

Python: Core Operations

import struct
import time
from typing import Iterator

CHUNK_DURATION_SECONDS = 8 * 3600  # 8-hour chunks

# In-memory chunk buffer: metric_id -> list of (timestamp_ns, value)
_active_chunks: dict[int, list[tuple[int, float]]] = {}

def write_point(metric_id: int, timestamp: int, value: float) -> None:
    """Buffer a data point into the active in-memory chunk."""
    if metric_id not in _active_chunks:
        _active_chunks[metric_id] = []
    _active_chunks[metric_id].append((timestamp, value))
    chunk_start = (timestamp // (CHUNK_DURATION_SECONDS * 10**9)) * CHUNK_DURATION_SECONDS * 10**9
    if timestamp >= chunk_start + CHUNK_DURATION_SECONDS * 10**9:
        seal_chunk(metric_id)

def read_range(metric_id: int, start_ts: int, end_ts: int) -> list[tuple[int, float]]:
    """Read all data points for a metric within a nanosecond timestamp range."""
    # 1. Query MetricChunk catalog for overlapping chunks (DB call omitted for brevity)
    # 2. Decompress and scan each chunk
    results = []
    buffered = _active_chunks.get(metric_id, [])
    for ts, val in buffered:
        if start_ts <= ts  bytes:
    """Compress the active chunk using delta-of-delta + XOR encoding."""
    points = _active_chunks.pop(metric_id, [])
    if not points:
        return b''
    timestamps = [p[0] for p in points]
    values = [p[1] for p in points]
    compressed_ts = _delta_of_delta_encode(timestamps)
    compressed_vals = _xor_encode(values)
    payload = compressed_ts + compressed_vals
    # Persist to MetricChunk table (DB call omitted)
    return payload

def downsample_chunk(chunk_id: int, resolution_seconds: int) -> list[dict]:
    """Aggregate a sealed chunk into rollup buckets of the given resolution."""
    # Load and decompress chunk (simplified)
    raw_points = []  # would be loaded from MetricChunk.compressed_data
    buckets: dict[int, list[float]] = {}
    for ts, val in raw_points:
        bucket = (ts // (resolution_seconds * 10**9)) * resolution_seconds * 10**9
        buckets.setdefault(bucket, []).append(val)
    return [
        {'bucket': b, 'min': min(vals), 'max': max(vals),
         'avg': sum(vals)/len(vals), 'count': len(vals)}
        for b, vals in sorted(buckets.items())
    ]

def _delta_of_delta_encode(timestamps: list[int]) -> bytes:
    """Simplified delta-of-delta encoding (returns raw bytes for illustration)."""
    if not timestamps:
        return b''
    deltas = [timestamps[0]] + [timestamps[i]-timestamps[i-1] for i in range(1,len(timestamps))]
    dod = [deltas[0]] + [deltas[i]-deltas[i-1] for i in range(1,len(deltas))]
    return struct.pack(f'>{len(dod)}q', *dod)

def _xor_encode(values: list[float]) -> bytes:
    """Simplified XOR encoding for float64 values."""
    if not values:
        return b''
    raw = struct.pack(f'>{len(values)}d', *values)
    return raw  # production: apply bit-packing on XOR residuals

Design Considerations Summary

  • Chunk size: balance between metadata overhead (small chunks) and memory usage (large chunks); 2–8 hours is typical.
  • Compression: delta-of-delta + XOR achieves 10–40x compression on regular time-series before general-purpose compression.
  • Continuous aggregation: pre-compute rollups at ingest time to serve dashboard queries without scanning raw data.
  • Tiered retention: full resolution for operational queries (last 7 days), minute rollups for trend analysis (30 days), hourly rollups for long-term analytics (1 year).
  • Cardinality: high label cardinality (many unique metric_id combinations) explodes chunk count and metadata size; enforce label cardinality limits at ingest.

See also: Databricks Interview Guide 2026: Spark Internals, Delta Lake, and Lakehouse Architecture

See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering

See also: Netflix Interview Guide 2026: Streaming Architecture, Recommendation Systems, and Engineering Excellence

Scroll to Top