Time-Series Database Low-Level Design: Chunk Storage, Compression, Downsampling, and Retention


A time-series database (TSDB) is purpose-built for append-heavy, time-ordered metric data. Its key design decisions — chunk storage, timestamp and value compression, continuous aggregation, and tiered retention — are all driven by the access pattern: writes are always at the current time, and reads scan backward over recent data at varying resolutions.

Chunk Storage Architecture

Data is partitioned into time-based chunks per metric. A chunk represents a fixed time window (e.g., 8 hours) for a single metric or metric+label combination. Typical chunk lifecycle:

  1. Active (in-memory): newly arriving data points are buffered in the current chunk in RAM.
  2. Sealed: when the time window closes or the chunk hits a size limit, it is compressed and flushed to disk as an immutable block.
  3. Tiered: sealed chunks age through hot → warm → cold storage tiers based on retention policy.

Chunk granularity is a key tuning parameter. Smaller chunks (1–2 hours) give finer-grained retention and faster deletion but increase metadata overhead. Larger chunks (12–24 hours) reduce metadata and improve scan efficiency but hold memory longer.
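The mapping from a timestamp to its chunk window is a simple floor to the chunk duration. A minimal sketch, assuming nanosecond timestamps and the 8-hour windows used above (names are illustrative):

```python
CHUNK_DURATION_NS = 8 * 3600 * 10**9  # 8-hour chunks, nanosecond timestamps

def chunk_window(timestamp_ns: int) -> tuple[int, int]:
    """Return the [start, end) window of the chunk containing this timestamp."""
    start = (timestamp_ns // CHUNK_DURATION_NS) * CHUNK_DURATION_NS
    return start, start + CHUNK_DURATION_NS
```

Because every writer computes the same aligned boundaries, all points for a metric in the same window land in the same chunk without coordination.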

Timestamp Compression: Delta-of-Delta Encoding

Raw timestamps are 64-bit integers (nanoseconds since epoch). Storing them raw is wasteful for regular-interval series.

Delta encoding: store the difference between consecutive timestamps instead of absolute values. For a 15-second interval, every delta is 15,000,000,000 nanoseconds — a repeating constant.

Delta-of-delta encoding: store the difference between consecutive deltas. For a perfectly regular series, all delta-of-delta values are 0. These zeros are encoded with a small number of bits using a variable-length scheme:

  • Delta-of-delta = 0: encode as a single 0 bit.
  • Small delta-of-delta (within ±63): encode with 2-bit prefix + 7-bit value = 9 bits.
  • Larger values: progressively wider encodings.

In practice, this compresses timestamps for regular-interval series to ~1.4 bits per sample.
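A quick worked example makes the payoff concrete: for a perfectly regular 15-second series, everything after the header entries is zero. This is an illustrative sketch of the arithmetic, not the bit-level encoder:

```python
def delta_of_delta(timestamps_ns: list[int]) -> list[int]:
    """First timestamp as-is, then the first delta, then deltas of deltas."""
    deltas = [timestamps_ns[i] - timestamps_ns[i - 1] for i in range(1, len(timestamps_ns))]
    return [timestamps_ns[0], deltas[0]] + [
        deltas[i] - deltas[i - 1] for i in range(1, len(deltas))
    ]

ts = [i * 15 * 10**9 for i in range(6)]  # perfectly regular 15s samples
print(delta_of_delta(ts))  # [0, 15000000000, 0, 0, 0, 0]
```

Each trailing zero costs a single bit under the variable-length scheme above, which is where the ~1.4 bits/sample figure comes from.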

Value Compression: XOR (Gorilla Algorithm)

Consecutive float64 metric values (e.g., CPU utilisation over time) tend to be similar, meaning they share high-order bits.

Algorithm per value:

  1. XOR the current float64 with the previous float64.
  2. If the XOR result is 0 (the value repeated exactly), encode a single '0' control bit.
  3. Otherwise compute the leading zero count (lzc) and trailing zero count (tzc) of the XOR result.
  4. If the meaningful bits fit inside the previous block (lzc ≥ previous lzc and tzc ≥ previous tzc), encode control bits '10' and reuse the previous block position.
  5. Otherwise, encode control bits '11', then lzc (5 bits), block length (6 bits), and the meaningful bits of the XOR result.

For slowly changing metrics, meaningful blocks are 10–20 bits, versus 64 bits raw — a 3–6x compression ratio on values before applying a general-purpose compressor (e.g., Snappy or Zstd) on top.
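The bit-level effect is easy to see by XORing the IEEE-754 representations of two nearby readings. A sketch of the observation (not the full encoder), using two hypothetical sensor values:

```python
import struct

def float_bits(x: float) -> int:
    """Reinterpret a float64 as its 64-bit IEEE-754 pattern."""
    return struct.unpack('>Q', struct.pack('>d', x))[0]

prev, curr = 42.0, 42.5                       # two nearby sensor readings
xor = float_bits(prev) ^ float_bits(curr)
leading = 64 - xor.bit_length()               # leading zeros
trailing = (xor & -xor).bit_length() - 1      # trailing zeros
meaningful = 64 - leading - trailing
print(leading, trailing, meaningful)          # 17 46 1
```

Here 63 of the 64 XOR bits are zero, so the meaningful block is a single bit — the compressible structure the encoder exploits.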

Write Path

  1. Incoming data point (metric_id, timestamp, value) is routed to the active in-memory chunk for that metric.
  2. The data point is appended to the chunk's write buffer (uncompressed for fast append).
  3. A WAL entry is written for crash recovery; the write is acknowledged only after the WAL append succeeds.
  4. When the chunk's time window closes, the write buffer is compressed (delta-of-delta + XOR) and the sealed chunk is written to disk.
  5. WAL segment for that chunk is cleared.
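The ordering above can be sketched as follows, with an in-memory list standing in for an fsync'd WAL segment (all names here are hypothetical):

```python
_wal: list[tuple[int, int, float]] = []            # stand-in for an on-disk WAL segment
_buffers: dict[int, list[tuple[int, float]]] = {}  # metric_id -> active chunk buffer

def ingest(metric_id: int, timestamp_ns: int, value: float) -> bool:
    """Append to the active buffer and log to the WAL; ack only after both."""
    _buffers.setdefault(metric_id, []).append((timestamp_ns, value))
    _wal.append((metric_id, timestamp_ns, value))  # production: append + fsync to disk
    return True  # acknowledge the write

def recover() -> dict[int, list[tuple[int, float]]]:
    """Rebuild in-memory buffers from the WAL after a crash."""
    rebuilt: dict[int, list[tuple[int, float]]] = {}
    for metric_id, ts, val in _wal:
        rebuilt.setdefault(metric_id, []).append((ts, val))
    return rebuilt
```

On restart, replaying the WAL reconstructs every unsealed buffer; sealed chunks need no replay because they were durably flushed before their WAL segment was cleared.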

Read Path

  1. Chunk pruning: the query planner looks up the MetricChunk catalog for metric_id to identify which chunks overlap the requested time range. Chunks outside the range are skipped entirely.
  2. Decompression: selected chunks are decompressed — timestamps via delta-of-delta reversal, values via XOR reconstruction.
  3. Scan and filter: decompressed samples are scanned within the requested time range; aggregations (AVG, SUM, MAX) are applied.
  4. Tier routing: if the range spans multiple tiers, the query engine reads full-resolution data from hot, rollup data from warm/cold, and merges results.
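Step 1 reduces to an interval-overlap test against the catalog rows. A minimal sketch, where `chunks` stands in for the result of the MetricChunk lookup:

```python
def prune_chunks(chunks: list[tuple[int, int]], start: int, end: int) -> list[tuple[int, int]]:
    """Keep only chunks whose [chunk_start, chunk_end) window overlaps [start, end)."""
    return [(cs, ce) for cs, ce in chunks if cs < end and ce > start]

catalog = [(0, 100), (100, 200), (200, 300)]
print(prune_chunks(catalog, 150, 250))  # [(100, 200), (200, 300)]
```

With the `(metric_id, chunk_start, chunk_end)` index defined in the DDL below, the database evaluates this predicate without scanning chunk data at all.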

Continuous Aggregation and Downsampling

Continuous aggregation maintains pre-computed rollup tables that are incrementally updated as new data arrives. For example, a 1-minute rollup stores (MIN, MAX, SUM, COUNT) per metric per minute, computed from raw samples as they are ingested.

Downsampling for tiered retention: when a chunk ages out of the hot tier, a background job reads the raw chunk, computes rollup aggregates at the target resolution (e.g., 1 minute for warm, 1 hour for cold), and writes the rollup chunk to the lower tier. The original full-resolution chunk is then deleted.
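Note that rollups should carry SUM and COUNT rather than AVG, so that further downsampling stays exact: the hourly average is total SUM over total COUNT, not an average of minute averages. A sketch of merging minute buckets into a coarser bucket:

```python
def merge_rollups(buckets: list[dict]) -> dict:
    """Merge fine-grained rollup buckets into one coarser bucket."""
    return {
        'min': min(b['min'] for b in buckets),
        'max': max(b['max'] for b in buckets),
        'sum': sum(b['sum'] for b in buckets),
        'count': sum(b['count'] for b in buckets),
    }

minute = [{'min': 1.0, 'max': 3.0, 'sum': 4.0, 'count': 2},
          {'min': 0.5, 'max': 2.0, 'sum': 2.5, 'count': 5}]
hourly = merge_rollups(minute)
print(hourly['sum'] / hourly['count'])  # exact weighted average, not (2.0 + 0.5) / 2
```

MIN, MAX, SUM, and COUNT are all re-aggregable in this way, which is why they are the standard rollup columns.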

SQL DDL: Metadata Catalog

-- Chunk metadata catalog
CREATE TABLE MetricChunk (
    id              BIGSERIAL PRIMARY KEY,
    metric_id       BIGINT        NOT NULL,
    chunk_start     TIMESTAMPTZ   NOT NULL,
    chunk_end       TIMESTAMPTZ   NOT NULL,
    compressed_data BYTEA,            -- NULL for remote/tiered chunks
    row_count       INTEGER       NOT NULL DEFAULT 0,
    compressed_at   TIMESTAMPTZ,
    storage_tier    VARCHAR(16)   NOT NULL DEFAULT 'hot', -- hot, warm, cold
    storage_path    TEXT,             -- S3 path for cold tier
    UNIQUE (metric_id, chunk_start)
);

CREATE INDEX idx_chunk_metric_time ON MetricChunk (metric_id, chunk_start, chunk_end);

-- Retention tier configuration
CREATE TABLE RetentionTier (
    tier_name          VARCHAR(16)  PRIMARY KEY,
    resolution_seconds INTEGER      NOT NULL,  -- 1 = raw, 60 = 1min, 3600 = 1hr
    max_age_days       INTEGER      NOT NULL,
    storage_class      VARCHAR(32)  NOT NULL   -- ssd, hdd, s3_standard, s3_glacier
);

INSERT INTO RetentionTier VALUES
    ('hot',  1,    7,   'ssd'),
    ('warm', 60,   30,  'hdd'),
    ('cold', 3600, 365, 's3_standard');

Python: Core Operations

import struct
import time
from typing import Iterator

CHUNK_DURATION_SECONDS = 8 * 3600  # 8-hour chunks

# In-memory chunk buffer: metric_id -> list of (timestamp_ns, value)
_active_chunks: dict[int, list[tuple[int, float]]] = {}

def write_point(metric_id: int, timestamp: int, value: float) -> None:
    """Buffer a data point into the active in-memory chunk, sealing on window close."""
    chunk_ns = CHUNK_DURATION_SECONDS * 10**9
    buffer = _active_chunks.get(metric_id)
    if buffer:
        # Seal the active chunk when the new point falls past its time window.
        chunk_start = (buffer[0][0] // chunk_ns) * chunk_ns
        if timestamp >= chunk_start + chunk_ns:
            seal_chunk(metric_id)
    _active_chunks.setdefault(metric_id, []).append((timestamp, value))

def read_range(metric_id: int, start_ts: int, end_ts: int) -> list[tuple[int, float]]:
    """Read all data points for a metric within a nanosecond timestamp range."""
    # 1. Query MetricChunk catalog for overlapping chunks (DB call omitted for brevity)
    # 2. Decompress and scan each chunk
    results = []
    buffered = _active_chunks.get(metric_id, [])
    for ts, val in buffered:
        if start_ts <= ts <= end_ts:
            results.append((ts, val))
    return results

def seal_chunk(metric_id: int) -> bytes:
    """Compress the active chunk using delta-of-delta + XOR encoding."""
    points = _active_chunks.pop(metric_id, [])
    if not points:
        return b''
    timestamps = [p[0] for p in points]
    values = [p[1] for p in points]
    compressed_ts = _delta_of_delta_encode(timestamps)
    compressed_vals = _xor_encode(values)
    payload = compressed_ts + compressed_vals
    # Persist to MetricChunk table (DB call omitted)
    return payload

def downsample_chunk(chunk_id: int, resolution_seconds: int) -> list[dict]:
    """Aggregate a sealed chunk into rollup buckets of the given resolution."""
    # Load and decompress chunk (simplified)
    raw_points = []  # would be loaded from MetricChunk.compressed_data
    buckets: dict[int, list[float]] = {}
    for ts, val in raw_points:
        bucket = (ts // (resolution_seconds * 10**9)) * resolution_seconds * 10**9
        buckets.setdefault(bucket, []).append(val)
    return [
        {'bucket': b, 'min': min(vals), 'max': max(vals),
         'avg': sum(vals)/len(vals), 'count': len(vals)}
        for b, vals in sorted(buckets.items())
    ]

def _delta_of_delta_encode(timestamps: list[int]) -> bytes:
    """Simplified delta-of-delta encoding (returns raw bytes for illustration)."""
    if not timestamps:
        return b''
    deltas = [timestamps[0]] + [timestamps[i]-timestamps[i-1] for i in range(1,len(timestamps))]
    dod = [deltas[0]] + [deltas[i]-deltas[i-1] for i in range(1,len(deltas))]
    return struct.pack(f'>{len(dod)}q', *dod)

def _xor_encode(values: list[float]) -> bytes:
    """Simplified XOR encoding: store XORs of consecutive float64 bit patterns."""
    if not values:
        return b''
    bits = [struct.unpack('>Q', struct.pack('>d', v))[0] for v in values]
    residuals = [bits[0]] + [bits[i] ^ bits[i - 1] for i in range(1, len(bits))]
    return struct.pack(f'>{len(residuals)}Q', *residuals)  # production: bit-pack leading/trailing zeros

Design Considerations Summary

  • Chunk size: balance between metadata overhead (small chunks) and memory usage (large chunks); 2–8 hours is typical.
  • Compression: delta-of-delta + XOR achieves 10–40x compression on regular time-series before general-purpose compression.
  • Continuous aggregation: pre-compute rollups at ingest time to serve dashboard queries without scanning raw data.
  • Tiered retention: full resolution for operational queries (last 7 days), minute rollups for trend analysis (30 days), hourly rollups for long-term analytics (1 year).
  • Cardinality: high label cardinality (many unique metric_id combinations) explodes chunk count and metadata size; enforce label cardinality limits at ingest.


