LSM-Tree Storage Engine Low-Level Design: Memtable, SSTables, Compaction, and Read Path Optimization

LSM-tree (Log-Structured Merge-tree) is the storage engine architecture behind RocksDB, LevelDB, Cassandra, and HBase. It trades read amplification for dramatically higher write throughput by converting random writes into sequential I/O. This post covers the complete low-level design: write path, SSTable format, compaction strategies, bloom filters, and crash recovery.

Write Path

Every write follows two steps before acknowledging to the client:

  1. WAL append: The key-value pair is appended to the Write-Ahead Log on disk (O_DSYNC or fsync per batch). This ensures durability — if the process crashes, the WAL is replayed on restart to reconstruct in-memory state.
  2. Memtable insert: The key-value is inserted into the in-memory memtable, implemented as a sorted skiplist or AVL tree. Reads can be served immediately from the memtable after write.

When the memtable exceeds a size threshold (e.g., 64 MB), it is frozen into an immutable memtable and a new empty memtable is created. A background thread flushes the immutable memtable to disk as a new SSTable (Level 0).

SSTable Format

An SSTable (Sorted String Table) is an immutable on-disk file with this structure:

  • Data blocks: fixed-size blocks (e.g., 4 KB) of sorted key-value pairs
  • Block index: maps the first key of each block to its file offset — enables binary search to the right block
  • Bloom filter: probabilistic membership structure stored in the file footer — eliminates I/O for keys definitely not in this SSTable
  • Footer: offsets to the block index and bloom filter, plus checksum

SSTables are immutable once written. Updates and deletes are represented as new entries: a delete is a tombstone entry with a special marker value.

Compaction Strategies

Without compaction, the number of SSTables grows unboundedly, degrading read performance. Two main strategies:

Size-Tiered Compaction (STCS): When enough SSTables of similar size accumulate, merge them into a larger SSTable. Write amplification is low (each byte is written fewer times), but space amplification is high (multiple versions of the same key coexist during compaction).

Leveled Compaction (LCS): SSTables are organized into levels. L0 allows overlapping key ranges. L1 and above maintain non-overlapping key ranges within each level. When a level exceeds its size limit, one SSTable is compacted into the next level. Read amplification is lower (at most one SSTable per level needs to be checked), but write amplification is higher.

Bloom Filters

Before reading an SSTable to look up a key, the engine checks the per-SSTable bloom filter. If the bloom filter returns absent, the SSTable is skipped entirely — no disk I/O needed. This eliminates unnecessary reads for keys that do not exist, which is the most common case for point lookups in sparse datasets.

Bloom filters have no false negatives: if the key is in the SSTable, the bloom filter always returns possibly present. False positives cause unnecessary SSTable reads but do not cause incorrect results.

Read Path

A read for a given key traverses in order, returning on first match:

  1. Active memtable
  2. Immutable memtables (most recent first)
  3. SSTable levels, youngest (L0) to oldest (Lmax) — bloom filter checked before each SSTable

A block cache (LRU) stores recently accessed SSTable blocks in memory, dramatically improving repeated read performance for hot key ranges.

WAL Recovery

On restart after a crash, the engine replays the WAL from the last checkpoint. Each WAL entry is a key-value write. Replaying rebuilds the memtable state that existed at crash time. Once the memtable is reconstructed, the engine is ready to serve traffic. After a successful memtable flush to SSTable, the corresponding WAL segment is truncated.

SQL Schema (Metadata)

-- SSTable file registry
CREATE TABLE SSTableMeta (
    id                  UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    level               INTEGER NOT NULL,
    min_key             BYTEA NOT NULL,
    max_key             BYTEA NOT NULL,
    file_path           TEXT NOT NULL UNIQUE,
    size_bytes          BIGINT NOT NULL,
    bloom_filter_fp_rate FLOAT NOT NULL,
    created_at          TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_sst_level ON SSTableMeta(level, min_key, max_key);

-- Compaction job tracking
CREATE TABLE CompactionJob (
    id               UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    input_sstables   JSONB NOT NULL,   -- array of SSTableMeta IDs
    output_sstable_id UUID REFERENCES SSTableMeta(id),
    started_at       TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    completed_at     TIMESTAMPTZ,
    bytes_read       BIGINT,
    bytes_written    BIGINT
);

Python Interface

import os
import struct
import bisect
from typing import Optional

class BloomFilter:
    def __init__(self, m_bits: int, k_hashes: int):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def add(self, key: bytes) -> None:
        for seed in range(self.k):
            h = hash((key, seed)) % self.m
            self.bits[h // 8] |= (1 < bool:
        for seed in range(self.k):
            h = hash((key, seed)) % self.m
            if not (self.bits[h // 8] & (1 < None:
        self._size += len(key) + len(value)
        self._data[key] = value

    def read(self, key: bytes) -> Optional[bytes]:
        return self._data.get(key)

    def is_full(self) -> bool:
        return self._size >= self._threshold

    def sorted_items(self) -> list[tuple[bytes, bytes]]:
        return sorted(self._data.items())


class LSMEngine:
    TOMBSTONE = b"x00__TOMBSTONE__"

    def __init__(self, data_dir: str):
        self.data_dir = data_dir
        self.memtable = Memtable()
        self.immutable: list[Memtable] = []
        self.sstable_levels: list[list[dict]] = [[] for _ in range(7)]
        self._wal_path = os.path.join(data_dir, "wal.log")
        self._wal = open(self._wal_path, "ab")

    def write(self, key: bytes, value: bytes) -> None:
        """Append to WAL, then insert into memtable. Flush if memtable full."""
        entry = struct.pack(">I", len(key)) + key + struct.pack(">I", len(value)) + value
        self._wal.write(entry)
        self._wal.flush()
        self.memtable.write(key, value)
        if self.memtable.is_full():
            self.flush_memtable()

    def delete(self, key: bytes) -> None:
        """Write a tombstone entry."""
        self.write(key, self.TOMBSTONE)

    def read(self, key: bytes) -> Optional[bytes]:
        """Read from memtable, then immutables, then SSTable levels."""
        v = self.memtable.read(key)
        if v is not None:
            return None if v == self.TOMBSTONE else v
        for imm in reversed(self.immutable):
            v = imm.read(key)
            if v is not None:
                return None if v == self.TOMBSTONE else v
        for level in self.sstable_levels:
            for sst in reversed(level):
                if sst["bloom"].possibly_contains(key):
                    v = self._read_from_sstable(sst, key)
                    if v is not None:
                        return None if v == self.TOMBSTONE else v
        return None

    def flush_memtable(self) -> None:
        """Freeze active memtable and write it to an L0 SSTable."""
        imm = self.memtable
        self.memtable = Memtable()
        self.immutable.append(imm)
        sst = self._write_sstable(imm.sorted_items(), level=0)
        self.sstable_levels[0].append(sst)
        self.immutable.remove(imm)

    def compact_level(self, level: int) -> None:
        """Merge all SSTables at the given level into level+1."""
        if level >= len(self.sstable_levels) - 1:
            return
        all_items: dict[bytes, bytes] = {}
        for sst in self.sstable_levels[level]:
            for k, v in self._iter_sstable(sst):
                if k not in all_items:
                    all_items[k] = v
        new_sst = self._write_sstable(sorted(all_items.items()), level=level + 1)
        self.sstable_levels[level].clear()
        self.sstable_levels[level + 1].append(new_sst)

    def _write_sstable(self, items: list[tuple[bytes, bytes]], level: int) -> dict:
        import tempfile
        m_bits = max(1024, len(items) * 10)
        bloom = BloomFilter(m_bits=m_bits, k_hashes=7)
        for k, _ in items:
            bloom.add(k)
        path = os.path.join(self.data_dir, f"sst_L{level}_{id(items)}.sst")
        return {"path": path, "bloom": bloom, "items": dict(items), "level": level}

    def _read_from_sstable(self, sst: dict, key: bytes) -> Optional[bytes]:
        return sst["items"].get(key)

    def _iter_sstable(self, sst: dict):
        return sst["items"].items()

    def recover_from_wal(self) -> None:
        """On startup, replay WAL to reconstruct memtable."""
        if not os.path.exists(self._wal_path):
            return
        with open(self._wal_path, "rb") as f:
            while True:
                header = f.read(4)
                if len(header) I", header)[0]
                key = f.read(key_len)
                val_len = struct.unpack(">I", f.read(4))[0]
                val = f.read(val_len)
                self.memtable.write(key, val)

Design Trade-offs

Write amplification vs read amplification: LSM-trees have low write amplification at L0 (sequential flush) but compaction multiplies bytes written at deeper levels. Leveled compaction can have 10–30x write amplification for the deepest levels. STCS has lower write amplification but higher read amplification due to overlapping key ranges.

Bloom filter false positive rate: A false positive rate of 1% with k=7 hash functions requires approximately 9.6 bits per key. For 100M keys that is ~120 MB — worth the memory to eliminate the disk reads for absent keys.

WAL truncation: The WAL segment corresponding to a flushed memtable is safe to delete only after the SSTable is durably written and the engine metadata is updated. Premature truncation can cause data loss on recovery.

See also: Netflix Interview Guide 2026: Streaming Architecture, Recommendation Systems, and Engineering Excellence

See also: Databricks Interview Guide 2026: Spark Internals, Delta Lake, and Lakehouse Architecture

See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering

Scroll to Top