LSM-Tree Storage Engine Low-Level Design: Memtable, SSTables, Compaction, and Read Path Optimization

The LSM-tree (Log-Structured Merge-tree) is the storage engine architecture behind RocksDB, LevelDB, Cassandra, and HBase. It accepts higher read amplification in exchange for dramatically higher write throughput by converting random writes into sequential I/O. This post covers the complete low-level design: the write path, SSTable format, compaction strategies, bloom filters, and crash recovery.

Write Path

Every write follows two steps before being acknowledged to the client:

  1. WAL append: The key-value pair is appended to the Write-Ahead Log on disk (O_DSYNC or fsync per batch). This ensures durability — if the process crashes, the WAL is replayed on restart to reconstruct in-memory state.
  2. Memtable insert: The key-value is inserted into the in-memory memtable, implemented as a sorted skiplist or AVL tree. Reads can be served immediately from the memtable after write.

When the memtable exceeds a size threshold (e.g., 64 MB), it is frozen into an immutable memtable and a new empty memtable is created. A background thread flushes the immutable memtable to disk as a new SSTable (Level 0).

SSTable Format

An SSTable (Sorted String Table) is an immutable on-disk file with this structure:

  • Data blocks: fixed-size blocks (e.g., 4 KB) of sorted key-value pairs
  • Block index: maps the first key of each block to its file offset — enables binary search to the right block
  • Bloom filter: probabilistic membership structure stored in the file footer — eliminates I/O for keys definitely not in this SSTable
  • Footer: offsets to the block index and bloom filter, plus checksum

SSTables are immutable once written. Updates and deletes are represented as new entries: a delete is a tombstone entry with a special marker value.

Compaction Strategies

Without compaction, the number of SSTables grows without bound, degrading read performance. There are two main strategies:

Size-Tiered Compaction (STCS): When enough SSTables of similar size accumulate, merge them into a larger SSTable. Write amplification is low (each byte is written fewer times), but space amplification is high (multiple versions of the same key coexist during compaction).

Leveled Compaction (LCS): SSTables are organized into levels. L0 allows overlapping key ranges. L1 and above maintain non-overlapping key ranges within each level. When a level exceeds its size limit, one SSTable is compacted into the next level. Read amplification is lower (at most one SSTable per level needs to be checked), but write amplification is higher.
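The per-level size budgets that drive LCS can be sketched as follows. The constants here are assumptions, loosely modeled on LevelDB-style defaults (L1 around 10x the memtable, fanout of 10):

```python
# Illustrative leveled-compaction sizing; the specific constants are assumptions.
MEMTABLE_BYTES = 64 * 1024 * 1024       # 64 MB memtable, as in the write path above
L1_TARGET = 10 * MEMTABLE_BYTES         # hypothetical L1 budget: ~640 MB
FANOUT = 10                             # each level is 10x the one above it

def level_target_bytes(level: int) -> int:
    """Size budget for level >= 1; L0 is governed by file count, not bytes."""
    return L1_TARGET * FANOUT ** (level - 1)

def needs_compaction(level: int, current_bytes: int) -> bool:
    # A level exceeding its budget triggers compaction of one SSTable
    # into the next level down.
    return level >= 1 and current_bytes > level_target_bytes(level)
```

With these numbers, L2's budget is 10x L1's and L3's is 100x, which is why the deepest levels hold most of the data and dominate compaction I/O.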

Bloom Filters

Before reading an SSTable to look up a key, the engine checks the per-SSTable bloom filter. If the bloom filter returns absent, the SSTable is skipped entirely — no disk I/O needed. This eliminates unnecessary reads for keys that do not exist, which is the most common case for point lookups in sparse datasets.

Bloom filters have no false negatives: if the key is in the SSTable, the bloom filter always returns possibly present. False positives cause unnecessary SSTable reads but do not cause incorrect results.

Read Path

A read for a given key traverses in order, returning on first match:

  1. Active memtable
  2. Immutable memtables (most recent first)
  3. SSTable levels, youngest (L0) to oldest (Lmax) — bloom filter checked before each SSTable

A block cache (LRU) stores recently accessed SSTable blocks in memory, dramatically improving repeated read performance for hot key ranges.

WAL Recovery

On restart after a crash, the engine replays the WAL from the last checkpoint. Each WAL entry is a key-value write. Replaying rebuilds the memtable state that existed at crash time. Once the memtable is reconstructed, the engine is ready to serve traffic. After a successful memtable flush to SSTable, the corresponding WAL segment is truncated.

SQL Schema (Metadata)

-- SSTable file registry
CREATE TABLE SSTableMeta (
    id                  UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    level               INTEGER NOT NULL,
    min_key             BYTEA NOT NULL,
    max_key             BYTEA NOT NULL,
    file_path           TEXT NOT NULL UNIQUE,
    size_bytes          BIGINT NOT NULL,
    bloom_filter_fp_rate FLOAT NOT NULL,
    created_at          TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE INDEX idx_sst_level ON SSTableMeta(level, min_key, max_key);

-- Compaction job tracking
CREATE TABLE CompactionJob (
    id               UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    input_sstables   JSONB NOT NULL,   -- array of SSTableMeta IDs
    output_sstable_id UUID REFERENCES SSTableMeta(id),
    started_at       TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    completed_at     TIMESTAMPTZ,
    bytes_read       BIGINT,
    bytes_written    BIGINT
);

Python Interface

import os
import struct
import bisect
from typing import Optional

class BloomFilter:
    def __init__(self, m_bits: int, k_hashes: int):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def add(self, key: bytes) -> None:
        for seed in range(self.k):
            h = hash((key, seed)) % self.m
            self.bits[h // 8] |= (1 << (h % 8))

    def possibly_contains(self, key: bytes) -> bool:
        for seed in range(self.k):
            h = hash((key, seed)) % self.m
            if not (self.bits[h // 8] & (1 << (h % 8))):
                return False
        return True


class Memtable:
    def __init__(self, threshold_bytes: int = 64 * 1024 * 1024):
        self._data: dict[bytes, bytes] = {}
        self._size = 0
        self._threshold = threshold_bytes

    def write(self, key: bytes, value: bytes) -> None:
        self._size += len(key) + len(value)
        self._data[key] = value

    def read(self, key: bytes) -> Optional[bytes]:
        return self._data.get(key)

    def is_full(self) -> bool:
        return self._size >= self._threshold

    def sorted_items(self) -> list[tuple[bytes, bytes]]:
        return sorted(self._data.items())


class LSMEngine:
    TOMBSTONE = b"\x00__TOMBSTONE__"

    def __init__(self, data_dir: str):
        self.data_dir = data_dir
        self.memtable = Memtable()
        self.immutable: list[Memtable] = []
        self.sstable_levels: list[list[dict]] = [[] for _ in range(7)]
        self._wal_path = os.path.join(data_dir, "wal.log")
        self._wal = open(self._wal_path, "ab")

    def write(self, key: bytes, value: bytes) -> None:
        """Append to WAL, then insert into memtable. Flush if memtable full."""
        entry = struct.pack(">I", len(key)) + key + struct.pack(">I", len(value)) + value
        self._wal.write(entry)
        self._wal.flush()
        self.memtable.write(key, value)
        if self.memtable.is_full():
            self.flush_memtable()

    def delete(self, key: bytes) -> None:
        """Write a tombstone entry."""
        self.write(key, self.TOMBSTONE)

    def read(self, key: bytes) -> Optional[bytes]:
        """Read from memtable, then immutables, then SSTable levels."""
        v = self.memtable.read(key)
        if v is not None:
            return None if v == self.TOMBSTONE else v
        for imm in reversed(self.immutable):
            v = imm.read(key)
            if v is not None:
                return None if v == self.TOMBSTONE else v
        for level in self.sstable_levels:
            for sst in reversed(level):
                if sst["bloom"].possibly_contains(key):
                    v = self._read_from_sstable(sst, key)
                    if v is not None:
                        return None if v == self.TOMBSTONE else v
        return None

    def flush_memtable(self) -> None:
        """Freeze active memtable and write it to an L0 SSTable."""
        imm = self.memtable
        self.memtable = Memtable()
        self.immutable.append(imm)
        sst = self._write_sstable(imm.sorted_items(), level=0)
        self.sstable_levels[0].append(sst)
        self.immutable.remove(imm)

    def compact_level(self, level: int) -> None:
        """Merge all SSTables at the given level into level+1."""
        if level >= len(self.sstable_levels) - 1:
            return
        all_items: dict[bytes, bytes] = {}
        # Newest SSTables were appended last; scan newest-first so the first
        # version seen for each key is the most recent one.
        for sst in reversed(self.sstable_levels[level]):
            for k, v in self._iter_sstable(sst):
                if k not in all_items:
                    all_items[k] = v
        new_sst = self._write_sstable(sorted(all_items.items()), level=level + 1)
        self.sstable_levels[level].clear()
        self.sstable_levels[level + 1].append(new_sst)

    def _write_sstable(self, items: list[tuple[bytes, bytes]], level: int) -> dict:
        # Simplified: the SSTable is kept in memory here. A real engine would
        # serialize data blocks, the block index, and the bloom filter to `path`.
        m_bits = max(1024, len(items) * 10)  # ~10 bits/key => ~1% false positives
        bloom = BloomFilter(m_bits=m_bits, k_hashes=7)
        for k, _ in items:
            bloom.add(k)
        path = os.path.join(self.data_dir, f"sst_L{level}_{id(items)}.sst")
        return {"path": path, "bloom": bloom, "items": dict(items), "level": level}

    def _read_from_sstable(self, sst: dict, key: bytes) -> Optional[bytes]:
        return sst["items"].get(key)

    def _iter_sstable(self, sst: dict):
        return sst["items"].items()

    def recover_from_wal(self) -> None:
        """On startup, replay WAL to reconstruct memtable."""
        if not os.path.exists(self._wal_path):
            return
        with open(self._wal_path, "rb") as f:
            while True:
                header = f.read(4)
                if len(header) < 4:
                    break
                key_len = struct.unpack(">I", header)[0]
                key = f.read(key_len)
                val_len = struct.unpack(">I", f.read(4))[0]
                val = f.read(val_len)
                self.memtable.write(key, val)

Design Trade-offs

Write amplification vs read amplification: LSM-trees have low write amplification at L0 (sequential flush) but compaction multiplies bytes written at deeper levels. Leveled compaction can have 10–30x write amplification for the deepest levels. STCS has lower write amplification but higher read amplification due to overlapping key ranges.

Bloom filter false positive rate: A false positive rate of 1% with k=7 hash functions requires approximately 9.6 bits per key. For 100M keys that is ~120 MB — worth the memory to eliminate the disk reads for absent keys.
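These figures follow from the standard bloom filter sizing formulas, m/n = -ln p / (ln 2)^2 bits per key and k = (m/n) ln 2 hash functions:

```python
import math

def bloom_params(fp_rate: float) -> tuple[float, int]:
    """Return (bits per key, optimal hash count) for a target false positive rate."""
    bits_per_key = -math.log(fp_rate) / (math.log(2) ** 2)
    k = round(bits_per_key * math.log(2))
    return bits_per_key, k

bits, k = bloom_params(0.01)
# ~9.59 bits/key with k = 7 at a 1% false positive rate
total_mb = bits * 100_000_000 / 8 / 1e6  # ~120 MB of filter for 100M keys
```

Halving memory roughly squares the false positive rate in the other direction: at ~4.8 bits per key the rate climbs to around 10%, which is why engines rarely go below the 1% regime for point-lookup-heavy workloads.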

WAL truncation: The WAL segment corresponding to a flushed memtable is safe to delete only after the SSTable is durably written and the engine metadata is updated. Premature truncation can cause data loss on recovery.
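A sketch of that ordering follows. The function name, toy record format, and use of a flat manifest file are illustrative assumptions; only the fsync-before-truncate ordering is the point:

```python
import os

def flush_then_truncate(memtable_items, sst_path: str, manifest_path: str, wal_path: str) -> None:
    """Illustrative safe ordering: SSTable fsync -> manifest fsync -> WAL truncate.

    Truncating the WAL any earlier risks losing the flushed entries on crash.
    """
    # 1. Write and durably sync the SSTable before anything else.
    with open(sst_path, "wb") as f:
        for k, v in memtable_items:
            f.write(k + b"\x00" + v + b"\n")  # toy record format, not a real block layout
        f.flush()
        os.fsync(f.fileno())
    # 2. Record the new SSTable in the manifest and sync that too.
    with open(manifest_path, "ab") as f:
        f.write(sst_path.encode() + b"\n")
        f.flush()
        os.fsync(f.fileno())
    # 3. Only now is the WAL segment redundant and safe to drop.
    os.truncate(wal_path, 0)
```

If the process crashes between steps 1 and 3, recovery replays a WAL that merely duplicates data already in the SSTable, which is harmless; the reverse ordering would lose it.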

