Low-Level Design: LSM-Tree Internals — Log-Structured Merge Trees

The Log-Structured Merge-tree (LSM-tree) is a data structure optimized for write-heavy workloads. Unlike B-trees that update pages in place (causing random writes to disk), LSM-trees convert all writes to sequential appends, dramatically improving write throughput on spinning disks and even SSDs. LSM-trees power RocksDB, LevelDB, Cassandra, HBase, InfluxDB, and many other databases. Understanding LSM-trees is critical for explaining write amplification, read amplification, and the fundamental storage trade-offs in distributed databases.

Write Path: Memtable + WAL

Every write goes to two places simultaneously: the Write-Ahead Log (WAL, sequential append to disk for durability) and the in-memory sorted structure called the memtable (a skip list or red-black tree keeping keys sorted). The write returns as soon as both are written — no disk seek required. When the memtable reaches a size threshold (e.g., 64MB), it is frozen and flushed to disk as an immutable sorted file called an SSTable (Sorted String Table). A new memtable is then allocated. Reads check the memtable first (most recent data), then SSTables from newest to oldest.

// LSM-tree write path (sketch: only the WAL's error is handled here)
func (lsm *LSMTree) Put(key, value []byte) error {
    // 1. Append to WAL for durability — if this fails, the write must not
    //    be acknowledged, so propagate the error
    if err := lsm.wal.Append(key, value); err != nil {
        return err
    }

    // 2. Insert into active memtable (skip list, sorted)
    lsm.memtable.Put(key, value)

    // 3. If memtable exceeds threshold, freeze it and flush to an SSTable
    if lsm.memtable.Size() > lsm.config.MemtableSize {
        go lsm.flushMemtable() // async flush; a fresh memtable takes new writes
    }
    return nil
}

// SSTable: sorted, immutable, compressed file on disk
// Format: [data block][data block]...[index block][bloom filter][footer]
// Index block: sparse index (first key of each data block)
// Bloom filter: probabilistic membership test to skip SSTables
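The sparse index lookup mentioned above can be sketched as a binary search over block entries. The `blockIndexEntry` type and `findBlock` helper here are hypothetical, not part of any real SSTable library: since the index stores only the first key of each data block, the block that may contain a key is the last one whose first key sorts at or before it.

```go
package main

import (
	"fmt"
	"sort"
)

// blockIndexEntry is a hypothetical sparse-index entry: the first key of one
// data block and that block's byte offset in the SSTable file.
type blockIndexEntry struct {
	firstKey string
	offset   int64
}

// findBlock returns the offset of the data block that may contain key:
// the last block whose first key is <= key. Returns -1 if key sorts
// before every block (so it cannot be in this SSTable at all).
func findBlock(index []blockIndexEntry, key string) int64 {
	// sort.Search finds the first entry whose firstKey is > key.
	i := sort.Search(len(index), func(i int) bool {
		return index[i].firstKey > key
	})
	if i == 0 {
		return -1 // key precedes the first block
	}
	return index[i-1].offset
}

func main() {
	index := []blockIndexEntry{
		{"apple", 0}, {"mango", 4096}, {"zebra", 8192},
	}
	fmt.Println(findBlock(index, "banana")) // falls in the "apple" block → 0
	fmt.Println(findBlock(index, "mango"))  // exact match on a block start → 4096
	fmt.Println(findBlock(index, "aaa"))    // before every block → -1
}
```

One disk read fetches the chosen data block, which is then searched exactly (the sparse index can only say "maybe here", never "definitely here").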

SSTable Format and Compaction

SSTables accumulate on disk over time. Without compaction, a key lookup must scan all SSTables (O(files) disk reads). Compaction merges multiple SSTables into larger sorted files, removing stale versions of keys and tombstones (deletion markers). There are two main compaction strategies:

  • Leveled compaction (RocksDB default): SSTables are organized in levels (L0, L1, L2…). L0 holds SSTables recently flushed from memtables; each deeper level is ~10x larger than the one above. Compaction merges SSTables from one level into the next. Read amplification is bounded (at most one SSTable to check per level below L0), at the cost of higher write amplification.
  • Size-tiered compaction (Cassandra default): SSTables of similar size are merged together. Lower write amplification, but higher read amplification and space amplification.
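The core of any compaction strategy is a merge of sorted runs. The sketch below (with a hypothetical in-memory `entry` type standing in for on-disk SSTable records) merges two runs, keeps only the newest version of each key, and — when merging into the bottom level, where no older version can remain — drops tombstones themselves:

```go
package main

import "fmt"

// entry is one key/value pair in a sorted run; tombstone marks a delete.
// An illustrative in-memory stand-in for on-disk SSTable records.
type entry struct {
	key       string
	value     string
	tombstone bool
}

// compact merges two sorted runs, with `newer` winning on equal keys.
// When bottomLevel is true, a tombstone has already shadowed every older
// version, so the marker itself can finally be discarded.
func compact(newer, older []entry, bottomLevel bool) []entry {
	var out []entry
	emit := func(e entry) {
		if !(bottomLevel && e.tombstone) {
			out = append(out, e)
		}
	}
	i, j := 0, 0
	for i < len(newer) && j < len(older) {
		switch {
		case newer[i].key < older[j].key:
			emit(newer[i])
			i++
		case newer[i].key > older[j].key:
			emit(older[j])
			j++
		default: // same key: newer version wins, stale one is dropped
			emit(newer[i])
			i++
			j++
		}
	}
	for ; i < len(newer); i++ {
		emit(newer[i])
	}
	for ; j < len(older); j++ {
		emit(older[j])
	}
	return out
}

func main() {
	newer := []entry{{"a", "", true}, {"c", "3", false}} // "a" was deleted
	older := []entry{{"a", "1", false}, {"b", "2", false}}
	// Bottom-level merge: a's tombstone and stale value both vanish.
	fmt.Println(compact(newer, older, true))
}
```

The same linear merge generalizes to a k-way heap merge over many input SSTables; output stays sorted, so the result is immediately a valid SSTable.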

Read Path and Bloom Filters

A point read (get by key) checks in order: active memtable, immutable memtables being flushed, then SSTables from newest to oldest. Without optimization, this is O(number of SSTables) disk reads — expensive. Bloom filters solve this: each SSTable has a Bloom filter of its keys. Before searching an SSTable, check the filter. If it returns negative, skip the SSTable entirely (zero disk I/O). If positive, search the SSTable index and data block. This reduces read amplification from O(SSTables) to near O(1) for most point reads. Range reads still require scanning all SSTables.
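The filter check described above can be sketched as a minimal Bloom filter. The `bloom` type is hypothetical; real engines such as RocksDB size the bit array from a target false-positive rate and use stronger hash mixing, but the contract is the same: no false negatives, occasional false positives.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bloom is a minimal Bloom filter sketch: k hash probes into a bit array.
type bloom struct {
	bits []bool
	k    uint32
}

func newBloom(m int, k uint32) *bloom {
	return &bloom{bits: make([]bool, m), k: k}
}

// probe derives the i-th bit position for key via double hashing:
// two 32-bit halves of one 64-bit FNV hash combined as h1 + i*h2.
func (b *bloom) probe(key string, i uint32) uint32 {
	h := fnv.New64a()
	h.Write([]byte(key))
	sum := h.Sum64()
	h1, h2 := uint32(sum), uint32(sum>>32)
	return (h1 + i*h2) % uint32(len(b.bits))
}

func (b *bloom) Add(key string) {
	for i := uint32(0); i < b.k; i++ {
		b.bits[b.probe(key, i)] = true
	}
}

// MayContain returns false only if key was definitely never added,
// letting a read skip this SSTable with zero disk I/O.
func (b *bloom) MayContain(key string) bool {
	for i := uint32(0); i < b.k; i++ {
		if !b.bits[b.probe(key, i)] {
			return false
		}
	}
	return true
}

func main() {
	f := newBloom(1024, 4)
	f.Add("user:42")
	fmt.Println(f.MayContain("user:42"))  // always true for added keys
	fmt.Println(f.MayContain("user:999")) // very likely false — SSTable skipped
}
```

A false positive only costs one wasted index lookup; a false negative would lose data, which is why the filter is built so that it can never occur.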

Amplification Trade-offs

LSM-trees have three amplification factors:

  • Write amplification: each byte written to the database may be rewritten multiple times as it moves down through compaction levels. Leveled compaction has higher write amplification (10-30x) than size-tiered (5-10x).
  • Read amplification: a point read may check multiple SSTables. Leveled compaction bounds this (at most one SSTable per level); size-tiered can be worse.
  • Space amplification: stale key versions in multiple SSTables consume extra disk space until compaction removes them. Size-tiered has higher space amplification.

No strategy minimizes all three simultaneously — choose based on workload (write-heavy = size-tiered; read-heavy = leveled).
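The write-amplification range quoted above can be motivated with rough arithmetic. This back-of-the-envelope estimate (an assumption for illustration, not a formula from any engine's documentation) charges each byte one WAL write, one L0 flush, and up to `fanout` rewrites per level crossed, so it is a worst-case bound — real workloads land well below it:

```go
package main

import "fmt"

// leveledWriteAmp is a rough worst-case estimate of leveled-compaction
// write amplification: 1x for the WAL, 1x for the L0 flush, and up to
// `fanout` rewrites for each of the (levels-1) downward merges, since a
// file pushed into the next level overlaps ~fanout files there.
func leveledWriteAmp(levels, fanout int) int {
	return 1 /* WAL */ + 1 /* L0 flush */ + (levels-1)*fanout
}

func main() {
	// e.g. 5 levels with 10x fanout: 1 + 1 + 4*10 = 42x worst case,
	// which is why observed figures of 10-30x are plausible in practice.
	fmt.Println(leveledWriteAmp(5, 10))
}
```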

Key Interview Discussion Points

  • LSM-tree vs. B-tree: B-tree has lower read amplification (one tree traversal) but higher write amplification (in-place updates cause random writes); LSM-tree has lower write amplification but higher read amplification without Bloom filters
  • Tombstones: deletes insert a tombstone marker; the key is not actually removed until compaction clears all older versions — tombstone accumulation degrades read performance
  • RocksDB tuning: block cache size (caches decompressed data blocks in memory), compression (zstd for cold data, lz4 for hot data), compaction style, and L0 file count trigger are the key parameters
  • Tiered storage: pin L0/L1 SSTables on NVMe, L2/L3 on HDD, L4+ on S3 — the compaction level hierarchy maps naturally onto storage tiers
  • Wide column stores (Cassandra, HBase): the partition key routes a row to a node, the clustering key sorts rows within the SSTable — the LSM-tree is the per-node storage engine