Low-Level Design: LSM-Tree Internals — Log-Structured Merge Trees

The Log-Structured Merge-tree (LSM-tree) is a data structure optimized for write-heavy workloads. Unlike B-trees that update pages in place (causing random writes to disk), LSM-trees convert all writes to sequential appends, dramatically improving write throughput on spinning disks and even SSDs. LSM-trees power RocksDB, LevelDB, Cassandra, HBase, InfluxDB, and many other databases. Understanding LSM-trees is critical for explaining write amplification, read amplification, and the fundamental storage trade-offs in distributed databases.

Write Path: Memtable + WAL

Every write goes to two places simultaneously: the Write-Ahead Log (WAL, sequential append to disk for durability) and the in-memory sorted structure called the memtable (a skip list or red-black tree keeping keys sorted). The write returns as soon as both are written — no disk seek required. When the memtable reaches a size threshold (e.g., 64MB), it is frozen and flushed to disk as an immutable sorted file called an SSTable (Sorted String Table). A new memtable is then allocated. Reads check the memtable first (most recent data), then SSTables from newest to oldest.

// LSM-tree write path (sketch: only the WAL's error is handled here)
func (lsm *LSMTree) Put(key, value []byte) error {
    // 1. Append to WAL for durability — if this fails, the write must not
    //    be acknowledged, so propagate the error
    if err := lsm.wal.Append(key, value); err != nil {
        return err
    }

    // 2. Insert into active memtable (skip list, sorted)
    lsm.memtable.Put(key, value)

    // 3. If memtable exceeds threshold, freeze it and flush to an SSTable
    if lsm.memtable.Size() > lsm.config.MemtableSize {
        go lsm.flushMemtable() // async flush; a fresh memtable takes new writes
    }
    return nil
}

// SSTable: sorted, immutable, compressed file on disk
// Format: [data block][data block]...[index block][bloom filter][footer]
// Index block: sparse index (first key of each data block)
// Bloom filter: probabilistic membership test to skip SSTables
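The sparse index lookup mentioned above can be sketched as a binary search over block entries. The `blockIndexEntry` type and `findBlock` helper here are hypothetical, not part of any real SSTable library: since the index stores only the first key of each data block, the block that may contain a key is the last one whose first key sorts at or before it.

```go
package main

import (
	"fmt"
	"sort"
)

// blockIndexEntry is a hypothetical sparse-index entry: the first key of one
// data block and that block's byte offset in the SSTable file.
type blockIndexEntry struct {
	firstKey string
	offset   int64
}

// findBlock returns the offset of the data block that may contain key:
// the last block whose first key is <= key. Returns -1 if key sorts
// before every block (so it cannot be in this SSTable at all).
func findBlock(index []blockIndexEntry, key string) int64 {
	// sort.Search finds the first entry whose firstKey is > key.
	i := sort.Search(len(index), func(i int) bool {
		return index[i].firstKey > key
	})
	if i == 0 {
		return -1 // key precedes the first block
	}
	return index[i-1].offset
}

func main() {
	index := []blockIndexEntry{
		{"apple", 0}, {"mango", 4096}, {"zebra", 8192},
	}
	fmt.Println(findBlock(index, "banana")) // falls in the "apple" block → 0
	fmt.Println(findBlock(index, "mango"))  // exact match on a block start → 4096
	fmt.Println(findBlock(index, "aaa"))    // before every block → -1
}
```

One disk read fetches the chosen data block, which is then searched exactly (the sparse index can only say "maybe here", never "definitely here").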

SSTable Format and Compaction

SSTables accumulate on disk over time. Without compaction, a key lookup must scan all SSTables (O(files) disk reads). Compaction merges multiple SSTables into larger sorted files, removing stale versions of keys and tombstones (deletion markers). There are two main compaction strategies:

  • Leveled compaction (RocksDB default): SSTables are organized in levels (L0, L1, L2…). L0 holds SSTables recently flushed from memtables; each deeper level is ~10x larger than the one above. Compaction merges SSTables from one level into the next. Read amplification is bounded (at most one SSTable to check per level below L0), at the cost of higher write amplification.
  • Size-tiered compaction (Cassandra default): SSTables of similar size are merged together. Lower write amplification, but higher read amplification and space amplification.
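The core of any compaction strategy is a merge of sorted runs. The sketch below (with a hypothetical in-memory `entry` type standing in for on-disk SSTable records) merges two runs, keeps only the newest version of each key, and — when merging into the bottom level, where no older version can remain — drops tombstones themselves:

```go
package main

import "fmt"

// entry is one key/value pair in a sorted run; tombstone marks a delete.
// An illustrative in-memory stand-in for on-disk SSTable records.
type entry struct {
	key       string
	value     string
	tombstone bool
}

// compact merges two sorted runs, with `newer` winning on equal keys.
// When bottomLevel is true, a tombstone has already shadowed every older
// version, so the marker itself can finally be discarded.
func compact(newer, older []entry, bottomLevel bool) []entry {
	var out []entry
	emit := func(e entry) {
		if !(bottomLevel && e.tombstone) {
			out = append(out, e)
		}
	}
	i, j := 0, 0
	for i < len(newer) && j < len(older) {
		switch {
		case newer[i].key < older[j].key:
			emit(newer[i])
			i++
		case newer[i].key > older[j].key:
			emit(older[j])
			j++
		default: // same key: newer version wins, stale one is dropped
			emit(newer[i])
			i++
			j++
		}
	}
	for ; i < len(newer); i++ {
		emit(newer[i])
	}
	for ; j < len(older); j++ {
		emit(older[j])
	}
	return out
}

func main() {
	newer := []entry{{"a", "", true}, {"c", "3", false}} // "a" was deleted
	older := []entry{{"a", "1", false}, {"b", "2", false}}
	// Bottom-level merge: a's tombstone and stale value both vanish.
	fmt.Println(compact(newer, older, true))
}
```

The same linear merge generalizes to a k-way heap merge over many input SSTables; output stays sorted, so the result is immediately a valid SSTable.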

Read Path and Bloom Filters

A point read (get by key) checks in order: active memtable, immutable memtables being flushed, then SSTables from newest to oldest. Without optimization, this is O(number of SSTables) disk reads — expensive. Bloom filters solve this: each SSTable has a Bloom filter of its keys. Before searching an SSTable, check the filter. If it returns negative, skip the SSTable entirely (zero disk I/O). If positive, search the SSTable index and data block. This reduces read amplification from O(SSTables) to near O(1) for most point reads. Range reads still require scanning all SSTables.
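The filter check described above can be sketched as a minimal Bloom filter. The `bloom` type is hypothetical; real engines such as RocksDB size the bit array from a target false-positive rate and use stronger hash mixing, but the contract is the same: no false negatives, occasional false positives.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bloom is a minimal Bloom filter sketch: k hash probes into a bit array.
type bloom struct {
	bits []bool
	k    uint32
}

func newBloom(m int, k uint32) *bloom {
	return &bloom{bits: make([]bool, m), k: k}
}

// probe derives the i-th bit position for key via double hashing:
// two 32-bit halves of one 64-bit FNV hash combined as h1 + i*h2.
func (b *bloom) probe(key string, i uint32) uint32 {
	h := fnv.New64a()
	h.Write([]byte(key))
	sum := h.Sum64()
	h1, h2 := uint32(sum), uint32(sum>>32)
	return (h1 + i*h2) % uint32(len(b.bits))
}

func (b *bloom) Add(key string) {
	for i := uint32(0); i < b.k; i++ {
		b.bits[b.probe(key, i)] = true
	}
}

// MayContain returns false only if key was definitely never added,
// letting a read skip this SSTable with zero disk I/O.
func (b *bloom) MayContain(key string) bool {
	for i := uint32(0); i < b.k; i++ {
		if !b.bits[b.probe(key, i)] {
			return false
		}
	}
	return true
}

func main() {
	f := newBloom(1024, 4)
	f.Add("user:42")
	fmt.Println(f.MayContain("user:42"))  // always true for added keys
	fmt.Println(f.MayContain("user:999")) // very likely false — SSTable skipped
}
```

A false positive only costs one wasted index lookup; a false negative would lose data, which is why the filter is built so that it can never occur.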

Amplification Trade-offs

LSM-trees have three amplification factors:

  • Write amplification: each byte written to the database may be rewritten multiple times as it moves down through compaction levels. Leveled compaction has higher write amplification (10-30x) than size-tiered (5-10x).
  • Read amplification: a point read may check multiple SSTables. Leveled compaction bounds this (at most one SSTable per level); size-tiered can be worse.
  • Space amplification: stale key versions in multiple SSTables consume extra disk space until compaction removes them. Size-tiered has higher space amplification.

No strategy minimizes all three simultaneously — choose based on workload (write-heavy = size-tiered; read-heavy = leveled).
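The write-amplification range quoted above can be motivated with rough arithmetic. This back-of-the-envelope estimate (an assumption for illustration, not a formula from any engine's documentation) charges each byte one WAL write, one L0 flush, and up to `fanout` rewrites per level crossed, so it is a worst-case bound — real workloads land well below it:

```go
package main

import "fmt"

// leveledWriteAmp is a rough worst-case estimate of leveled-compaction
// write amplification: 1x for the WAL, 1x for the L0 flush, and up to
// `fanout` rewrites for each of the (levels-1) downward merges, since a
// file pushed into the next level overlaps ~fanout files there.
func leveledWriteAmp(levels, fanout int) int {
	return 1 /* WAL */ + 1 /* L0 flush */ + (levels-1)*fanout
}

func main() {
	// e.g. 5 levels with 10x fanout: 1 + 1 + 4*10 = 42x worst case,
	// which is why observed figures of 10-30x are plausible in practice.
	fmt.Println(leveledWriteAmp(5, 10))
}
```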

Key Interview Discussion Points

  • LSM-tree vs. B-tree: B-tree has lower read amplification (one tree traversal) but higher write amplification (in-place updates cause random writes); LSM-tree has lower write amplification but higher read amplification without Bloom filters
  • Tombstones: deletes insert a tombstone marker; the key is not actually removed until compaction clears all older versions — tombstone accumulation degrades read performance
  • RocksDB tuning: block cache size (caches decompressed data blocks in memory), compression (zstd for cold data, lz4 for hot data), compaction style, and L0 file count trigger are the key parameters
  • Tiered storage: pin L0/L1 SSTables on NVMe, L2/L3 on HDD, L4+ on S3 — the compaction level hierarchy maps naturally onto storage tiers
  • Wide column stores (Cassandra, HBase): the partition key routes a row to a node, the clustering key sorts rows within the SSTable — the LSM-tree is the per-node storage engine