The Log-Structured Merge-tree (LSM-tree) is a data structure optimized for write-heavy workloads. Unlike B-trees that update pages in place (causing random writes to disk), LSM-trees convert all writes to sequential appends, dramatically improving write throughput on spinning disks and even SSDs. LSM-trees power RocksDB, LevelDB, Cassandra, HBase, InfluxDB, and many other databases. Understanding LSM-trees is critical for explaining write amplification, read amplification, and the fundamental storage trade-offs in distributed databases.
Write Path: Memtable + WAL
Every write goes to two places simultaneously: the Write-Ahead Log (WAL, sequential append to disk for durability) and the in-memory sorted structure called the memtable (a skip list or red-black tree keeping keys sorted). The write returns as soon as both are written — no disk seek required. When the memtable reaches a size threshold (e.g., 64MB), it is frozen and flushed to disk as an immutable sorted file called an SSTable (Sorted String Table). A new memtable is then allocated. Reads check the memtable first (most recent data), then SSTables from newest to oldest.
// LSM-tree write path
func (lsm *LSMTree) Put(key, value []byte) error {
    // 1. Append to WAL for durability; fail the write if the log append fails
    if err := lsm.wal.Append(key, value); err != nil {
        return err
    }
    // 2. Insert into active memtable (skip list, kept sorted by key)
    lsm.memtable.Put(key, value)
    // 3. If memtable exceeds threshold, freeze it and flush asynchronously
    if lsm.memtable.Size() > lsm.config.MemtableSize {
        go lsm.flushMemtable() // swaps in a fresh memtable, writes an SSTable
    }
    return nil
}
// SSTable: sorted, immutable, compressed file on disk
// Format: [data block][data block]...[index block][bloom filter][footer]
// Index block: sparse index (first key of each data block)
// Bloom filter: probabilistic membership test to skip SSTables
SSTable Format and Compaction
SSTables accumulate on disk over time. Without compaction, a key lookup must scan every SSTable (O(files) disk reads). Compaction merges multiple SSTables into larger sorted files, removing stale versions of keys and tombstones (deletion markers). There are two main compaction strategies:
- Leveled compaction (RocksDB default): SSTables are organized into levels (L0, L1, L2…). L0 holds SSTables recently flushed from memtables; each deeper level is roughly 10x larger. Compaction merges SSTables from one level into the next. Read amplification is bounded (at most one SSTable to check per level below L0), at the cost of higher write amplification.
- Size-tiered compaction (Cassandra default): SSTables of similar size are merged together. Lower write amplification, but higher read amplification and space amplification.
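The core of either strategy is a sorted merge of immutable runs. A minimal sketch of merging two sorted SSTable runs, where the newer version of a key shadows the older one and tombstones are discarded only at the bottom level (names like `mergeRuns` are illustrative, not from any real engine):

```go
package main

import "fmt"

type entry struct {
	key, value string
	tombstone  bool
}

// mergeRuns merges two sorted runs (newer, older) into one sorted run.
// On duplicate keys the newer entry wins; a tombstone is dropped only
// when compacting into the bottom level, where no older version of the
// key can still exist in a deeper file.
func mergeRuns(newer, older []entry, bottomLevel bool) []entry {
	var out []entry
	emit := func(e entry) {
		if e.tombstone && bottomLevel {
			return // safe to discard: nothing older remains to shadow
		}
		out = append(out, e)
	}
	i, j := 0, 0
	for i < len(newer) && j < len(older) {
		switch {
		case newer[i].key < older[j].key:
			emit(newer[i])
			i++
		case newer[i].key > older[j].key:
			emit(older[j])
			j++
		default: // same key: newer version shadows the older one
			emit(newer[i])
			i++
			j++
		}
	}
	for ; i < len(newer); i++ {
		emit(newer[i])
	}
	for ; j < len(older); j++ {
		emit(older[j])
	}
	return out
}

func main() {
	newer := []entry{{key: "a", value: "2"}, {key: "c", tombstone: true}}
	older := []entry{{key: "a", value: "1"}, {key: "b", value: "1"}, {key: "c", value: "1"}}
	for _, e := range mergeRuns(newer, older, true) {
		fmt.Println(e.key, e.value)
	}
}
```

Real compactions stream many runs through a k-way merge heap rather than holding them in memory, but the shadowing and tombstone rules are the same.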
Read Path and Bloom Filters
A point read (get by key) checks in order: active memtable, immutable memtables being flushed, then SSTables from newest to oldest. Without optimization, this is O(number of SSTables) disk reads — expensive. Bloom filters solve this: each SSTable has a Bloom filter of its keys. Before searching an SSTable, check the filter. If it returns negative, skip the SSTable entirely (zero disk I/O). If positive, search the SSTable index and data block. This reduces read amplification from O(SSTables) to near O(1) for most point reads. Range reads still require scanning all SSTables.
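The read path above can be sketched as follows. The types here (`bloomFilter`, `sstable`, a map standing in for index and data blocks) are simplified assumptions for illustration, not a real engine's API:

```go
package main

import "fmt"

// bloomFilter's only contract: a false return is a guaranteed miss,
// a true return means "maybe present" (false positives allowed).
type bloomFilter interface{ MayContain(key string) bool }

type setFilter map[string]bool

func (f setFilter) MayContain(k string) bool { return f[k] }

type sstable struct {
	filter bloomFilter
	data   map[string]string // stands in for the sparse index + data blocks
}

type lsm struct {
	memtable map[string]string
	tables   []sstable // ordered newest to oldest
}

// Get follows the LSM read path: memtable first, then SSTables newest
// to oldest, consulting each table's Bloom filter so that tables which
// definitely lack the key cost zero disk reads.
func (db *lsm) Get(key string) (string, bool) {
	if v, ok := db.memtable[key]; ok {
		return v, true
	}
	for _, t := range db.tables {
		if !t.filter.MayContain(key) {
			continue // definite miss: skip this SSTable entirely
		}
		if v, ok := t.data[key]; ok {
			return v, true
		}
		// false positive: filter said maybe, data block said no
	}
	return "", false
}

func main() {
	db := &lsm{
		memtable: map[string]string{"hot": "1"},
		tables: []sstable{
			{filter: setFilter{"cold": true}, data: map[string]string{"cold": "2"}},
		},
	}
	fmt.Println(db.Get("cold"))
}
```

Note that returning on the first hit is what gives LSM reads their "newest version wins" semantics without any per-key version bookkeeping.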
Amplification Trade-offs
LSM-trees have three amplification factors:
- Write amplification: each byte written to the database may be rewritten multiple times as it moves through compaction levels. Leveled compaction has higher write amplification (10-30x) than size-tiered (5-10x).
- Read amplification: a point read may check multiple SSTables. Leveled compaction bounds this (one SSTable per level); size-tiered can be worse.
- Space amplification: stale key versions in multiple SSTables consume extra disk space until compaction removes them. Size-tiered has higher space amplification.
No strategy minimizes all three simultaneously; choose based on workload (write-heavy = size-tiered; read-heavy = leveled).
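A crude back-of-envelope model makes the write-amplification numbers concrete. The linear formulas below are illustrative assumptions for intuition only; real amplification depends on key distribution, file overlap, and compaction scheduling:

```go
package main

import "fmt"

// leveledWA: 1x to the WAL, 1x flushing to L0, then roughly `fanout`
// rewrites each time data moves down a level (a byte entering L(i+1)
// drags along ~fanout bytes of overlapping data). Illustrative model,
// not a RocksDB formula.
func leveledWA(levels, fanout int) int {
	return 2 + fanout*(levels-1)
}

// sizeTieredWA: each size-tiered merge round rewrites every live byte
// roughly once, on top of the WAL write and initial flush.
func sizeTieredWA(mergeRounds int) int {
	return 2 + mergeRounds
}

func main() {
	// 4 levels with a 10x fanout lands in the 10-30x ballpark cited above.
	fmt.Println(leveledWA(4, 10))
	fmt.Println(sizeTieredWA(4))
}
```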
Key Interview Discussion Points
- LSM-tree vs. B-tree: B-tree has lower read amplification (one tree traversal) but higher write amplification (in-place updates cause random writes); LSM-tree has lower write amplification but higher read amplification without Bloom filters
- Tombstones: deletes insert a tombstone marker; the key is not actually removed until compaction clears all older versions — tombstone accumulation degrades read performance
- RocksDB tuning: block cache size (caches decompressed data blocks in memory), compression (zstd for cold data, lz4 for hot data), compaction style, and L0 file count trigger are the key parameters
- Tiered storage: pin L0/L1 SSTables on NVMe, L2/L3 on HDD, L4+ on S3; the compaction level hierarchy maps naturally onto storage tiers
- Wide column stores (Cassandra, HBase): the partition key routes a request to a node, the clustering key sorts rows within the SSTable; the LSM-tree is the per-node storage engine
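The tombstone point above is easy to demonstrate: a delete is just another write that shadows older versions until compaction removes both. A minimal sketch (the `store` type and its methods are hypothetical illustrations, not a real engine):

```go
package main

import "fmt"

type record struct {
	value     string
	tombstone bool
}

// store models only the newest-version-wins behavior of an LSM-tree;
// in a real engine the tombstone would also shadow older versions of
// the key sitting in deeper SSTables until compaction drops them.
type store struct {
	memtable map[string]record
}

func (s *store) Put(k, v string) { s.memtable[k] = record{value: v} }

// Delete does not remove anything: it appends a tombstone marker.
func (s *store) Delete(k string) { s.memtable[k] = record{tombstone: true} }

func (s *store) Get(k string) (string, bool) {
	r, ok := s.memtable[k]
	if !ok || r.tombstone {
		return "", false // a tombstone reads as "not found"
	}
	return r.value, true
}

func main() {
	s := &store{memtable: map[string]record{}}
	s.Put("k", "v")
	s.Delete("k")
	fmt.Println(s.Get("k"))
}
```

This is also why heavy delete workloads degrade reads: every lookup may wade through tombstones before concluding a key is gone.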