Write Path
Every write in an LSM-tree storage engine lands first in two places simultaneously: the in-memory MemTable and the on-disk write-ahead log (WAL). The MemTable is an ordered data structure — typically a red-black tree or a skip list — that maintains keys in sorted order and supports O(log N) insert and lookup. Because it lives in memory, writes are fast and never cause random disk I/O. The WAL, written sequentially to disk, ensures that the MemTable contents survive a crash. When the MemTable grows beyond a configured size threshold (e.g., 64 MB in RocksDB defaults), it is frozen into an immutable MemTable and a new active MemTable is opened immediately so writes can continue without blocking. A background thread flushes the immutable MemTable to disk as a new SSTable file. The WAL segment corresponding to the flushed MemTable is then safe to delete. This design decouples write latency (memory speed) from flush latency (disk speed), achieving high write throughput at the cost of more complex read logic.
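The flow above can be sketched in a few lines of Python. This is an illustrative toy, not an engine: the MemTable is a plain dict rather than a skip list, WAL records are JSON lines rather than a binary format, the size threshold is an entry count, and the flush runs inline instead of on a background thread.

```python
import json
import os

class WritePath:
    """Toy LSM write path: WAL append for durability, then MemTable insert,
    with a freeze-and-flush once the MemTable exceeds its threshold."""

    def __init__(self, wal_path, memtable_limit=3):
        self.wal = open(wal_path, "a")
        self.memtable = {}        # stand-in for a skip list / red-black tree
        self.sstables = []        # each flush produces one sorted, immutable run
        self.seq = 0
        self.limit = memtable_limit

    def put(self, key, value):
        self.seq += 1
        # 1. Durability first: append the record to the WAL and force it to disk.
        self.wal.write(json.dumps({"seq": self.seq, "k": key, "v": value}) + "\n")
        self.wal.flush()
        os.fsync(self.wal.fileno())
        # 2. Fast path: insert into the in-memory MemTable.
        self.memtable[key] = value
        # 3. Freeze and flush: emit a sorted run (the "SSTable") and open a
        #    fresh MemTable so subsequent writes do not block on the flush.
        if len(self.memtable) >= self.limit:
            self.sstables.append(sorted(self.memtable.items()))
            self.memtable = {}    # the matching WAL segment is now obsolete
```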
SSTable Structure
A Sorted String Table (SSTable) is an immutable, sorted, on-disk file produced by flushing a MemTable or by compaction. Its internal layout consists of several block types. Data blocks hold the actual sorted key-value pairs, typically compressed (Snappy, Zstd) and sized at 4 KB or configurable multiples thereof. Index blocks map the first key of each data block to its file offset, enabling binary search to locate the right data block without reading the entire file. A Bloom filter block stores a per-SSTable probabilistic membership structure used to answer “does this key exist in this file?” before paying the cost of any data block read. A metaindex block stores offsets to other meta blocks, and a fixed-size footer at the end of the file records the offsets of the index block and metaindex block. Within a data block, key-value entries use prefix compression: only the delta from the previous key is stored, reducing storage for keys with common prefixes (e.g., UUIDs sharing a namespace prefix). Block-level compression ratios of 2-4x are typical for text-heavy workloads.
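The prefix-compression idea can be sketched as follows. This is a simplification: real block formats also use restart points (periodic full keys) so a reader can binary-search within a block, which this whole-block delta scheme omits.

```python
import os.path

def prefix_encode(sorted_keys):
    """Delta-encode sorted keys as (shared_prefix_len, suffix) pairs."""
    entries, prev = [], ""
    for key in sorted_keys:
        shared = len(os.path.commonprefix([prev, key]))
        entries.append((shared, key[shared:]))
        prev = key
    return entries

def prefix_decode(entries):
    """Rebuild the full keys by reapplying each stored delta."""
    keys, prev = [], ""
    for shared, suffix in entries:
        prev = prev[:shared] + suffix
        keys.append(prev)
    return keys
```

Keys sharing a namespace prefix shrink to short suffixes, which is why sorted order inside the block matters: it puts the common prefixes next to each other.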
Compaction
SSTables accumulate on disk as MemTables are flushed. Without compaction, point lookups degrade as more files must be checked, and deleted/overwritten keys waste space. Compaction merges multiple SSTables into fewer, larger ones, resolving duplicate keys by keeping only the entry with the highest sequence number and discarding tombstones (deletion markers) once they are guaranteed to be below the oldest active snapshot. Three major compaction strategies differ in their tradeoffs. Size-tiered compaction (Cassandra default, HBase) groups SSTables of similar size and merges groups when they accumulate; write amplification is low but space amplification is high (up to 2x live data during compaction). Leveled compaction (LevelDB, RocksDB default) organizes SSTables into levels where each level has a size limit 10x the previous; a level-L file is compacted with all overlapping files in level L+1; read amplification is bounded to O(number of levels) and space amplification is low (~1.1x), but write amplification is 10-30x. FIFO compaction drops the oldest SSTable when total size exceeds a limit — appropriate for time-series data where old data expires.
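The merge step can be sketched as below, under a simplifying assumption: this is a full compaction (all runs participate), so it is safe to drop a tombstone once it is below the oldest active snapshot. `TOMBSTONE` is a sentinel standing in for a real deletion marker.

```python
TOMBSTONE = object()   # stand-in for an engine's deletion marker

def compact(runs, oldest_snapshot_seq):
    """Merge sorted runs of (key, seq, value) into one sorted run.
    For each key the entry with the highest sequence number wins;
    tombstones below the oldest active snapshot are discarded
    (safe here only because every run participates in the merge)."""
    newest = {}
    for run in runs:
        for key, seq, value in run:
            if key not in newest or seq > newest[key][0]:
                newest[key] = (seq, value)
    out = []
    for key in sorted(newest):
        seq, value = newest[key]
        if value is TOMBSTONE and seq < oldest_snapshot_seq:
            continue   # every snapshot already sees the delete: drop the marker
        out.append((key, seq, value))
    return out
```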
Bloom Filter
A Bloom filter is a space-efficient probabilistic data structure that answers set membership queries with no false negatives and a tunable false positive rate. Each SSTable carries its own Bloom filter, built at flush time. The filter is parameterized by bits per key: at 10 bits/key the false positive rate is approximately 1%; at 20 bits/key it drops to ~0.1%. For a point lookup, the engine checks the SSTable’s Bloom filter before issuing any disk read — a negative result (key definitely absent) lets it skip the SSTable entirely, typically the common case in sparse workloads. Bloom filters are loaded into memory when an SSTable is opened and kept in the block cache. For a database with 100 SSTables and 10-bits/key filters, a lookup that hits no false positives touches only the MemTable and the one SSTable that contains the key, reading just two blocks (index and data). Without Bloom filters, the same lookup might scan all 100 SSTables’ index blocks. The filter is the single most impactful optimization for LSM point-lookup performance.
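A minimal filter along these lines is sketched below. It derives its probe positions by double hashing over a SHA-256 digest, a common textbook simplification; production engines use faster non-cryptographic hashes (e.g., xxHash) and a carefully tuned bit layout.

```python
import hashlib
import math

class BloomFilter:
    def __init__(self, num_keys, bits_per_key=10):
        self.m = max(64, num_keys * bits_per_key)           # total bits
        self.k = max(1, round(bits_per_key * math.log(2)))  # ~optimal probe count
        self.bits = bytearray((self.m + 7) // 8)

    def _probes(self, key):
        digest = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1
        # Double hashing simulates k independent hash functions.
        return ((h1 + i * h2) % self.m for i in range(self.k))

    def add(self, key):
        for p in self._probes(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        # False positives are possible; false negatives are not.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._probes(key))
```

At 10 bits/key this lands near the ~1% false positive rate quoted above; doubling the bits per key roughly squares the rate downward.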
Read Path
A point lookup traverses a hierarchy of data sources newest-to-oldest. First the active MemTable is checked (O(log N) in the skip list or red-black tree). If the key is not found, any immutable MemTables awaiting flush are checked. Then SSTables are searched from newest (Level 0) to oldest (higher levels), using each SSTable’s Bloom filter to skip files that cannot contain the key. When an SSTable is not ruled out by the Bloom filter, its index block is read to find the relevant data block offset, then that data block is read and decompressed to look up the key. Merging semantics handle multiple versions of the same key: the first (newest) occurrence with a live value wins, and a tombstone means the key was deleted. Range queries bypass Bloom filters (a range cannot be filtered at the file level) and must consult every SSTable whose key range overlaps the query range, making range scans LSM’s weakest point compared to B-trees. The block cache (typically an LRU cache of decompressed blocks) is critical to read performance — hot data blocks serve reads without disk I/O.
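The traversal order can be sketched as follows. For brevity each SSTable is modeled as a (filter, dict) pair where the filter is a plain Python set standing in for a Bloom filter (so it has no false positives), and `TOMBSTONE` is a deletion marker.

```python
TOMBSTONE = object()   # deletion marker

def get(key, memtable, immutables, sstables_newest_first):
    """Point lookup, newest source first; the first occurrence of the key wins."""
    # 1. Active MemTable, then any frozen MemTables awaiting flush.
    for table in (memtable, *immutables):
        if key in table:
            value = table[key]
            return None if value is TOMBSTONE else value
    # 2. SSTables, newest to oldest. The filter check costs no disk I/O;
    #    a miss lets us skip the file entirely.
    for bloom, data in sstables_newest_first:
        if key not in bloom:
            continue
        if key in data:
            value = data[key]
            return None if value is TOMBSTONE else value
    return None
```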
WAL and Durability
The write-ahead log is an append-only sequential file on disk. Each write appends a record containing the key, value, sequence number, and operation type (Put or Delete). The WAL write is synchronous by default (fsync per record or per batch), ensuring durability before acknowledging the write to the client. Many deployments configure group commit: buffer WAL records for a short interval (e.g., 1 ms) and fsync once for the batch, amortizing fsync overhead across concurrent writers. On crash recovery, the engine reads the WAL from the last valid checkpoint, replays each record into a fresh MemTable, and then normal operation resumes. The recovery time is bounded by the size of the unflushed MemTable at crash time — typically under a second for 64 MB MemTables. Once a MemTable is successfully flushed to an SSTable and the SSTable is synced, the corresponding WAL segment is obsolete and deleted. Multiple column families or write groups each maintain their own WAL to limit recovery scope.
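Crash recovery amounts to replaying the log in order. A sketch, assuming JSON-line records for readability; a real WAL uses length-prefixed, checksummed binary records, and the checksum is how a torn final record is actually detected rather than a parse failure.

```python
import json

def recover(wal_path):
    """Rebuild the MemTable by replaying WAL records oldest to newest.
    Later records overwrite earlier ones; an unparsable trailing record
    (a torn write from the crash) ends replay."""
    memtable = {}
    with open(wal_path) as wal:
        for line in wal:
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                break                      # torn final record: stop replay here
            if rec["op"] == "put":
                memtable[rec["k"]] = rec["v"]
            else:                          # "del": remove the key if present
                memtable.pop(rec["k"], None)
    return memtable
```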
Read and Write Amplification
Write amplification (WA) is the ratio of bytes physically written to disk per byte of user data written. Each byte of user data is written once to the WAL, once when the MemTable is flushed to Level 0, and then re-written during each compaction step as it moves down through the levels. For leveled compaction with a 10x size ratio, a byte may be rewritten several times at each level transition in the worst case, and measured WA of 10-30x is typical in practice. Size-tiered compaction achieves WA closer to ~10x but with higher space amplification. Read amplification (RA) for a point lookup is bounded by the number of SSTables checked after Bloom filter elimination — typically close to 1 for leveled compaction on non-hot keys. Range scans see higher RA since Bloom filters cannot help; RA equals the number of SSTables with overlapping key ranges, at most O(L * files_per_level). LSM-trees trade read amplification for write throughput compared to B-trees. The right choice depends on workload: write-heavy append-style workloads (logging, event streams, time-series) favor LSM; read-heavy or random-read workloads favor B-trees.
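A back-of-envelope worst-case model, as a one-liner. This is my own simplification under the assumption that a byte crosses every level exactly once and drags up to `fanout` bytes of overlapping next-level data with it each time; real engines report measured WA via statistics, and typical measured values sit well below this bound.

```python
def leveled_wa_worst_case(num_levels, fanout):
    """Worst-case write amplification for leveled compaction:
    1x for the WAL, 1x for the flush to Level 0, and up to `fanout`
    rewrites per byte for each of the remaining level transitions."""
    return 1 + 1 + (num_levels - 1) * fanout
```

For example, a modest 4-level tree with the usual 10x fanout bounds WA at 32x, consistent with the 10-30x range measured in practice.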
Real-World Implementations
RocksDB (Facebook, 2012) is the dominant LSM-tree library, forked from Google’s LevelDB with production-grade features: column families, transactions, rate limiting, compaction filters, and extensive tuning knobs. It underpins MyRocks (MySQL), MongoRocks, and numerous distributed databases. Apache Cassandra uses an LSM-tree engine for its write-optimized storage, with size-tiered compaction as the default and leveled compaction available for read-heavy use cases. ScyllaDB reimplements Cassandra in C++ with a per-core LSM design for NUMA efficiency. Pebble is CockroachDB’s Go-native LSM engine, a clean-room reimplementation of RocksDB semantics with tighter integration into CockroachDB’s MVCC layer. HBase builds its LSM-tree on HDFS, using HDFS blocks as SSTable storage and ZooKeeper for coordination, targeting batch analytics workloads on Hadoop infrastructure.
What is the write path in an LSM-tree?
Writes are appended to the Write-Ahead Log (WAL) for durability, then inserted into the in-memory MemTable (a sorted structure, typically a skip list). When the MemTable reaches a size threshold, it is flushed to disk as an immutable SSTable file. This makes writes sequential, achieving high write throughput regardless of key distribution.
How does compaction work in an LSM-tree?
Compaction merges multiple SSTable files, deduplicating overwritten keys and discarding deleted (tombstoned) keys. Size-tiered compaction merges similarly sized SSTables. Leveled compaction (used by RocksDB and LevelDB) maintains per-level size limits and merges SSTables from one level into the next, producing fewer overlapping key ranges.
How does a Bloom filter improve LSM-tree read performance?
Each SSTable has an associated Bloom filter: a probabilistic data structure that answers ‘is this key possibly in this file?’ with no false negatives and a tunable false positive rate. On a point read, the engine checks the Bloom filter before doing any disk I/O. Files that definitely do not contain the key are skipped, reducing read amplification dramatically.
What is read amplification in an LSM-tree?
Read amplification is the number of disk reads required per logical read. In the worst case, a point read must check the MemTable plus every SSTable level. Bloom filters and per-level index blocks reduce this, but leveled compaction is specifically designed to minimize read amplification by bounding the number of overlapping SSTables per key.
How does RocksDB differ from LevelDB?
RocksDB is a fork of LevelDB with production-grade enhancements: column families for logical namespace isolation, multi-threaded compaction, compression per level, rate limiting for compaction I/O, statistics and monitoring hooks, and support for transactions. RocksDB is used as the storage engine in MySQL (MyRocks), CockroachDB, and TiKV.