Designing a distributed key-value store is a foundational system design question that tests your understanding of storage engines, replication, partitioning, and consistency. Amazon DynamoDB, Apache Cassandra, and Redis are production key-value stores with different design tradeoffs. This guide covers the internal architecture of an LSM-tree-based key-value store — from write path to compaction — with the depth expected at senior engineering interviews.
Write Path: WAL + Memtable + SSTable
The write path is optimized for speed using the Log-Structured Merge-tree (LSM-tree):

1. Write-Ahead Log (WAL): every write is first appended to a sequential log file on disk. This guarantees durability: if the process crashes, replay the WAL to recover in-flight writes. WAL appends are sequential I/O, the fastest access pattern on any storage device.

2. Memtable: after the WAL write, the key-value pair is inserted into an in-memory sorted data structure (typically a red-black tree or skip list). The memtable serves both reads and writes with O(log N) performance.

3. Flush to SSTable: when the memtable reaches a size threshold (e.g., 64 MB), it is written to disk as an SSTable (Sorted String Table), an immutable, sorted file of key-value pairs. The SSTable includes an index block (a sparse index mapping keys to byte offsets) and a Bloom filter (for fast “key not present” checks). After flushing, the corresponding WAL segment is deleted.

This design converts random writes into sequential writes (WAL append + sorted memtable flush), achieving high write throughput even on spinning disks.
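The three steps above can be sketched in a few dozen lines. This is a toy, assuming a plain dict as a stand-in for the sorted memtable and an in-memory list of SSTables; the names `KVStore` and `FLUSH_THRESHOLD` are illustrative, not from any real engine.

```python
# Minimal sketch of the LSM write path: WAL append, then memtable insert,
# then flush to an immutable sorted "SSTable" at a size threshold.
import os

FLUSH_THRESHOLD = 4  # entries; real engines use bytes (e.g., 64 MB)

class KVStore:
    def __init__(self, wal_path="wal.log"):
        self.wal = open(wal_path, "a")   # append-only: sequential I/O
        self.memtable = {}               # stand-in for a skip list / RB-tree
        self.sstables = []               # each flush adds one sorted run

    def put(self, key, value):
        # 1) Durability first: append to the WAL and fsync.
        self.wal.write(f"{key}\t{value}\n")
        self.wal.flush()
        os.fsync(self.wal.fileno())
        # 2) Then update the in-memory memtable.
        self.memtable[key] = value
        # 3) Flush when the memtable is full.
        if len(self.memtable) >= FLUSH_THRESHOLD:
            self._flush()

    def _flush(self):
        # Write the memtable out in sorted order: one sequential write.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        # The WAL entries covering this memtable are no longer needed.
        self.wal.truncate(0)
```

Note the ordering: the WAL write happens before the memtable insert, so a crash between the two loses nothing that was acknowledged.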
Read Path: Memtable + SSTables + Bloom Filters
To read a key:

1. Check the memtable: if the key is present, return the value. This is the fastest path (in-memory).

2. Check SSTables from newest to oldest: each SSTable is immutable and sorted. Use the Bloom filter to quickly determine whether the key might be in this SSTable (if the Bloom filter says “no,” skip it entirely, saving a disk read). If “maybe,” binary search the SSTable index to find the data block, then scan the block for the key.

3. Return the first match found: the newest SSTable wins, which handles updates and deletes correctly since newer data shadows older.

Read amplification: in the worst case (the key does not exist), the read must check every SSTable. With 10 SSTables, this could mean 10 disk reads. Bloom filters mitigate this: with a 1% false positive rate per SSTable, the probability of an unnecessary disk read is very low. Compaction (discussed below) also reduces the number of SSTables.

Caching: a block cache (an LRU cache of recently read SSTable blocks) in memory reduces disk reads for hot keys. Typical cache hit rates exceed 90% for workloads with skewed key access.
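The lookup order can be sketched as follows. The `BloomFilter` here is a toy built on double hashing over md5 (real engines use tuned filters at roughly 10 bits per key), and SSTables are represented as sorted in-memory lists for illustration.

```python
# Read path sketch: memtable first, then SSTables newest-to-oldest,
# consulting a per-SSTable Bloom filter before touching the file.
import bisect
import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _hashes(self, key):
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        h1, h2 = h & 0xFFFFFFFF, h >> 32
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for pos in self._hashes(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # "False" is definitive; "True" may be a false positive.
        return all(self.bits >> pos & 1 for pos in self._hashes(key))

def get(key, memtable, sstables):
    """sstables: newest first; each entry is (bloom, sorted [(k, v), ...])."""
    if key in memtable:                       # fastest path: in memory
        return memtable[key]
    for bloom, rows in sstables:
        if not bloom.might_contain(key):      # definite "no": skip the file
            continue
        i = bisect.bisect_left(rows, (key,))  # binary search the sorted run
        if i < len(rows) and rows[i][0] == key:
            return rows[i][1]                 # newest SSTable wins
    return None
```

Because the loop returns on the first match, a newer SSTable's value automatically shadows older versions of the same key.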
Compaction Strategies
As SSTables accumulate, read performance degrades (more files to check). Compaction merges SSTables to reduce their count and remove deleted or overwritten entries. Two strategies:

1. Size-tiered compaction: group SSTables of similar size; when N SSTables of similar size exist, merge them into one larger SSTable. Simple, with good write throughput. Downsides: space amplification (old data exists in multiple SSTables until compacted) and read amplification (many SSTables to check at each tier). Cassandra’s default.

2. Leveled compaction: organize SSTables into levels. Level 0 contains freshly flushed SSTables; each subsequent level is 10x larger. When level L exceeds its size limit, one SSTable from level L is merged with the overlapping SSTables in level L+1. Key property: within each level (except level 0), SSTables have non-overlapping key ranges, so at most one SSTable per level needs to be checked for any key, reducing read amplification. Downside: higher write amplification (each key may be rewritten through multiple levels). RocksDB and LevelDB use leveled compaction by default.

Choice: size-tiered for write-heavy workloads (fewer rewrites); leveled for read-heavy workloads (fewer SSTables to check per read).
Partitioning and Replication
A single-node key-value store cannot hold terabytes of data or sustain the request rates of a large service. Partitioning distributes data across multiple nodes.

Consistent hashing: each key is hashed to a position on a hash ring. Each node owns a range of the ring (with virtual nodes for even distribution). A key is stored on the node that owns its hash position. Adding a node moves only ~1/N of the keys (minimal disruption).

Replication: each key is replicated to N nodes (typically N=3) for durability and availability. The coordinator node receives the write and forwards it to the other replicas.

Quorum reads and writes: write to W nodes, read from R nodes. If W + R > N (the total replica count), reads are guaranteed to see the latest write (quorum overlap). The Amazon Dynamo paper’s defaults: N=3, W=2, R=2. Strong consistency option: read from the leader replica (which has all committed writes). Eventual consistency: read from any replica (may return stale data, but lower latency).

Hinted handoff: if a replica node is temporarily down, the coordinator stores the write locally (a “hint”) and forwards it when the replica recovers. This preserves availability during transient failures.
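A minimal consistent-hashing ring with virtual nodes might look like this. The `VNODES` count and the md5-based ring positions are illustrative choices, not any particular system's scheme:

```python
# Consistent hashing sketch: each physical node hashes to many ring
# positions (virtual nodes); a key is owned by the first node clockwise
# from its hash, and replicas are the next distinct nodes clockwise.
import bisect
import hashlib

VNODES = 64  # virtual nodes per physical node, for even distribution

def _hash(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        # The ring is a sorted list of (position, node) pairs.
        self.ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes for i in range(VNODES)
        )
        self.positions = [pos for pos, _ in self.ring]

    def owner(self, key):
        # First virtual node clockwise from the key's hash (wrapping).
        i = bisect.bisect(self.positions, _hash(key)) % len(self.ring)
        return self.ring[i][1]

    def replicas(self, key, n=3):
        # Walk clockwise collecting n distinct physical nodes.
        i = bisect.bisect(self.positions, _hash(key))
        out = []
        while len(out) < n:
            node = self.ring[i % len(self.ring)][1]
            if node not in out:
                out.append(node)
            i += 1
        return out
```

Removing or adding a node only changes ownership of the ring segments adjacent to that node's virtual positions, which is the "only ~1/N of keys move" property.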
Handling Deletes: Tombstones
In an LSM-tree, data is never modified in place: SSTables are immutable. A delete writes a special marker called a tombstone: key = “user:123”, value = DELETED, timestamp = now. The tombstone is written to the memtable and eventually flushed to an SSTable. When reading key “user:123,” the system finds the tombstone (the newest entry) and returns “not found.” During compaction, when the tombstone and the original entry are in SSTables being merged, both are removed (the tombstone has served its purpose).

Tombstone retention: tombstones must be kept long enough for all replicas to receive the delete. If a tombstone is removed before a lagging replica processes it, that replica may “resurrect” the deleted data. Cassandra uses gc_grace_seconds (default 10 days): tombstones are kept for 10 days before compaction removes them, giving replicas ample time to sync.

Too many tombstones degrade read performance (reads must skip over deleted entries), and range queries over heavily deleted data can be very slow. Monitor tombstone counts and compact aggressively for tables with frequent deletes.
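The tombstone lifecycle can be sketched in a few lines. `TOMBSTONE` and `GC_GRACE` are illustrative names standing in for the engine's delete marker and Cassandra-style gc_grace_seconds:

```python
# Tombstone sketch: a delete writes a marker instead of removing data,
# reads treat the marker as "not found", and compaction drops the marker
# only after a grace period has elapsed.
import time

TOMBSTONE = object()       # sentinel marking a deleted key
GC_GRACE = 10 * 24 * 3600  # tombstone retention in seconds (10 days)

def delete(memtable, key):
    # A delete is just another write: the tombstone marker plus a timestamp.
    memtable[key] = (TOMBSTONE, time.time())

def read(memtable, key):
    if key in memtable:
        value, _ts = memtable[key]
        return None if value is TOMBSTONE else value  # tombstone shadows data
    return None

def survives_compaction(value, ts, now):
    """Decide whether a (value, timestamp) entry is kept by compaction."""
    if value is TOMBSTONE and now - ts > GC_GRACE:
        return False  # grace period elapsed: drop the tombstone itself
    return True
```

The grace period matters because a tombstone dropped too early cannot shadow an old copy of the key still held by a lagging replica.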
DynamoDB Design Decisions
DynamoDB makes specific tradeoffs for managed simplicity:

1. Single-table design: DynamoDB encourages storing multiple entity types in one table with carefully designed partition keys and sort keys. This enables efficient access patterns without JOINs (which DynamoDB does not support).

2. Provisioned vs. on-demand capacity: provisioned mode specifies read and write capacity units upfront (cheaper for predictable workloads); on-demand mode charges per request with automatic scaling (simpler, but more expensive for steady traffic).

3. Global Secondary Indexes (GSIs): create alternate access patterns. A GSI is a full copy of the table with a different partition key. Writes to the base table are asynchronously replicated to GSIs, so GSI reads are eventually consistent.

4. DynamoDB Streams: a change data capture stream of all item-level modifications. Consumers (Lambda, Kinesis) process changes in near real time, enabling event sourcing, cross-region replication, and materialized views.

5. Transactions: DynamoDB supports ACID transactions across multiple items and tables (TransactWriteItems, TransactGetItems), limited to 100 items per transaction and implemented with an optimistic concurrency protocol internally.

In system design interviews: DynamoDB is the simplest answer for key-value workloads on AWS. Mention it when you need single-digit-millisecond reads and writes at any scale with zero operational overhead.
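The single-table pattern is easiest to see with concrete keys. This sketch models it without touching AWS: the `USER#`/`ORDER#` prefixes, the `PK`/`SK` attribute names, and the `query` helper (which emulates a Query with a begins_with sort-key condition) are illustrative conventions, not DynamoDB APIs.

```python
# Single-table design sketch: multiple entity types share one table,
# distinguished by composite partition/sort keys.

def user_item(user_id, name):
    return {"PK": f"USER#{user_id}", "SK": "PROFILE", "name": name}

def order_item(user_id, order_id, total):
    # Orders share the user's partition key, so one query on
    # PK = "USER#<id>" returns the profile and all orders together.
    return {"PK": f"USER#{user_id}", "SK": f"ORDER#{order_id}", "total": total}

def query(table, pk, sk_prefix=""):
    """Emulates Query with a begins_with(SK, prefix) condition."""
    return [item for item in table
            if item["PK"] == pk and item["SK"].startswith(sk_prefix)]
```

Fetching a user and all their orders becomes one partition read instead of a JOIN, which is exactly the access pattern the table's keys were designed around.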