Designing a distributed key-value store is a foundational system design question that tests your understanding of storage engines, replication, partitioning, and consistency. Amazon DynamoDB, Apache Cassandra, and Redis are production key-value stores with different design tradeoffs. This guide covers the internal architecture of an LSM-tree-based key-value store — from write path to compaction — with the depth expected at senior engineering interviews.
Write Path: WAL + Memtable + SSTable
The write path is optimized for speed using the Log-Structured Merge-tree (LSM-tree):

1. Write-Ahead Log (WAL): every write is first appended to a sequential log file on disk. This guarantees durability: if the process crashes, replay the WAL to recover in-flight writes. WAL appends are sequential I/O, the fastest access pattern on any storage device.

2. Memtable: after the WAL write, the key-value pair is inserted into an in-memory sorted data structure (typically a red-black tree or skip list). The memtable serves both reads and writes with O(log N) performance.

3. Flush to SSTable: when the memtable reaches a size threshold (e.g., 64 MB), it is written to disk as an SSTable (Sorted String Table), an immutable, sorted file of key-value pairs. The SSTable includes an index block (a sparse index mapping keys to byte offsets) and a Bloom filter (for fast “key not present” checks). After flushing, the corresponding WAL segment is deleted.

This design converts random writes into sequential writes (WAL append + sorted memtable flush), achieving high write throughput even on spinning disks.
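The three steps above can be sketched in a few dozen lines. This is a toy, assuming a plain dict as a stand-in for the sorted memtable and an in-memory list of SSTables; the names `KVStore` and `FLUSH_THRESHOLD` are illustrative, not from any real engine.

```python
# Minimal sketch of the LSM write path: WAL append, then memtable insert,
# then flush to an immutable sorted "SSTable" at a size threshold.
import os

FLUSH_THRESHOLD = 4  # entries; real engines use bytes (e.g., 64 MB)

class KVStore:
    def __init__(self, wal_path="wal.log"):
        self.wal = open(wal_path, "a")   # append-only: sequential I/O
        self.memtable = {}               # stand-in for a skip list / RB-tree
        self.sstables = []               # each flush adds one sorted run

    def put(self, key, value):
        # 1) Durability first: append to the WAL and fsync.
        self.wal.write(f"{key}\t{value}\n")
        self.wal.flush()
        os.fsync(self.wal.fileno())
        # 2) Then update the in-memory memtable.
        self.memtable[key] = value
        # 3) Flush when the memtable is full.
        if len(self.memtable) >= FLUSH_THRESHOLD:
            self._flush()

    def _flush(self):
        # Write the memtable out in sorted order: one sequential write.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        # The WAL entries covering this memtable are no longer needed.
        self.wal.truncate(0)
```

Note the ordering: the WAL write happens before the memtable insert, so a crash between the two loses nothing that was acknowledged.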
Read Path: Memtable + SSTables + Bloom Filters
To read a key:

1. Check the memtable: if the key is present, return the value. This is the fastest path (in-memory).

2. Check SSTables from newest to oldest: each SSTable is immutable and sorted. Use the Bloom filter to quickly determine whether the key might be in this SSTable (if the Bloom filter says “no,” skip it entirely, saving a disk read). If “maybe,” binary search the SSTable index to find the data block, then scan the block for the key.

3. Return the first match found: the newest SSTable wins, which handles updates and deletes correctly since newer data shadows older.

Read amplification: in the worst case (the key does not exist), the read must check every SSTable. With 10 SSTables, this could mean 10 disk reads. Bloom filters mitigate this: with a 1% false positive rate per SSTable, the probability of an unnecessary disk read is very low. Compaction (discussed below) also reduces the number of SSTables.

Caching: a block cache (an LRU cache of recently read SSTable blocks) in memory reduces disk reads for hot keys. Typical cache hit rates exceed 90% for workloads with skewed key access.
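The lookup order can be sketched as follows. The `BloomFilter` here is a toy built on double hashing over md5 (real engines use tuned filters at roughly 10 bits per key), and SSTables are represented as sorted in-memory lists for illustration.

```python
# Read path sketch: memtable first, then SSTables newest-to-oldest,
# consulting a per-SSTable Bloom filter before touching the file.
import bisect
import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _hashes(self, key):
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        h1, h2 = h & 0xFFFFFFFF, h >> 32
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for pos in self._hashes(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # "False" is definitive; "True" may be a false positive.
        return all(self.bits >> pos & 1 for pos in self._hashes(key))

def get(key, memtable, sstables):
    """sstables: newest first; each entry is (bloom, sorted [(k, v), ...])."""
    if key in memtable:                       # fastest path: in memory
        return memtable[key]
    for bloom, rows in sstables:
        if not bloom.might_contain(key):      # definite "no": skip the file
            continue
        i = bisect.bisect_left(rows, (key,))  # binary search the sorted run
        if i < len(rows) and rows[i][0] == key:
            return rows[i][1]                 # newest SSTable wins
    return None
```

Because the loop returns on the first match, a newer SSTable's value automatically shadows older versions of the same key.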
Compaction Strategies
As SSTables accumulate, read performance degrades (more files to check). Compaction merges SSTables to reduce their count and remove deleted or overwritten entries. Two strategies:

1. Size-tiered compaction: group SSTables of similar size; when N SSTables of similar size exist, merge them into one larger SSTable. Simple, with good write throughput. Downsides: space amplification (old data exists in multiple SSTables until compacted) and read amplification (many SSTables to check at each tier). Cassandra’s default.

2. Leveled compaction: organize SSTables into levels. Level 0 contains freshly flushed SSTables; each subsequent level is 10x larger. When level L exceeds its size limit, one SSTable from level L is merged with the overlapping SSTables in level L+1. Key property: within each level (except level 0), SSTables have non-overlapping key ranges, so at most one SSTable per level needs to be checked for any key, reducing read amplification. Downside: higher write amplification (each key may be rewritten through multiple levels). RocksDB and LevelDB use leveled compaction by default.

Choice: size-tiered for write-heavy workloads (fewer rewrites); leveled for read-heavy workloads (fewer SSTables to check per read).
Partitioning and Replication
A single-node key-value store cannot hold terabytes of data or sustain the request rates of a large service. Partitioning distributes data across multiple nodes.

Consistent hashing: each key is hashed to a position on a hash ring. Each node owns a range of the ring (with virtual nodes for even distribution). A key is stored on the node that owns its hash position. Adding a node moves only ~1/N of the keys (minimal disruption).

Replication: each key is replicated to N nodes (typically N=3) for durability and availability. The coordinator node receives the write and forwards it to the other replicas.

Quorum reads and writes: write to W nodes, read from R nodes. If W + R > N (the total replica count), reads are guaranteed to see the latest write (quorum overlap). The Amazon Dynamo paper’s defaults: N=3, W=2, R=2. Strong consistency option: read from the leader replica (which has all committed writes). Eventual consistency: read from any replica (may return stale data, but lower latency).

Hinted handoff: if a replica node is temporarily down, the coordinator stores the write locally (a “hint”) and forwards it when the replica recovers. This preserves availability during transient failures.
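A minimal consistent-hashing ring with virtual nodes might look like this. The `VNODES` count and the md5-based ring positions are illustrative choices, not any particular system's scheme:

```python
# Consistent hashing sketch: each physical node hashes to many ring
# positions (virtual nodes); a key is owned by the first node clockwise
# from its hash, and replicas are the next distinct nodes clockwise.
import bisect
import hashlib

VNODES = 64  # virtual nodes per physical node, for even distribution

def _hash(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        # The ring is a sorted list of (position, node) pairs.
        self.ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes for i in range(VNODES)
        )
        self.positions = [pos for pos, _ in self.ring]

    def owner(self, key):
        # First virtual node clockwise from the key's hash (wrapping).
        i = bisect.bisect(self.positions, _hash(key)) % len(self.ring)
        return self.ring[i][1]

    def replicas(self, key, n=3):
        # Walk clockwise collecting n distinct physical nodes.
        i = bisect.bisect(self.positions, _hash(key))
        out = []
        while len(out) < n:
            node = self.ring[i % len(self.ring)][1]
            if node not in out:
                out.append(node)
            i += 1
        return out
```

Removing or adding a node only changes ownership of the ring segments adjacent to that node's virtual positions, which is the "only ~1/N of keys move" property.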
Handling Deletes: Tombstones
In an LSM-tree, data is never modified in place: SSTables are immutable. A delete writes a special marker called a tombstone: key = “user:123”, value = DELETED, timestamp = now. The tombstone is written to the memtable and eventually flushed to an SSTable. When reading key “user:123,” the system finds the tombstone (the newest entry) and returns “not found.” During compaction, when the tombstone and the original entry are in SSTables being merged, both are removed (the tombstone has served its purpose).

Tombstone retention: tombstones must be kept long enough for all replicas to receive the delete. If a tombstone is removed before a lagging replica processes it, that replica may “resurrect” the deleted data. Cassandra uses gc_grace_seconds (default 10 days): tombstones are kept for 10 days before compaction removes them, giving replicas ample time to sync.

Too many tombstones degrade read performance (reads must skip over deleted entries), and range queries over heavily deleted data can be very slow. Monitor tombstone counts and compact aggressively for tables with frequent deletes.
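The tombstone lifecycle can be sketched in a few lines. `TOMBSTONE` and `GC_GRACE` are illustrative names standing in for the engine's delete marker and Cassandra-style gc_grace_seconds:

```python
# Tombstone sketch: a delete writes a marker instead of removing data,
# reads treat the marker as "not found", and compaction drops the marker
# only after a grace period has elapsed.
import time

TOMBSTONE = object()       # sentinel marking a deleted key
GC_GRACE = 10 * 24 * 3600  # tombstone retention in seconds (10 days)

def delete(memtable, key):
    # A delete is just another write: the tombstone marker plus a timestamp.
    memtable[key] = (TOMBSTONE, time.time())

def read(memtable, key):
    if key in memtable:
        value, _ts = memtable[key]
        return None if value is TOMBSTONE else value  # tombstone shadows data
    return None

def survives_compaction(value, ts, now):
    """Decide whether a (value, timestamp) entry is kept by compaction."""
    if value is TOMBSTONE and now - ts > GC_GRACE:
        return False  # grace period elapsed: drop the tombstone itself
    return True
```

The grace period matters because a tombstone dropped too early cannot shadow an old copy of the key still held by a lagging replica.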
DynamoDB Design Decisions
DynamoDB makes specific tradeoffs for managed simplicity:

1. Single-table design: DynamoDB encourages storing multiple entity types in one table with carefully designed partition keys and sort keys. This enables efficient access patterns without JOINs (which DynamoDB does not support).

2. Provisioned vs. on-demand capacity: provisioned mode specifies read and write capacity units upfront (cheaper for predictable workloads); on-demand mode charges per request with automatic scaling (simpler, but more expensive for steady traffic).

3. Global Secondary Indexes (GSIs): create alternate access patterns. A GSI is a full copy of the table with a different partition key. Writes to the base table are asynchronously replicated to GSIs, so GSI reads are eventually consistent.

4. DynamoDB Streams: a change data capture stream of all item-level modifications. Consumers (Lambda, Kinesis) process changes in near real time, enabling event sourcing, cross-region replication, and materialized views.

5. Transactions: DynamoDB supports ACID transactions across multiple items and tables (TransactWriteItems, TransactGetItems), limited to 100 items per transaction and implemented with an optimistic concurrency protocol internally.

In system design interviews: DynamoDB is the simplest answer for key-value workloads on AWS. Mention it when you need single-digit-millisecond reads and writes at any scale with zero operational overhead.
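The single-table pattern is easiest to see with concrete keys. This sketch models it without touching AWS: the `USER#`/`ORDER#` prefixes, the `PK`/`SK` attribute names, and the `query` helper (which emulates a Query with a begins_with sort-key condition) are illustrative conventions, not DynamoDB APIs.

```python
# Single-table design sketch: multiple entity types share one table,
# distinguished by composite partition/sort keys.

def user_item(user_id, name):
    return {"PK": f"USER#{user_id}", "SK": "PROFILE", "name": name}

def order_item(user_id, order_id, total):
    # Orders share the user's partition key, so one query on
    # PK = "USER#<id>" returns the profile and all orders together.
    return {"PK": f"USER#{user_id}", "SK": f"ORDER#{order_id}", "total": total}

def query(table, pk, sk_prefix=""):
    """Emulates Query with a begins_with(SK, prefix) condition."""
    return [item for item in table
            if item["PK"] == pk and item["SK"].startswith(sk_prefix)]
```

Fetching a user and all their orders becomes one partition read instead of a JOIN, which is exactly the access pattern the table's keys were designed around.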