Designing a distributed key-value store is a foundational system design question that tests your understanding of storage engines, replication, partitioning, and consistency. Amazon DynamoDB, Apache Cassandra, and Redis are production key-value stores with different design tradeoffs. This guide covers the internal architecture of an LSM-tree-based key-value store — from write path to compaction — with the depth expected at senior engineering interviews.
Write Path: WAL + Memtable + SSTable
The write path is optimized for speed using the Log-Structured Merge-tree (LSM-tree): (1) Write-Ahead Log (WAL) — every write is first appended to a sequential log file on disk. This guarantees durability: if the process crashes, replay the WAL to recover in-flight writes. WAL writes are sequential I/O — the fastest operation on any storage device. (2) Memtable — after the WAL write, the key-value pair is inserted into an in-memory sorted data structure (typically a red-black tree or skip list). The memtable serves both reads and writes with O(log N) performance. (3) Flush to SSTable — when the memtable reaches a size threshold (e.g., 64 MB), it is written to disk as an SSTable (Sorted String Table) — an immutable, sorted file of key-value pairs. The SSTable includes an index block (sparse index of keys to byte offsets) and a Bloom filter (for fast “key not present” checks). After flushing, the corresponding WAL segment is deleted. This design converts random writes into sequential writes (WAL append + sorted memtable flush), achieving high write throughput even on spinning disks.
Read Path: Memtable + SSTables + Bloom Filters
To read a key: (1) Check the memtable — if the key is present, return the value. This is the fastest path (in-memory). (2) Check SSTables from newest to oldest — each SSTable is immutable and sorted. Use the Bloom filter to quickly determine if the key might be in this SSTable (if the Bloom filter says “no,” skip it entirely — saving a disk read). If “maybe yes,” binary search the SSTable index to find the data block, then scan the block for the key. (3) Return the first match found (newest SSTable wins — this handles updates and deletes correctly since newer data overwrites older). Read amplification: in the worst case (key does not exist), the read must check every SSTable. With 10 levels of SSTables, this could mean 10 disk reads. Bloom filters mitigate this: with a 1% false positive rate per SSTable, the probability of a unnecessary disk read is very low. Compaction (discussed below) also reduces the number of SSTables. Caching: a block cache (LRU cache of recently read SSTable blocks) in memory reduces disk reads for hot keys. Typical cache hit rates exceed 90% for workloads with key access skew.
Compaction Strategies
As SSTables accumulate, read performance degrades (more files to check). Compaction merges SSTables to reduce their count and remove deleted/overwritten entries. Two strategies: (1) Size-tiered compaction — group SSTables of similar size. When N SSTables of similar size exist, merge them into one larger SSTable. Simple, good write throughput. Downside: space amplification (old data exists in multiple SSTables until compacted) and read amplification (many SSTables to check at each level). Cassandra default. (2) Leveled compaction — organize SSTables into levels. Level 0 contains freshly flushed SSTables. Each subsequent level is 10x larger. When level L exceeds its size limit, one SSTable from level L is merged with overlapping SSTables from level L+1. Key property: at each level (except level 0), SSTables have non-overlapping key ranges. This guarantees at most one SSTable per level needs to be checked for any key — reducing read amplification. Downside: higher write amplification (each key may be rewritten through multiple levels). RocksDB and LevelDB use leveled compaction by default. Choice: size-tiered for write-heavy workloads (fewer rewrites). Leveled for read-heavy workloads (fewer SSTables to check per read).
Partitioning and Replication
A single-node key-value store cannot handle terabytes of data or thousands of requests per second. Partitioning distributes data across multiple nodes. Consistent hashing: each key is hashed to a position on a hash ring. Each node owns a range of the ring (with virtual nodes for even distribution). A key is stored on the node that owns its hash position. Adding a node moves only ~1/N of keys (minimal disruption). Replication: each key is replicated to R nodes (typically R=3) for durability and availability. The coordinator node receives the write and forwards it to R-1 replica nodes. Quorum reads and writes: write to W nodes, read from R nodes. If W + R > N (total replicas), reads are guaranteed to see the latest write (quorum overlap). DynamoDB defaults: N=3, W=2, R=2. Strong consistency option: read from the leader replica (which has all committed writes). Eventual consistency: read from any replica (may return stale data but lower latency). Hinted handoff: if a replica node is temporarily down, the coordinator stores the write locally (a “hint”) and forwards it when the replica recovers. This provides availability during transient failures.
Handling Deletes: Tombstones
In an LSM-tree, data is never modified in place — SSTables are immutable. A delete writes a special marker called a tombstone: key = “user:123”, value = DELETED, timestamp = now. The tombstone is written to the memtable and eventually flushed to an SSTable. When reading key “user:123,” the system finds the tombstone (newest entry) and returns “not found.” During compaction, when the tombstone and the original entry are in SSTables being merged, both are removed (the tombstone has served its purpose). Tombstone retention: tombstones must be kept long enough for all replicas to receive the delete. If a tombstone is removed before a lagging replica processes it, the replica may “resurrect” the deleted data. Cassandra uses gc_grace_seconds (default 10 days) — tombstones are kept for 10 days before compaction removes them. This gives replicas ample time to sync. Too many tombstones degrade read performance (the read must skip over deleted entries). Range queries over heavily deleted data can be very slow. Monitor tombstone counts and compact aggressively for tables with frequent deletes.
DynamoDB Design Decisions
DynamoDB makes specific tradeoffs for managed simplicity: (1) Single-table design — DynamoDB encourages storing multiple entity types in one table with carefully designed partition keys and sort keys. This enables efficient access patterns without JOINs (which DynamoDB does not support). (2) Provisioned vs on-demand capacity — provisioned mode: specify read and write capacity units upfront (cheaper for predictable workloads). On-demand: pay per request with automatic scaling (simpler but more expensive for steady traffic). (3) Global Secondary Indexes (GSI) — create alternate access patterns. A GSI is a full copy of the table with a different partition key. Writes to the base table are asynchronously replicated to GSIs (eventual consistency for GSI reads). (4) DynamoDB Streams — a change data capture stream of all item-level modifications. Consumers (Lambda, Kinesis) process changes in near real-time. Enables event sourcing, cross-region replication, and materialized views. (5) Transactions — DynamoDB supports ACID transactions across multiple items and tables (TransactWriteItems, TransactGetItems). Limited to 100 items per transaction. Uses an optimistic concurrency protocol internally. In system design interviews: DynamoDB is the simplest answer for key-value workloads on AWS. Mention it when you need single-digit millisecond reads/writes at any scale with zero operational overhead.