Design a distributed key-value store like DynamoDB, Cassandra (its KV layer), or Redis Cluster. This is one of the most technically deep system design problems — it forces you to reason about consistency models, replication, failure detection, and storage engines simultaneously.
Requirements Clarification
- API: get(key), put(key, value), delete(key). No range scans.
- Scale: 100 billion keys, each value up to 10KB. ~1PB total data.
- Latency: p99 < 10ms for reads and writes.
- Availability: 99.999% (five nines). Tolerate node failures without downtime.
- Consistency: Tunable — strong consistency for financial data, eventual for social features. Clarify with the interviewer which model to design for.
- Durability: Data persists through node failures and restarts.
Back-of-Envelope
- 100 billion keys × 10KB per value ≈ 1PB of raw data. At 10TB of SSD per node, one copy needs 100 nodes; with 3× replication → 300 nodes total.
- 100K writes/sec, 500K reads/sec: a read-heavy 5:1 mix, typical of OLTP workloads
- Each node: 10TB SSD, 64GB RAM, 10Gbps NIC
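As a sanity check, the arithmetic above can be scripted (figures are the assumed estimates from this section; decimal units, 1TB = 10^12 bytes):

```python
# Back-of-envelope capacity math for the estimates above (all figures assumed).
KEYS = 100 * 10**9            # 100 billion keys
VALUE_BYTES = 10_000          # 10KB per value
NODE_CAPACITY = 10 * 10**12   # 10TB SSD per node
REPLICATION = 3

raw_bytes = KEYS * VALUE_BYTES               # 1e15 bytes = 1PB of raw data
nodes_one_copy = raw_bytes // NODE_CAPACITY  # nodes needed for a single copy
total_nodes = nodes_one_copy * REPLICATION   # with 3x replication

print(raw_bytes, nodes_one_copy, total_nodes)  # 1000000000000000 100 300
```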
Core Architecture: Consistent Hashing Ring
The fundamental problem: how do you decide which node stores key K? Naive modulo hashing breaks when you add or remove nodes — nearly every key remaps. The solution is consistent hashing.
Arrange all nodes on a ring (hash space 0 to 2^32). A key’s responsible node is the first node clockwise from hash(key). Add a node: only the keys between the new node and its predecessor are remapped. Remove a node: only its keys move to the next node clockwise.
Virtual nodes (vnodes): Each physical node owns 100–200 positions on the ring. This provides better load distribution (avoids hot spots from uneven hashing) and faster rebalancing when nodes join or leave. The Dynamo paper introduced vnodes; Cassandra implements them as multiple tokens per physical node.
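A minimal ring sketch in Python, assuming MD5 as the ring hash and hypothetical node names; `bisect` finds the first vnode clockwise from hash(key):

```python
import bisect
import hashlib

def ring_hash(s: str) -> int:
    """Hash a string onto the 0..2^32 ring (MD5 chosen for illustration)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2**32)

class HashRing:
    def __init__(self, nodes, vnodes=150):
        # Each physical node owns `vnodes` positions on the ring.
        self._ring = sorted(
            (ring_hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def lookup(self, key: str) -> str:
        # First vnode clockwise from hash(key), wrapping past 2^32 to 0.
        i = bisect.bisect(self._points, ring_hash(key)) % len(self._ring)
        return self._ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.lookup("user:123")
```

With 150 vnodes per node, the per-node share of the keyspace concentrates around 1/3 each, which is the load-smoothing effect described above.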
Replication
Each key is stored on N nodes (typically N=3). The primary node (coordinator) replicates to the next N-1 nodes clockwise on the ring.
Key "user:123" hashes to position P
→ Stored on nodes at positions P, P+1, P+2 (ring order)
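The placement above can be sketched as a clockwise walk from hash(key) that collects N distinct physical nodes, skipping further vnodes of already-chosen nodes (hash function and node names are illustrative):

```python
import bisect
import hashlib

def ring_hash(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2**32)

def preference_list(ring, key, n=3):
    """Walk clockwise from hash(key), collecting n distinct physical nodes.
    ring: sorted list of (position, node) vnode entries."""
    points = [point for point, _ in ring]
    start = bisect.bisect(points, ring_hash(key))
    replicas = []
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)][1]
        if node not in replicas:        # skip extra vnodes of chosen nodes
            replicas.append(node)
        if len(replicas) == n:
            break
    return replicas

nodes = ["node-a", "node-b", "node-c", "node-d"]
ring = sorted((ring_hash(f"{n}#{v}"), n) for n in nodes for v in range(8))
replicas = preference_list(ring, "user:123")
```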
Replication is synchronous or asynchronous depending on your consistency model. Asynchronous replication: coordinator writes to primary, returns success immediately, replicates in background. Low latency but risk of data loss on failure. Synchronous: wait for W replicas to acknowledge before returning.
Consistency: Quorum Reads and Writes
With N replicas, define W (write quorum) and R (read quorum). For strong consistency: W + R > N.
- N=3, W=2, R=2: W+R=4 > 3 → strong consistency
- N=3, W=1, R=3: also strong, but slow reads (every read touches all replicas)
- N=3, W=1, R=1: eventual consistency; fastest, but stale reads possible
The original Dynamo paper’s common configuration was N=3, W=2, R=2. DynamoDB itself defaults to eventually consistent reads, with strongly consistent reads available per request.
At read time with eventual consistency: different replicas may have different values. The coordinator reads from R nodes and uses conflict resolution to pick a winner.
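A sketch of the coordinator-side quorum read with last-writer-wins resolution, assuming replicas return hypothetical (timestamp, value) pairs:

```python
def quorum_read(responses, r=2):
    """responses: (timestamp, value) pairs from replicas, in arrival order.
    Waits for r responses, then resolves with last-writer-wins."""
    if len(responses) < r:
        raise TimeoutError("read quorum not reached")
    first_r = responses[:r]
    # Highest timestamp wins; timestamp ties break on the value, which is
    # at least deterministic across coordinators.
    timestamp, value = max(first_r)
    return value

# Replica 2 holds a newer version than replicas 1 and 3:
print(quorum_read([(100, "old"), (105, "new"), (100, "old")], r=2))  # "new"
```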
Conflict Resolution
Last-writer-wins (LWW): Every write gets a timestamp. Conflicts are resolved by taking the highest timestamp. Simple, but it relies on synchronized clocks, and concurrent writes silently lose data; hybrid logical clocks (HLCs) can bound the damage from clock skew.
Vector clocks: Each key tracks a vector of (node, version) pairs. When writes on different nodes diverge, the vector clock detects the conflict. The application (or a predefined rule) resolves it. Amazon’s shopping cart used vector clocks to merge two concurrent add-to-cart operations rather than picking one.
Most production systems use LWW for simplicity and accept the rare data loss risk. Only choose vector clocks if the merge semantics are clear (e.g., CRDT-friendly data types).
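A minimal vector-clock comparison, assuming clocks are plain {node: version} dicts; "concurrent" is the case the application-level merge must handle:

```python
def compare(a: dict, b: dict) -> str:
    """Compare two vector clocks {node: version}.
    Returns 'a' or 'b' if one dominates, 'equal' if identical,
    'concurrent' if the writes diverged and need a merge."""
    members = set(a) | set(b)
    a_ge = all(a.get(n, 0) >= b.get(n, 0) for n in members)
    b_ge = all(b.get(n, 0) >= a.get(n, 0) for n in members)
    if a_ge and b_ge:
        return "equal"
    if a_ge:
        return "a"
    if b_ge:
        return "b"
    return "concurrent"

# Two carts updated on different coordinators: neither clock dominates,
# so the shopping-cart merge (union of items) would kick in here.
print(compare({"node-a": 2, "node-b": 1}, {"node-a": 1, "node-b": 2}))
```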
Failure Detection: Gossip Protocol
In a 300-node cluster, you can’t have every node ping every other node — that’s O(N²) messages. Instead, use gossip:
- Each node maintains a membership list with heartbeat counters
- Every second, each node randomly picks K other nodes and exchanges its membership list
- Received heartbeats update local counters. A node not seen for T seconds is marked suspect, then failed.
Gossip propagates information in O(log N) rounds — exponentially fast. Cassandra’s gossip protocol is the reference implementation.
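A toy push-gossip round over assumed in-memory views, showing how a bumped heartbeat counter spreads; receivers keep the highest counter they have seen per member:

```python
import random

def gossip_round(views, k=2):
    """One gossip round: every node pushes its membership view
    (member -> heartbeat counter) to k random peers; receivers merge
    by keeping the max counter per member."""
    names = list(views)
    for name in names:
        for peer in random.sample([n for n in names if n != name], k):
            for member, heartbeat in views[name].items():
                views[peer][member] = max(views[peer].get(member, 0), heartbeat)

random.seed(7)  # deterministic demo
nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
views = {n: {n: 0} for n in nodes}   # each node starts knowing only itself
views["node-a"]["node-a"] = 5        # node-a bumps its own heartbeat

for _ in range(10):                  # O(log N) rounds suffice w.h.p.
    gossip_round(views)
```

A failure detector on top of this would mark a member suspect when its counter stops advancing for T seconds, as described above.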
Storage Engine: LSM Tree
Why not a B-tree? B-trees update pages in place, which turns a write-heavy workload into random I/O and write amplification at scale. LSM (Log-Structured Merge) trees convert random writes to sequential ones:
- MemTable: All writes go to an in-memory sorted tree (Red-Black or skip list). Also appended to a write-ahead log (WAL) for durability.
- SSTable: When MemTable hits a size threshold, it’s flushed to disk as an immutable sorted file (SSTable).
- Compaction: Background process merges SSTables, removing obsolete versions and deleted keys. Keeps read performance from degrading.
Read path: check MemTable → check bloom filter (does this SSTable contain the key?) → binary search in SSTable.
Bloom filters are critical for read performance. Each SSTable has a bloom filter — a probabilistic set membership structure. Check the bloom filter before touching disk; if it says “not present,” skip that SSTable entirely. 1% false positive rate, ~10 bits per element.
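A minimal bloom filter sketch (MD5-derived bit positions are illustrative; real implementations use faster hashes like murmur):

```python
import hashlib

class BloomFilter:
    """Toy bloom filter: k bit positions per key. Around 10 bits per
    element with k=7 gives roughly a 1% false-positive rate."""
    def __init__(self, bits=10_000, k=7):
        self.bits, self.k = bits, k
        self.array = bytearray(bits // 8 + 1)

    def _positions(self, key: str):
        # Derive k positions by salting the key with the probe index.
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.bits

    def add(self, key: str):
        for p in self._positions(key):
            self.array[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: str) -> bool:
        # False means definitely absent: skip this SSTable entirely.
        return all(self.array[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

bf = BloomFilter()
bf.add("user:123")
```

On the read path, one filter per SSTable is consulted before any disk I/O; only a "maybe present" answer costs a seek.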
Handling Node Failures: Hinted Handoff
Node B is down. A write arrives for a key owned by B. The coordinator routes the write to node C temporarily, with a “hint” recording that it belongs to B. When B recovers, C replays the hinted writes back to B. This maintains write availability during transient failures.
For extended failures: anti-entropy repair. Nodes periodically exchange Merkle trees of their data. A Merkle tree is a hash tree where leaf nodes hash individual key ranges. Comparing root hashes detects divergence without transferring all data. Only the differing leaf ranges need synchronization.
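A two-level simplification of the Merkle comparison, assuming integer keys bucketed into fixed ranges; a real tree compares root hashes first and descends only into mismatched subtrees:

```python
import hashlib

def leaf_hashes(data, ranges):
    """Hash each key range of a replica's data.
    data: {int_key: value}; ranges: list of (lo, hi) half-open intervals."""
    hashes = []
    for lo, hi in ranges:
        chunk = sorted((k, v) for k, v in data.items() if lo <= k < hi)
        hashes.append(hashlib.sha256(repr(chunk).encode()).hexdigest())
    return hashes

def diverging_ranges(a, b, ranges):
    """Compare leaf hashes of two replicas; return only the ranges that
    need synchronization. No key-value data crosses the wire, only hashes."""
    ha, hb = leaf_hashes(a, ranges), leaf_hashes(b, ranges)
    return [r for r, x, y in zip(ranges, ha, hb) if x != y]

ranges = [(0, 100), (100, 200)]
replica_a = {5: "v1", 150: "v1"}
replica_b = {5: "v1", 150: "v2"}   # diverged in the second range
print(diverging_ranges(replica_a, replica_b, ranges))  # [(100, 200)]
```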
Request Routing
How does a client know which node to contact? Three options:
- Client-side routing: Client holds a copy of the ring and contacts the coordinator directly. Low latency. Used by Cassandra drivers.
- Router tier: Dedicated routing nodes know the ring, proxy requests. Simpler client. Extra hop.
- Any-node routing: Client contacts any node; that node forwards to the coordinator. Simplest client, two hops.
Trade-offs Summary
| Decision | Option A | Option B |
|---|---|---|
| Consistency | Quorum (W+R>N) — strong but slower | W=1,R=1 — fast but stale reads |
| Conflict resolution | LWW — simple, rare data loss | Vector clocks — merge-friendly, complex |
| Storage engine | LSM — write-optimized, compaction overhead | B-tree — read-optimized, random write penalty |
| Failure handling | Hinted handoff — fast recovery | Synchronous replication — no hints needed |
Interview Follow-ups
- How do you handle hot keys? (celebrity keys where one key gets 10% of all traffic)
- How would you add range scan support? What changes in the architecture?
- Explain how compaction works and when you’d choose size-tiered vs leveled compaction.
- How does your system handle a network partition between two data centers?
- How would you implement TTL (key expiration) efficiently?
Related System Design Topics
- Consistent Hashing — the ring that maps keys to nodes; virtual nodes for balanced load
- CAP Theorem — tunable W+R quorums let you choose your consistency/availability point
- Database Sharding — sharding strategies and how consistent hashing differs from range-based sharding
- Caching Strategies — Redis is essentially a managed distributed KV store with richer data types
- Message Queues — async replication and gossip event propagation use queue semantics
See also: Design Google Docs: Collaborative Editing and CRDTs — distributed KV store patterns applied to document storage with CRDT conflict resolution.