System Design Interview: Design a Distributed Key-Value Store
A distributed key-value store is one of the foundational system design questions — it tests your understanding of distributed systems fundamentals: consistency, partitioning, replication, and fault tolerance. Systems like DynamoDB, Cassandra, and Redis Cluster solve this problem. Here is the architecture and the trade-offs you need to articulate in an interview.
Requirements
Functional: get(key), put(key, value), delete(key). Keys up to 256 bytes, values up to 10 MB.
Non-functional: low latency (<10ms p99), high availability (99.99%), horizontal scalability to petabytes of data, eventual consistency acceptable (tunable).
Data Partitioning: Consistent Hashing
Distribute keys across nodes without rehashing everything when nodes join or leave.
Virtual nodes (vnodes): each physical node gets 150-300 vnodes on the hash ring. Benefits: (1) uniform key distribution even with heterogeneous hardware, (2) when a node fails, its load scatters across many nodes rather than one neighbor absorbing everything.
hash_ring: sorted array of (hash_value, node_id)
lookup(key): hash(key) → find first node clockwise on ring
add_node(new_node): take ~150 vnodes from existing nodes, redistribute only those keys
remove_node(dead_node): its vnodes transfer to the next clockwise node
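The ring operations above can be sketched in Python. This is a minimal illustration, not a production implementation: the class name, the MD5 choice, and the vnode count are assumptions for demonstration.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative sketch)."""

    def __init__(self, vnodes_per_node=150):
        self.vnodes = vnodes_per_node
        self.ring = []  # sorted list of (hash_value, node_id)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node_id):
        # Each vnode gets its own position; only keys between a new vnode
        # and its predecessor need to move.
        for i in range(self.vnodes):
            h = self._hash(f"{node_id}#vnode{i}")
            bisect.insort(self.ring, (h, node_id))

    def remove_node(self, node_id):
        # The dead node's vnodes vanish; each affected key range falls
        # to the next clockwise vnode automatically.
        self.ring = [(h, n) for (h, n) in self.ring if n != node_id]

    def lookup(self, key):
        # First vnode clockwise from the key's hash position (wraps around).
        h = self._hash(key)
        idx = bisect.bisect_right(self.ring, (h, ""))
        if idx == len(self.ring):
            idx = 0
        return self.ring[idx][1]
```

Because each physical node owns many scattered vnodes, removing one node reassigns its key ranges to many different successors rather than a single neighbor.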
Replication
Replicate each key to N nodes (typically N=3) for fault tolerance. With consistent hashing: the primary node plus the next N-1 clockwise nodes in the ring are the replica set for a key.
Quorum reads and writes: with N=3 replicas, W write quorum, R read quorum:
- Strong consistency: W + R > N (e.g., W=2, R=2)
- High availability: W=1, R=1 — eventual consistency, lowest latency (in spirit, this is what DynamoDB's default eventually consistent reads give you)
- Write-heavy optimization: W=1, R=3 — fast writes, strong reads
The coordinator node (receives the client request) sends the operation to all N replicas in parallel and returns after W (or R) acknowledge.
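The coordinator's write path can be sketched as follows. The replica objects and their `put` method are a hypothetical interface for illustration; the point is returning as soon as W acknowledgments arrive, tolerating failures of the rest.

```python
import concurrent.futures

def quorum_write(replicas, key, value, version, w):
    """Send the write to all N replicas in parallel; succeed once W acknowledge.

    `replicas` is a list of objects exposing put(key, value, version) —
    an assumed interface for this sketch.
    """
    acks = 0
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        futures = [pool.submit(r.put, key, value, version) for r in replicas]
        for fut in concurrent.futures.as_completed(futures):
            try:
                fut.result()
                acks += 1
            except Exception:
                # A failed replica doesn't fail the write as long as W others ack.
                continue
            if acks >= w:
                return True
    return False  # fewer than W replicas acknowledged
```

A quorum read works the same way with R in place of W, returning the freshest of the R responses.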
Storage Engine
On each node, use an LSM-tree (Log-Structured Merge-tree):
- Write path: write to WAL (write-ahead log) for durability, then to an in-memory MemTable (sorted by key, ~64MB)
- Flush: when MemTable fills, flush to an immutable SSTable (Sorted String Table) on disk
- Compaction: background process merges SSTables, removes deleted/overwritten keys, maintains sorted order
- Read path: check MemTable → check Bloom filter for each SSTable → read from disk if needed
Bloom filters eliminate unnecessary disk reads: if the filter says the key is not in an SSTable, skip it (false positive rate ~1%). This makes LSM reads nearly as fast as B-tree reads despite the multi-layer structure.
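A toy Bloom filter makes the "definitely absent" guarantee concrete. This sketch uses SHA-256 with per-hash salts and a big integer as the bit array; real storage engines size the bit array and hash count from the target false-positive rate.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter sketch: k salted hashes over an m-bit array."""

    def __init__(self, m_bits=1024, k_hashes=3):
        self.m = m_bits
        self.k = k_hashes
        self.bits = 0  # big int used as the bit array

    def _positions(self, key):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        # False means "definitely not in this SSTable" — safe to skip the disk read.
        # True means "probably present" (small false-positive chance).
        return all((self.bits >> p) & 1 for p in self._positions(key))
```

On the read path, each SSTable keeps such a filter in memory; a negative answer skips the SSTable entirely, a positive one triggers the actual disk lookup.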
Conflict Resolution: Vector Clocks
With eventual consistency, a key can be written to multiple replicas concurrently before they sync. Detecting conflicts requires vector clocks:
vector_clock: {node_id: sequence_number}
# Write on node A: increment A's counter
# Example: after 3 writes across nodes A,B,C:
value1: {A:2, B:1} # written mostly on A
value2: {A:1, B:2} # written mostly on B
# These are concurrent (neither dominates) → conflict
# Resolution: last-write-wins (timestamp), or return both to client for application-level merge
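The dominance check in the example above reduces to a pairwise comparison of counters. A minimal sketch (dict-based clocks, assumed representation):

```python
def compare(vc_a, vc_b):
    """Compare two vector clocks (dicts of node_id -> counter).

    Returns 'a>b' if a dominates, 'b>a' if b dominates, 'equal',
    or 'concurrent' when neither dominates (a conflict).
    """
    nodes = set(vc_a) | set(vc_b)
    a_ge = all(vc_a.get(n, 0) >= vc_b.get(n, 0) for n in nodes)
    b_ge = all(vc_b.get(n, 0) >= vc_a.get(n, 0) for n in nodes)
    if a_ge and b_ge:
        return "equal"
    if a_ge:
        return "a>b"
    if b_ge:
        return "b>a"
    return "concurrent"  # neither dominates -> surface both values to the client
```

Applied to the example: {A:2, B:1} vs. {A:1, B:2} is "concurrent", so the store must either pick a winner (last-write-wins) or return both siblings.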
The original Dynamo paper used vector clocks; AWS DynamoDB simplified this to “last writer wins” by default, with optional conditional updates for optimistic locking.
Handling Failures
Hinted Handoff
When a replica node is temporarily down, the coordinator writes to a healthy node as a “hint.” When the failed node recovers, the hint is replayed to bring it up to date. This maintains write availability (sloppy quorum) during node failures.
Anti-Entropy with Merkle Trees
To detect data divergence between replicas without comparing every key:
- Each node builds a Merkle tree over its key space (leaf = hash of key-value, internal nodes = hash of children)
- Nodes exchange tree roots periodically; if roots differ, traverse to find divergent subtrees
- Sync only the divergent key ranges — typically O(differences) not O(total keys)
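The tree construction and root comparison can be sketched as below. For brevity this version compares leaves directly once the roots differ; a real implementation descends level by level so only divergent subtrees are examined.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_merkle(leaves):
    """Build a Merkle tree bottom-up from a list of leaf hashes.

    Returns a list of levels: levels[0] is the leaves, levels[-1] is [root].
    An odd node at the end of a level is paired with itself.
    """
    levels = [leaves]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        nxt = []
        for i in range(0, len(prev), 2):
            right = prev[i + 1] if i + 1 < len(prev) else prev[i]
            nxt.append(h(prev[i] + right))
        levels.append(nxt)
    return levels

def diff_leaves(tree_a, tree_b):
    """Return indices of divergent leaves; cheap root check first."""
    if tree_a[-1] == tree_b[-1]:
        return []  # roots match -> replicas agree on this key range
    return [i for i, (a, b) in enumerate(zip(tree_a[0], tree_b[0])) if a != b]
```

Two replicas exchange only the root (one hash); a full leaf comparison happens only when the roots disagree, which is what keeps anti-entropy cheap in the common no-divergence case.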
API Design
PUT /keys/{key} body: {value, ttl_seconds (optional)} → 200 OK
GET /keys/{key} → 200 {value, version} or 404
DELETE /keys/{key} → 200 OK (tombstone written, not immediate deletion)
Client SDK: consistent_put(key, value, quorum=W)
consistent_get(key, quorum=R, read_repair=True)
Read repair: when a GET reads from R replicas and finds divergent values, it asynchronously updates stale replicas with the latest value. This passively heals inconsistency without a separate background job.
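A read-repair pass can be sketched as follows. The replica objects, their `get`/`put` methods, and the integer version ordering are assumptions for illustration (a real store would compare vector clocks and repair asynchronously).

```python
def quorum_get_with_read_repair(replicas, key, r):
    """Read from R replicas; push the newest value to any stale ones.

    Each replica is assumed to expose get(key) -> (value, version) and
    put(key, value, version), with a higher version meaning newer.
    """
    responses = []
    for rep in replicas:
        responses.append((rep, rep.get(key)))
        if len(responses) >= r:
            break
    # Pick the freshest response among the R collected.
    latest_rep, (latest_val, latest_ver) = max(responses, key=lambda t: t[1][1])
    # Repair: overwrite stale replicas with the latest value.
    for rep, (_, ver) in responses:
        if ver < latest_ver:
            rep.put(key, latest_val, latest_ver)
    return latest_val
```

Because repair happens on the read path itself, frequently read keys converge quickly without waiting for the next anti-entropy round.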
Comparison: DynamoDB vs. Redis Cluster vs. Cassandra
| Feature | DynamoDB | Redis Cluster | Cassandra |
|---|---|---|---|
| Consistency | Eventual (tunable) | Eventual | Eventual (tunable) |
| Storage engine | B-tree (internals not public) | Hash table (in-memory) | LSM-tree |
| Max value size | 400 KB | 512 MB | 2 GB |
| Latency | <10ms | <1ms | <10ms |
| Use case | General purpose | Caching, sessions | Time-series, write-heavy |
Interview Tips
- Open with CAP theorem: you're designing an AP system (availability + partition tolerance), accepting eventual consistency
- Explain consistent hashing and why virtual nodes matter — this is a signal of depth
- Walk through the write path: client → coordinator → W replicas → WAL → MemTable → SSTable
- Mention Bloom filters when discussing reads — shows you know the storage engine internals
- For conflict resolution: “last-write-wins is simple; vector clocks are needed when you can't afford data loss”