System Design: Design a Key-Value Store (like Redis or DynamoDB)
A key-value store is the simplest and most widely used NoSQL data model. Designing one from scratch is a common system design interview question that tests your knowledge of hashing, replication, consistency, and persistence.
Requirements
Functional: put(key, value), get(key), delete(key). Optional: TTL expiry, range queries, transactions.
Non-functional: low latency (P99 < 10ms), high availability (99.99%), eventual or strong consistency (clarify!), durability (survive restarts), horizontal scalability to terabytes.
Single-Node Design
Start simple: in-memory hash map. O(1) get/put. Issues: limited by RAM, data lost on restart.
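The single-node starting point can be sketched in a few lines (illustrative Python; the class and method names are my own, not from any particular system):

```python
class KVStore:
    """Minimal in-memory key-value store: O(1) average put/get/delete."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)  # None if the key is missing

    def delete(self, key):
        # Returns True if the key existed, False otherwise
        return self._data.pop(key, None) is not None
```

Everything else in the design (persistence, partitioning, replication) is about working around this version's two weaknesses: data fits in RAM, and a restart loses everything.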
Persistence options:
- Append-only log (WAL) — write every operation to disk sequentially before returning success. Fast writes, slow recovery (replay entire log).
- Periodic snapshot (RDB) — dump full state to disk every N seconds. Fast recovery, potential data loss between snapshots. Redis uses both.
- LSM tree + SSTable — write to in-memory memtable, flush sorted to disk as immutable SSTables, compact in background. Used by LevelDB, RocksDB, Cassandra. Excellent write throughput.
- B-tree — update data in-place on disk pages. Used by PostgreSQL, MySQL. Better read performance, slower writes due to random I/O.
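The first option above, a hash map backed by an append-only log, can be sketched as follows (a toy illustration with JSON-encoded records; real engines use binary formats, checksums, and log compaction):

```python
import json
import os

class DurableKV:
    """Hash map + write-ahead log: every mutation is appended to disk
    before being acknowledged; recovery replays the log in order."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        self._replay()                      # slow part: re-read whole log
        self.log = open(path, "a")

    def _replay(self):
        if not os.path.exists(self.path):
            return
        with open(self.path) as f:
            for line in f:
                op = json.loads(line)
                if op["op"] == "put":
                    self.data[op["k"]] = op["v"]
                else:                       # delete
                    self.data.pop(op["k"], None)

    def _append(self, record):
        self.log.write(json.dumps(record) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())         # durability before ack

    def put(self, k, v):
        self._append({"op": "put", "k": k, "v": v})
        self.data[k] = v

    def delete(self, k):
        self._append({"op": "del", "k": k})
        self.data.pop(k, None)

    def get(self, k):
        return self.data.get(k)
```

Note how the trade-off from the bullet list shows up directly: writes are a sequential append (fast), but `_replay` must re-read the entire log on startup (slow recovery), which is exactly what periodic snapshots are meant to bound.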
Distributed Design
Partitioning (Sharding)
Distribute keys across nodes using consistent hashing: nodes and keys are both hashed onto a ring, and each key belongs to the first node clockwise from its hash. Virtual nodes (typically 100-200 per physical node) smooth out the distribution. When a node is added, only the keys in the ranges it takes over migrate (roughly 1/n of the total); all other keys stay put.
```python
# Key → partition → node mapping. This is the fixed-partition variant:
# hash into a fixed number of partitions (so the modulo never changes),
# then move whole partitions between nodes as the cluster grows.
partition_id = hash(key) % num_partitions
node = partition_to_node_map[partition_id]
```
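The hash-ring variant with virtual nodes can be sketched like this (illustrative Python; MD5 and 150 vnodes are arbitrary choices here, and real implementations add replication-aware successor walks):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Consistent hashing with virtual nodes: each physical node hashes
    to many ring positions; a key is owned by the first node clockwise
    from the key's hash."""

    def __init__(self, vnodes=150):
        self.vnodes = vnodes
        self.ring = []              # sorted list of (hash, node)

    def _hash(self, s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def add_node(self, node):
        for i in range(self.vnodes):
            self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()

    def remove_node(self, node):
        self.ring = [(h, n) for h, n in self.ring if n != node]

    def get_node(self, key):
        h = self._hash(key)
        # First ring position at or after the key's hash, wrapping around
        idx = bisect.bisect(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]
```

The payoff is the migration property described above: adding a node reassigns only the keys that now fall in its ranges, and every reassigned key moves to the new node, not between existing ones.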
Replication
Each partition has N replicas (N=3 is common). Two strategies:
- Leader-follower (primary-replica): all writes go to the leader, which replicates them to followers. Simple; reads from the leader are strongly consistent. Used by Redis (with Sentinel handling failover) and, internally, by DynamoDB (one leader per partition).
- Leaderless (Dynamo-style): the client or a coordinator writes to W nodes and reads from R nodes. No single point of failure. W + R > N for strong consistency. Used by the original Dynamo design, Cassandra, and Riak.
Consistency Model
Use quorum reads/writes to tune consistency vs availability:
- N=3, W=3, R=1 → write to all, fast reads, high write latency
- N=3, W=1, R=3 → fast writes, slow reads
- N=3, W=2, R=2 → balanced, W+R=4>3 guarantees overlap
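A toy illustration of why W + R = 4 > N = 3 guarantees overlap: a quorum read collects (value, version) pairs from the contacted replicas and returns the highest version. If every read set of size R intersects the write set of size W, at least one contacted replica holds the latest write. (Versions here stand in for vector clocks or timestamps.)

```python
from itertools import combinations

def quorum_read(responses):
    """responses: (value, version) pairs from the contacted replicas.
    Return the value carrying the highest version (the freshest)."""
    return max(responses, key=lambda vv: vv[1])[0]

# N=3, W=2: a write of version 2 reached replicas 0 and 1; replica 2 is stale.
replicas = [("new", 2), ("new", 2), ("old", 1)]

# R=2 (W+R=4 > N=3): every possible 2-replica read set sees the new value.
for read_set in combinations(replicas, 2):
    assert quorum_read(read_set) == "new"

# R=1 (W+R=3, not > N): a read may land on the lone stale replica.
assert quorum_read([replicas[2]]) == "old"
```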
Handling Failures
- Hinted handoff: when target node is down, another node temporarily stores writes with a hint to forward when target recovers
- Anti-entropy / Merkle trees: background process compares tree hashes between replicas and syncs divergent subtrees
- Vector clocks: track causality for conflict detection; last-write-wins is simpler but loses data
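Vector clock comparison and merge fit in a few lines (a minimal sketch, representing a clock as a dict of node → counter; real systems also prune old entries to bound clock size):

```python
def merge(vc_a, vc_b):
    """Merge two vector clocks: element-wise maximum of the counters."""
    return {n: max(vc_a.get(n, 0), vc_b.get(n, 0))
            for n in vc_a.keys() | vc_b.keys()}

def compare(vc_a, vc_b):
    """Return 'before', 'after', 'equal', or 'concurrent'."""
    nodes = vc_a.keys() | vc_b.keys()
    a_le_b = all(vc_a.get(n, 0) <= vc_b.get(n, 0) for n in nodes)
    b_le_a = all(vc_b.get(n, 0) <= vc_a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"   # true conflict: surface both versions to the client
```

The `"concurrent"` case is exactly the data that last-write-wins would silently discard: neither clock dominates, so the store must either keep both siblings (Riak-style) or apply an application-level merge.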
Architecture Diagram
Client
│
▼
Coordinator Node (any node in cluster)
│ consistent hash → determines N replica nodes
├──▶ Replica 1 (leader/primary)
├──▶ Replica 2
└──▶ Replica 3
Each replica: WAL + MemTable + SSTables on disk
TTL / Expiry
Two approaches:
- Lazy expiry: check expiry on each read, delete if expired. Simple, may keep dead keys in memory.
- Active expiry: background thread scans a random sample of keys with TTL and deletes expired ones. Redis uses both.
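Lazy expiry is the simpler of the two and can be sketched as follows (illustrative Python; `time.monotonic` avoids wall-clock jumps, and a real store would pair this with an active background sweep):

```python
import time

class TTLStore:
    """Key-value store with lazy expiry: each entry carries an optional
    absolute deadline; reads delete and hide entries past it."""

    def __init__(self):
        self._data = {}   # key -> (value, deadline or None)

    def put(self, key, value, ttl=None):
        deadline = time.monotonic() + ttl if ttl is not None else None
        self._data[key] = (value, deadline)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, deadline = entry
        if deadline is not None and time.monotonic() >= deadline:
            del self._data[key]   # lazy deletion: reclaim on read
            return None
        return value
```

The weakness named above is visible here: an expired key that is never read again stays in `_data` forever, which is why a periodic active sweep is still needed.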
Comparison: Redis vs DynamoDB vs Cassandra
| Dimension | Redis | DynamoDB | Cassandra |
|---|---|---|---|
| Consistency | Strong (single node) | Eventual / strong | Tunable |
| Data model | Rich types (list, set, sorted set) | Document + KV | Wide column |
| Persistence | Optional (RDB + AOF) | Always durable | Always durable |
| Latency | Sub-ms (in-memory) | Single-digit ms | Low ms |
| Scale | Cluster up to TBs | Petabytes (managed) | Petabytes |
Interview Checklist
- Clarify: strong vs eventual consistency? Read-heavy or write-heavy? TTL needed?
- Start single-node, then add partitioning and replication
- Explain your storage engine choice (hash map + WAL, LSM, B-tree) and trade-offs
- Discuss quorum W+R>N for consistency guarantees
- Address failure handling: hinted handoff, anti-entropy, conflict resolution
{
"@context": "https://schema.org",
"@type": "FAQPage",
"mainEntity": [
{
"@type": "Question",
"name": "What is an LSM tree and why do databases like Cassandra use it?",
"acceptedAnswer": { "@type": "Answer", "text": "An LSM (Log-Structured Merge) tree buffers writes in an in-memory table (memtable), then flushes sorted immutable files (SSTables) to disk when the memtable is full. Background compaction merges SSTables to reclaim space. All writes are sequential—no random disk I/O—giving excellent write throughput. Reads check the memtable, bloom filters, and SSTables in order. Cassandra, LevelDB, and RocksDB use LSM trees for write-heavy workloads." }
},
{
"@type": "Question",
"name": "How does consistent hashing minimize data movement when nodes are added or removed?",
"acceptedAnswer": { "@type": "Answer", "text": "In consistent hashing, both nodes and keys are mapped to positions on a virtual ring via a hash function. Each key is owned by the nearest node clockwise. When a node joins, it takes over keys from only its clockwise neighbor—roughly 1/n of total keys. When a node leaves, its keys shift to the next node. Compare to modular hashing (key % n) where adding a node remaps nearly all keys, causing a thundering herd on cache misses." }
},
{
"@type": "Question",
"name": "What is quorum consistency and how do you choose W and R values?",
"acceptedAnswer": { "@type": "Answer", "text": "With N replicas, writing to W nodes and reading from R nodes guarantees you see the latest write when W + R > N (the read and write sets must overlap). Common configs for N=3: W=2, R=2 (strong consistency, balanced latency); W=1, R=3 (fast writes, slow reads); W=3, R=1 (durable writes, fast reads). For eventual consistency (analytics, counters), use W=1, R=1—maximum performance, possible stale reads." }
}
]
}