System Design Interview: Design a Key-Value Store (Redis / DynamoDB)

What Is a Distributed Key-Value Store?

A key-value store maps arbitrary keys to values with O(1) average operations. Distributed key-value stores (Redis Cluster, DynamoDB, Cassandra, etcd) partition data across nodes for horizontal scalability and replicate for fault tolerance. Core trade-offs involve the CAP theorem: choose between consistency and availability when network partitions occur.

    Data Partitioning

    Consistent Hashing

    Map both nodes and keys onto a ring of hash values (0 to 2^64 − 1). A key is assigned to the first node clockwise from its hash position. Virtual nodes (vnodes): each physical node gets 150–200 virtual positions on the ring, distributing load more evenly and reducing the impact of adding or removing nodes. Adding a node: only keys between the new node and its predecessor migrate, so O(K/N) keys move, not O(K). Without consistent hashing: naive modulo hashing (hash(key) % N) changes the modulus to N+1 when a node is added and remaps nearly all K keys (O(K) migration).
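    The ring lookup above can be sketched in Python. This is a toy illustration: the class and node names are invented, real systems pick their own hash function and vnode count, and replication is omitted.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # 64-bit ring position derived from MD5 (illustrative choice)
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

class HashRing:
    def __init__(self, nodes, vnodes=150):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # place `vnodes` virtual positions for this physical node
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def lookup(self, key):
        # first node clockwise from the key's hash position (wraps around)
        i = bisect.bisect(self._ring, (_hash(key),))
        return self._ring[i % len(self._ring)][1]
```

    Adding a fourth node to a three-node ring relocates only the keys on the arcs it takes over, roughly K/N of them, which is the O(K/N) migration described above.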

    Replication

    Replicate each key to N successive nodes clockwise on the ring (replication factor N=3 is typical; data then survives 2 node failures). Write quorum W: how many replicas must acknowledge before a write is considered successful. Read quorum R: how many replicas must respond before a read is returned. Strong consistency: R + W > N, so read and write quorums overlap in at least one replica (e.g., R=2, W=2, N=3). Eventual consistency: W=1, R=1 is fast, but stale reads are possible. Tunable consistency: Cassandra lets the caller specify a ConsistencyLevel per request; DynamoDB exposes a per-read ConsistentRead option.
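    The quorum arithmetic can be simulated with an in-process toy (illustrative only; the class name is invented, real stores replicate over a network and repair stale replicas asynchronously):

```python
import random

class QuorumStore:
    """Toy quorum replication over N in-process replica dicts."""
    def __init__(self, n=3, r=2, w=2):
        self.n, self.r, self.w = n, r, w
        self.replicas = [dict() for _ in range(n)]  # key -> (version, value)
        self.version = 0

    def put(self, key, value):
        self.version += 1
        # write to W replicas; the rest stay stale until repair
        for rep in random.sample(self.replicas, self.w):
            rep[key] = (self.version, value)

    def get(self, key):
        # read R replicas and return the highest-versioned value
        answers = [rep.get(key, (0, None))
                   for rep in random.sample(self.replicas, self.r)]
        return max(answers)[1]
```

    With R=2 and W=2 over N=3, any read quorum intersects any write quorum in at least one replica, so the latest version always surfaces; with R=W=1 the same code can return stale values.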

    Storage Engine

    LSM-Tree (Log-Structured Merge-Tree): the standard for write-heavy KV stores (LevelDB, RocksDB, Cassandra).

    1. Writes go to WAL (crash recovery) + in-memory MemTable (sorted)
    2. MemTable flushed to SSTable (immutable sorted file) when full
    3. SSTables compacted in background — merge-sort, remove tombstones
    4. Read: check MemTable, then L0 SSTables, then L1, L2… (bloom filters accelerate misses)
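    The write/flush/read path in steps 1–4 can be sketched as a minimal in-memory model (the WAL, leveled compaction, and bloom filters are omitted; names and the tiny flush threshold are invented for illustration):

```python
class TinyLSM:
    """Minimal LSM sketch: a MemTable flushed to immutable SSTables."""
    TOMBSTONE = object()  # marker written in place of a deleted value

    def __init__(self, memtable_limit=4):
        self.memtable = {}   # in-memory writes, sorted at flush time
        self.sstables = []   # newest first; each an immutable sorted dict
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self._flush()

    def delete(self, key):
        self.put(key, self.TOMBSTONE)  # deletes are tombstone writes

    def get(self, key):
        # check MemTable first, then SSTables newest to oldest
        for table in [self.memtable, *self.sstables]:
            if key in table:
                v = table[key]
                return None if v is self.TOMBSTONE else v
        return None

    def _flush(self):
        # persist the MemTable as a sorted, immutable SSTable
        self.sstables.insert(0, dict(sorted(self.memtable.items())))
        self.memtable = {}
```

    Note that a deleted key still occupies space as a tombstone until compaction (not modeled here) merges the SSTables and drops it.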

    B-Tree (InnoDB, PostgreSQL): better for read-heavy workloads. In-place updates, fewer amplification issues. Higher write cost than LSM.

    Conflict Resolution

    Concurrent writes to the same key on different replicas create conflicts. Strategies:

    • Last-Write-Wins (LWW): the highest timestamp wins. Simple, but concurrent writes are silently dropped and clock skew can pick the wrong winner
    • Vector clocks: each replica increments its own counter on write. A conflict is detected when two clocks are incomparable; the client resolves it (Amazon Dynamo's shopping cart)
    • CRDTs: data type designed to merge automatically (G-Counter, 2P-Set)
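    Vector-clock comparison, the heart of conflict detection, fits in a few lines (a sketch; clocks are modeled as plain dicts mapping replica IDs to counters):

```python
def compare(vc_a: dict, vc_b: dict) -> str:
    """Compare two vector clocks: 'before', 'after', 'equal', or
    'concurrent' (incomparable, i.e. a conflict needing resolution)."""
    nodes = set(vc_a) | set(vc_b)
    a_le_b = all(vc_a.get(n, 0) <= vc_b.get(n, 0) for n in nodes)
    b_le_a = all(vc_b.get(n, 0) <= vc_a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"
```

    Two clocks where each leads on a different replica's counter are concurrent, which is exactly the case LWW would silently collapse and a vector-clock system hands back to the client.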

    Failure Detection

    Gossip protocol: each node periodically sends its local view of which nodes are up/down to a random neighbor. Rumors spread exponentially — after O(log N) rounds, all nodes know about a failure. Suspicion mechanism: mark nodes as SUSPECT before FAILED to handle slow nodes. Phi-accrual failure detector: instead of binary up/down, outputs a probability of failure that increases with time since last heartbeat. Used by Cassandra.
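    A heavily simplified phi-accrual sketch (assuming an exponential inter-arrival model for brevity; Cassandra's detector estimates a fuller heartbeat distribution, and the class name here is invented):

```python
import math

class PhiAccrualDetector:
    """Phi grows with silence since the last heartbeat, scaled by the
    mean observed heartbeat gap. Higher phi = higher failure suspicion."""
    def __init__(self):
        self.intervals = []
        self.last = None

    def heartbeat(self, now: float):
        if self.last is not None:
            self.intervals.append(now - self.last)
        self.last = now

    def phi(self, now: float) -> float:
        mean = sum(self.intervals) / len(self.intervals)
        # P(heartbeat still pending | node alive), exponential model
        p_later = math.exp(-(now - self.last) / mean)
        return -math.log10(p_later)
```

    A consumer then picks a threshold (Cassandra's default phi_convict_threshold is 8) rather than a fixed timeout, so suspicion adapts to each node's observed heartbeat rhythm.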

    Interview Framework

    1. Partitioning: consistent hashing with virtual nodes
    2. Replication: N=3, configurable R and W quorums
    3. Storage: LSM-Tree for write-heavy, B-Tree for read-heavy
    4. Consistency: CAP trade-off, vector clocks for conflict detection
    5. Failure detection: gossip + phi-accrual

    FAQ

    How does consistent hashing prevent hotspots in a distributed KV store?

    In simple modulo hashing (key % N), adding or removing a node remaps nearly all K keys, which is expensive. Consistent hashing maps both nodes and keys to a hash ring (0 to 2^64 − 1), and a key is stored on the first node clockwise from its hash. Adding a node moves only the keys between the new node and its predecessor; removing a node moves only that node's keys to its successor. Roughly K/N keys move in either case. Without virtual nodes, nodes land unevenly on the ring, and a node with a large arc serves more keys. With virtual nodes, each physical node gets 150–200 random positions on the ring, so load is distributed proportionally to vnode count. For the hot key problem (one key getting disproportionate traffic), consistent hashing alone doesn't help: hot keys need application-level sharding (e.g., appending a random suffix to the key) or client-side request coalescing.

    How does an LSM-Tree work and why is it better than a B-Tree for write-heavy workloads?

    A B-Tree updates in place: every write finds the right page and modifies it on disk (random I/O), and one logical write may cause multiple disk writes through page splits and parent updates. In an LSM-Tree (Log-Structured Merge-Tree), all writes are sequential. A write goes to (1) the WAL (append-only on disk, for crash recovery) and (2) the MemTable (an in-memory sorted structure such as an AVL tree or skip list). When the MemTable reaches a threshold (e.g., 64 MB), it is flushed to disk as an immutable SSTable (Sorted String Table) in one fast sequential write. Background compaction merges SSTables and removes deleted keys (tombstones). A read checks the MemTable first, then SSTables from newest to oldest, with a bloom filter per SSTable to skip misses. LSM-Trees also have write amplification, since data is rewritten during compaction, but the writes stay sequential. B-Tree reads are faster (one lookup path); LSM reads may check multiple SSTables. The trade-off: LSM wins on write throughput, B-Tree on read latency, which is why RocksDB, Cassandra, and LevelDB use LSM while PostgreSQL and MySQL (InnoDB) use B-Trees.

    What is the difference between eventual consistency and strong consistency in distributed KV stores?

    Strong consistency (linearizability): every read returns the most recently written value, as if all operations executed on a single machine. It requires reads to go through a quorum or a single leader, so latency is higher (a cross-datacenter quorum adds RTT). Systems: etcd, ZooKeeper, Google Spanner, DynamoDB with ConsistentRead. Eventual consistency: reads may return stale data, but all replicas converge to the same value given enough time and no new writes. No quorum is required for reads, so latency is lower (any replica can respond). Systems: Cassandra at consistency level ONE, DynamoDB without ConsistentRead, Redis replication. Use strong consistency for financial transactions (the current balance must be exact) and leader election (exactly one leader must be seen). Use eventual consistency for social media likes/views (approximately correct is fine), user session data (slightly stale login info is acceptable), and shopping carts (merge conflicts client-side). The CAP theorem says you cannot have both strong consistency and availability during a network partition; most real systems (DynamoDB, Cassandra) offer tunable consistency per request.
