Key-Value Store System: Overview
A key-value store is the simplest distributed database: keys map to opaque binary values. Despite its simplicity, designing one at low level requires careful choices about storage engine, hashing-based partitioning, replication topology, conflict resolution, and replica synchronization. This post covers each layer end-to-end.
Storage Engine: LSM-Tree
The Log-Structured Merge-tree (LSM-tree) is the dominant storage engine for write-heavy key-value stores (used by LevelDB, RocksDB, Cassandra).
Write Path
- Memtable: writes land in an in-memory sorted structure (red-black tree or skip list). O(log N) per write.
- WAL (Write-Ahead Log): simultaneously written to disk for crash recovery. Sequential append — very fast.
- SSTable flush: when the memtable reaches a size threshold (e.g., 64MB), it is flushed to disk as an immutable Sorted String Table (SSTable). SSTables are sorted by key and have a Bloom filter header for fast miss detection.
- Compaction: background job merges SSTables, removing overwritten values and expired TTL entries. Reduces read amplification.
Read Path
- Check memtable (O(log N)).
- Check Bloom filter of each SSTable level to skip levels that cannot contain the key.
- Binary search within SSTable (sorted file, so O(log N) per level).
- Return the most recent version found (SSTables are checked newest-first).
Read amplification: in the worst case, must check all SSTable levels. Compaction reduces level count. Bloom filters make most negative lookups O(1).
LSM-Tree vs B-Tree
- LSM-tree: better write throughput (sequential appends), higher read latency under load, needs compaction tuning.
- B-tree: better read latency (single tree traversal), lower write throughput (random in-place updates), simpler read path.
Key-value stores serving high-write workloads (time-series, event logging, caching backends) almost universally prefer LSM-tree.
Consistent Hashing for Partitioning
Consistent hashing assigns each key to a node on a conceptual ring:
- Hash each node's ID to a position on a 0 to 2^32-1 ring.
- Hash each key to a position on the same ring.
- Assign the key to the first node encountered clockwise from the key's hash position.
When a node is added or removed, only keys adjacent to that node's ring position are remapped — typically 1/N of all keys, not a full reshuffle.
Virtual Nodes
A physical node maps to V virtual node positions on the ring (V = 150-200 per node is typical). This distributes load more evenly and allows nodes with more capacity to hold more virtual nodes. When a physical node fails, its virtual nodes' keys are distributed across the remaining ring segments evenly rather than all landing on one neighbor.
Replication: N, W, R Quorum
The store replicates each key to N nodes (the replication factor, typically 3):
- The coordinator routes writes to the N nodes responsible for the key's ring segment (the preference list).
- W (write quorum): the coordinator waits for W acknowledgments before returning success. W=2 with N=3 ensures at least majority confirmation.
- R (read quorum): the coordinator reads from R nodes and returns the most recent version. R=2 with N=3.
- W + R > N guarantees that read and write quorums overlap, ensuring at least one node in the read set has the latest write.
Common configuration: N=3, W=2, R=2. This tolerates one node failure on both reads and writes while maintaining strong consistency.
For high-availability at the cost of consistency: W=1, R=1 (last-write-wins on conflict). For strong consistency: W=3, R=1 (writes must reach all nodes).
Vector Clocks for Conflict Resolution
When two writes to the same key arrive at different replicas concurrently (no happens-before relationship), the store has conflicting versions.
A vector clock is a list of (node_id, counter) pairs. On each write, the writing node increments its own counter:
node_A writes key K: clock = [(A, 1)]
node_B concurrently writes key K: clock = [(B, 1)]
-- Both clocks are incomparable: conflict
The store preserves both versions as siblings. On the next read, the client (or the store itself) must resolve the conflict — common strategies:
- Last-write-wins (LWW): use wall clock timestamp to pick the latest write. Simple but loses the other write.
- Client-side merge: return both siblings to the client, which applies application-specific merge logic (e.g., CRDT union for sets).
- Server-side CRDT: the value type itself is a CRDT (counter, set, register) and merges are always correct.
Read-Repair
When a quorum read fetches R responses and detects that some replicas are stale (lower vector clock version), it:
- Returns the most recent version to the client immediately.
- Asynchronously sends the latest version to the stale replica to repair it.
Read-repair is triggered by every read and keeps replicas converging toward consistency without a dedicated sync process. It is most effective when read traffic is high and covers all keys. For cold keys, anti-entropy is needed.
Anti-Entropy with Merkle Trees
For keys that are rarely or never read, read-repair never fires and replicas can diverge. Anti-entropy uses Merkle trees for efficient replica comparison:
- Each replica computes a Merkle tree over its key space. Leaf nodes are hashes of individual key-value pairs. Internal nodes hash their children.
- Replicas exchange root hashes. If roots match, replicas are in sync — no data transfer needed.
- If roots differ, recursively compare child nodes to identify the differing subtrees (key ranges).
- Sync only the differing key-value pairs. This bounds data transfer to O(diff size) rather than O(total data).
Anti-entropy runs as a background process on a configurable interval (e.g., every hour) between each pair of replica nodes.
TTL and Key Expiry
Per-key TTL is stored alongside the value. On read, if current_time > expiry, return key-not-found and write a tombstone. During compaction, the compaction process discards entries with expired TTL. Tombstones are propagated to replicas to prevent expired keys from being resurrected on replica sync.
SQL Schema (Metadata Layer)
-- Consistent hash ring configuration
CREATE TABLE PartitionConfig (
ring_position BIGINT PRIMARY KEY, -- position on the 2^32 ring
node_id VARCHAR(64) NOT NULL,
virtual_node INT NOT NULL,
is_active BOOLEAN NOT NULL DEFAULT TRUE
);
CREATE INDEX idx_partconfig_node ON PartitionConfig(node_id);
-- Replication group assignments
CREATE TABLE ReplicationGroup (
key_hash_start BIGINT NOT NULL,
key_hash_end BIGINT NOT NULL,
replica_nodes JSONB NOT NULL, -- ["node_1", "node_2", "node_3"]
PRIMARY KEY (key_hash_start, key_hash_end)
);
Python Implementation
import hashlib
import time
import json
from bisect import bisect_left, insort
from typing import Optional, List, Dict, Tuple
class VectorClock:
def __init__(self, clock: Optional[Dict[str, int]] = None):
self.clock: Dict[str, int] = clock or {}
def increment(self, node_id: str) -> "VectorClock":
new_clock = dict(self.clock)
new_clock[node_id] = new_clock.get(node_id, 0) + 1
return VectorClock(new_clock)
def dominates(self, other: "VectorClock") -> bool:
"""Return True if self is strictly newer than other."""
return all(self.clock.get(n, 0) >= other.clock.get(n, 0)
for n in set(self.clock) | set(other.clock)) and
any(self.clock.get(n, 0) > other.clock.get(n, 0)
for n in set(self.clock) | set(other.clock))
def concurrent(self, other: "VectorClock") -> bool:
"""Return True if neither clock dominates — conflict."""
return not self.dominates(other) and not other.dominates(self)
def to_dict(self) -> dict:
return self.clock
class ConsistentHashRing:
def __init__(self, virtual_nodes: int = 150):
self.virtual_nodes = virtual_nodes
self.ring: List[int] = [] # sorted hash positions
self.node_map: Dict[int, str] = {} # hash_position -> node_id
def add_node(self, node_id: str) -> None:
for i in range(self.virtual_nodes):
key = f"{node_id}:vnode:{i}"
pos = int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**32)
insort(self.ring, pos)
self.node_map[pos] = node_id
def remove_node(self, node_id: str) -> None:
for i in range(self.virtual_nodes):
key = f"{node_id}:vnode:{i}"
pos = int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**32)
if pos in self.node_map:
self.ring.remove(pos)
del self.node_map[pos]
def get_nodes(self, key: str, n: int = 3) -> List[str]:
"""Return the N nodes responsible for the key (preference list)."""
if not self.ring:
return []
pos = int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**32)
idx = bisect_left(self.ring, pos) % len(self.ring)
result = []
seen_physical = set()
for i in range(len(self.ring)):
node_id = self.node_map[self.ring[(idx + i) % len(self.ring)]]
if node_id not in seen_physical:
seen_physical.add(node_id)
result.append(node_id)
if len(result) == n:
break
return result
class KVStore:
def __init__(self, node_id: str, ring: ConsistentHashRing,
n: int = 3, w: int = 2, r: int = 2):
self.node_id = node_id
self.ring = ring
self.N, self.W, self.R = n, w, r
self.local_store: dict = {} # key -> (value, VectorClock, expiry)
def put(self, key: str, value: bytes, ttl: Optional[int] = None) -> bool:
"""Write key-value with optional TTL (seconds). Returns True on quorum ack."""
replicas = self.ring.get_nodes(key, self.N)
expiry = time.time() + ttl if ttl else None
acks = self.replicate_write(key, value, expiry, replicas)
return acks >= self.W
def get(self, key: str) -> Optional[bytes]:
"""Read key from R replicas. Apply read-repair if versions differ."""
replicas = self.ring.get_nodes(key, self.N)
responses = self.gather_read(key, replicas, self.R)
if not responses:
return None
best_value, best_clock = self.resolve_responses(responses)
# Trigger read-repair asynchronously for stale replicas
self.read_repair(key, best_value, best_clock, replicas, responses)
return best_value
def delete(self, key: str) -> bool:
"""Delete by writing a tombstone."""
return self.put(key, b'__TOMBSTONE__', ttl=86400) # TTL ensures tombstone cleanup
def replicate_write(self, key: str, value: bytes,
expiry: Optional[float], replicas: List[str]) -> int:
"""Send write to replicas; return count of acks received."""
new_clock = self.get_current_clock(key).increment(self.node_id)
acks = 0
for replica in replicas:
if self._send_write(replica, key, value, new_clock, expiry):
acks += 1
return acks
def _send_write(self, node_id: str, key: str, value: bytes,
clock: VectorClock, expiry: Optional[float]) -> bool:
"""Write to local store if this node, else RPC to remote node."""
if node_id == self.node_id:
self.local_store[key] = (value, clock, expiry)
return True
try:
rpc_client.write(node_id, key, value, clock.to_dict(), expiry)
return True
except Exception:
return False
def gather_read(self, key: str, replicas: List[str],
r: int) -> List[Tuple[bytes, VectorClock]]:
"""Fetch from R replicas, return list of (value, clock) responses."""
responses = []
for replica in replicas:
if len(responses) >= r:
break
try:
val, clock_dict, expiry = rpc_client.read(replica, key)
if expiry and time.time() > expiry:
continue # expired
if val != b'__TOMBSTONE__':
responses.append((val, VectorClock(clock_dict)))
except Exception:
continue
return responses
def resolve_responses(self, responses: List[Tuple[bytes, VectorClock]]) -> Tuple:
"""Return the response with the highest (most recent) vector clock."""
best = responses[0]
for val, clock in responses[1:]:
if clock.dominates(best[1]):
best = (val, clock)
return best
def read_repair(self, key: str, best_value: bytes,
best_clock: VectorClock,
replicas: List[str],
responses: List[Tuple]) -> None:
"""Asynchronously repair stale replicas."""
latest_clocks = {r[1] for r in responses}
for replica in replicas:
if replica == self.node_id:
stored = self.local_store.get(key)
if stored and best_clock.dominates(stored[1]):
self.local_store[key] = (best_value, best_clock, stored[2])
# In production: send async RPC to stale replicas
def get_current_clock(self, key: str) -> VectorClock:
entry = self.local_store.get(key)
return entry[1] if entry else VectorClock()
Key Design Decisions Summary
- LSM-tree provides high write throughput via sequential WAL appends and deferred compaction, at the cost of read amplification.
- Virtual nodes (150+ per physical node) ensure even load distribution and graceful rebalancing on node failure or addition.
- W+R>N guarantees that at least one node in every read quorum has seen the latest write — the foundation of tunable consistency.
- Vector clocks detect concurrent writes precisely; LWW is simpler but discards one write silently.
- Merkle tree anti-entropy bounds the sync cost to O(diff) and is essential for ensuring cold keys eventually converge across replicas.
See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering
See also: Anthropic Interview Guide 2026: Process, Questions, and AI Safety
See also: Databricks Interview Guide 2026: Spark Internals, Delta Lake, and Lakehouse Architecture