What are the tradeoffs between LSM-tree and B-tree storage engines?

LSM-tree excels at write throughput: all writes are sequential (WAL + memtable), avoiding random disk seeks. Reads are more expensive because multiple SSTable levels must be checked. B-tree has better read performance (single tree traversal to the leaf) but slower writes due to in-place updates requiring random writes. LSM-tree is preferred for key-value stores with high write rates (time-series, event logging). B-tree is preferred for read-heavy workloads (OLTP databases like PostgreSQL, MySQL).

How does the quorum math work in a distributed key-value store?

With replication factor N, write quorum W, and read quorum R, the condition W + R > N ensures that the sets of nodes that acknowledge a write and the nodes queried for a read always overlap by at least one node. That overlapping node has the latest write, so every read sees the most recent value. With N=3, W=2, R=2: any two-node write set and any two-node read set must share at least one node (since 2+2 > 3). Setting W=1, R=1 allows higher availability but risks reading stale data.

How do vector clocks detect conflicts in a distributed key-value store?

A vector clock is a map of node_id to counter. Each write increments the writing node's counter. Two versions are in a happens-before relationship if one clock dominates (all counters are greater than or equal, with at least one strictly greater). If neither dominates the other, the writes are concurrent — a conflict. The store preserves both as siblings. Resolution strategies include last-write-wins (using timestamps), client-side merge, or automatic CRDT merge.

How does anti-entropy with Merkle trees work between replicas?

Each replica builds a Merkle tree over its keyspace: leaf nodes hash individual key-value pairs, and each parent node hashes its children. Two replicas compare their root hashes — if they match, the replicas are in sync. If they differ, the nodes recursively compare child hashes to identify which key ranges differ, narrowing down to only the changed keys. This makes anti-entropy bandwidth proportional to the number of differing keys rather than total data size, which is critical for large datasets with small diffs.

Why use an LSM-tree instead of a B-tree for a key-value store?

LSM-tree converts random writes into sequential appends to the memtable and WAL, achieving much higher write throughput; B-trees perform random I/O on updates which saturates disk bandwidth; LSM trades read amplification for write performance.

How does quorum-based replication ensure consistency?

With N=3 replicas, W=2 writes, R=2 reads: W+R=4>N=3 guarantees overlap — at least one replica in any read quorum has the latest write, ensuring strong consistency.

How do vector clocks detect concurrent writes?

Each replica maintains a per-node counter; a write increments the local counter; two versions are concurrent if neither vector dominates the other (neither is component-wise greater); concurrent versions require application-level or LWW conflict resolution.

How does Merkle tree anti-entropy detect and repair diverged replicas?

Each replica builds a Merkle tree of its key ranges; two replicas exchange tree roots and recursively compare subtrees to find diverged ranges; only the diverged key ranges are synchronized, minimizing data transfer.

Key-Value Store Low-Level Design: Storage Engine, Partitioning, Replication, and Consistency

⏱ 12 min read

Key-Value Store System: Overview

A key-value store is the simplest distributed database: keys map to opaque binary values. Despite its simplicity, designing one at low level requires careful choices about storage engine, hashing-based partitioning, replication topology, conflict resolution, and replica synchronization. This post covers each layer end-to-end.

Storage Engine: LSM-Tree

The Log-Structured Merge-tree (LSM-tree) is the dominant storage engine for write-heavy key-value stores (used by LevelDB, RocksDB, Cassandra).

Write Path

Memtable: writes land in an in-memory sorted structure (red-black tree or skip list). O(log N) per write.
WAL (Write-Ahead Log): simultaneously written to disk for crash recovery. Sequential append — very fast.
SSTable flush: when the memtable reaches a size threshold (e.g., 64MB), it is flushed to disk as an immutable Sorted String Table (SSTable). SSTables are sorted by key and have a Bloom filter header for fast miss detection.
Compaction: background job merges SSTables, removing overwritten values and expired TTL entries. Reduces read amplification.

Read Path

Check memtable (O(log N)).
Check Bloom filter of each SSTable level to skip levels that cannot contain the key.
Binary search within SSTable (sorted file, so O(log N) per level).
Return the most recent version found (SSTables are checked newest-first).

Read amplification: in the worst case, must check all SSTable levels. Compaction reduces level count. Bloom filters make most negative lookups O(1).

LSM-Tree vs B-Tree

LSM-tree: better write throughput (sequential appends), higher read latency under load, needs compaction tuning.
B-tree: better read latency (single tree traversal), lower write throughput (random in-place updates), simpler read path.

Key-value stores serving high-write workloads (time-series, event logging, caching backends) almost universally prefer LSM-tree.

Consistent Hashing for Partitioning

Consistent hashing assigns each key to a node on a conceptual ring:

Hash each node's ID to a position on a 0 to 2^32-1 ring.
Hash each key to a position on the same ring.
Assign the key to the first node encountered clockwise from the key's hash position.

When a node is added or removed, only keys adjacent to that node's ring position are remapped — typically 1/N of all keys, not a full reshuffle.

Virtual Nodes

A physical node maps to V virtual node positions on the ring (V = 150-200 per node is typical). This distributes load more evenly and allows nodes with more capacity to hold more virtual nodes. When a physical node fails, its virtual nodes' keys are distributed across the remaining ring segments evenly rather than all landing on one neighbor.

Replication: N, W, R Quorum

The store replicates each key to N nodes (the replication factor, typically 3):

The coordinator routes writes to the N nodes responsible for the key's ring segment (the preference list).
W (write quorum): the coordinator waits for W acknowledgments before returning success. W=2 with N=3 ensures at least majority confirmation.
R (read quorum): the coordinator reads from R nodes and returns the most recent version. R=2 with N=3.
W + R > N guarantees that read and write quorums overlap, ensuring at least one node in the read set has the latest write.

Common configuration: N=3, W=2, R=2. This tolerates one node failure on both reads and writes while maintaining strong consistency.

For high-availability at the cost of consistency: W=1, R=1 (last-write-wins on conflict). For strong consistency: W=3, R=1 (writes must reach all nodes).

Vector Clocks for Conflict Resolution

When two writes to the same key arrive at different replicas concurrently (no happens-before relationship), the store has conflicting versions.

A vector clock is a list of (node_id, counter) pairs. On each write, the writing node increments its own counter:

node_A writes key K: clock = [(A, 1)]
node_B concurrently writes key K: clock = [(B, 1)]
-- Both clocks are incomparable: conflict

The store preserves both versions as siblings. On the next read, the client (or the store itself) must resolve the conflict — common strategies:

Last-write-wins (LWW): use wall clock timestamp to pick the latest write. Simple but loses the other write.
Client-side merge: return both siblings to the client, which applies application-specific merge logic (e.g., CRDT union for sets).
Server-side CRDT: the value type itself is a CRDT (counter, set, register) and merges are always correct.

Read-Repair

When a quorum read fetches R responses and detects that some replicas are stale (lower vector clock version), it:

Returns the most recent version to the client immediately.
Asynchronously sends the latest version to the stale replica to repair it.

Read-repair is triggered by every read and keeps replicas converging toward consistency without a dedicated sync process. It is most effective when read traffic is high and covers all keys. For cold keys, anti-entropy is needed.

Anti-Entropy with Merkle Trees

For keys that are rarely or never read, read-repair never fires and replicas can diverge. Anti-entropy uses Merkle trees for efficient replica comparison:

Each replica computes a Merkle tree over its key space. Leaf nodes are hashes of individual key-value pairs. Internal nodes hash their children.
Replicas exchange root hashes. If roots match, replicas are in sync — no data transfer needed.
If roots differ, recursively compare child nodes to identify the differing subtrees (key ranges).
Sync only the differing key-value pairs. This bounds data transfer to O(diff size) rather than O(total data).

Anti-entropy runs as a background process on a configurable interval (e.g., every hour) between each pair of replica nodes.

TTL and Key Expiry

Per-key TTL is stored alongside the value. On read, if current_time > expiry, return key-not-found and write a tombstone. During compaction, the compaction process discards entries with expired TTL. Tombstones are propagated to replicas to prevent expired keys from being resurrected on replica sync.

SQL Schema (Metadata Layer)

-- Consistent hash ring configuration
CREATE TABLE PartitionConfig (
    ring_position  BIGINT PRIMARY KEY,   -- position on the 2^32 ring
    node_id        VARCHAR(64) NOT NULL,
    virtual_node   INT NOT NULL,
    is_active      BOOLEAN NOT NULL DEFAULT TRUE
);
CREATE INDEX idx_partconfig_node ON PartitionConfig(node_id);

-- Replication group assignments
CREATE TABLE ReplicationGroup (
    key_hash_start  BIGINT NOT NULL,
    key_hash_end    BIGINT NOT NULL,
    replica_nodes   JSONB NOT NULL,   -- ["node_1", "node_2", "node_3"]
    PRIMARY KEY (key_hash_start, key_hash_end)
);

Python Implementation

import hashlib
import time
import json
from bisect import bisect_left, insort
from typing import Optional, List, Dict, Tuple

class VectorClock:
    def __init__(self, clock: Optional[Dict[str, int]] = None):
        self.clock: Dict[str, int] = clock or {}

    def increment(self, node_id: str) -> "VectorClock":
        new_clock = dict(self.clock)
        new_clock[node_id] = new_clock.get(node_id, 0) + 1
        return VectorClock(new_clock)

    def dominates(self, other: "VectorClock") -> bool:
        """Return True if self is strictly newer than other."""
        return all(self.clock.get(n, 0) >= other.clock.get(n, 0)
                   for n in set(self.clock) | set(other.clock)) and 
               any(self.clock.get(n, 0) > other.clock.get(n, 0)
                   for n in set(self.clock) | set(other.clock))

    def concurrent(self, other: "VectorClock") -> bool:
        """Return True if neither clock dominates — conflict."""
        return not self.dominates(other) and not other.dominates(self)

    def to_dict(self) -> dict:
        return self.clock

class ConsistentHashRing:
    def __init__(self, virtual_nodes: int = 150):
        self.virtual_nodes = virtual_nodes
        self.ring: List[int] = []           # sorted hash positions
        self.node_map: Dict[int, str] = {}  # hash_position -> node_id

    def add_node(self, node_id: str) -> None:
        for i in range(self.virtual_nodes):
            key = f"{node_id}:vnode:{i}"
            pos = int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**32)
            insort(self.ring, pos)
            self.node_map[pos] = node_id

    def remove_node(self, node_id: str) -> None:
        for i in range(self.virtual_nodes):
            key = f"{node_id}:vnode:{i}"
            pos = int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**32)
            if pos in self.node_map:
                self.ring.remove(pos)
                del self.node_map[pos]

    def get_nodes(self, key: str, n: int = 3) -> List[str]:
        """Return the N nodes responsible for the key (preference list)."""
        if not self.ring:
            return []
        pos = int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**32)
        idx = bisect_left(self.ring, pos) % len(self.ring)
        result = []
        seen_physical = set()
        for i in range(len(self.ring)):
            node_id = self.node_map[self.ring[(idx + i) % len(self.ring)]]
            if node_id not in seen_physical:
                seen_physical.add(node_id)
                result.append(node_id)
            if len(result) == n:
                break
        return result

class KVStore:
    def __init__(self, node_id: str, ring: ConsistentHashRing,
                 n: int = 3, w: int = 2, r: int = 2):
        self.node_id = node_id
        self.ring = ring
        self.N, self.W, self.R = n, w, r
        self.local_store: dict = {}  # key -> (value, VectorClock, expiry)

    def put(self, key: str, value: bytes, ttl: Optional[int] = None) -> bool:
        """Write key-value with optional TTL (seconds). Returns True on quorum ack."""
        replicas = self.ring.get_nodes(key, self.N)
        expiry = time.time() + ttl if ttl else None
        acks = self.replicate_write(key, value, expiry, replicas)
        return acks >= self.W

    def get(self, key: str) -> Optional[bytes]:
        """Read key from R replicas. Apply read-repair if versions differ."""
        replicas = self.ring.get_nodes(key, self.N)
        responses = self.gather_read(key, replicas, self.R)
        if not responses:
            return None
        best_value, best_clock = self.resolve_responses(responses)
        # Trigger read-repair asynchronously for stale replicas
        self.read_repair(key, best_value, best_clock, replicas, responses)
        return best_value

    def delete(self, key: str) -> bool:
        """Delete by writing a tombstone."""
        return self.put(key, b'__TOMBSTONE__', ttl=86400)  # TTL ensures tombstone cleanup

    def replicate_write(self, key: str, value: bytes,
                        expiry: Optional[float], replicas: List[str]) -> int:
        """Send write to replicas; return count of acks received."""
        new_clock = self.get_current_clock(key).increment(self.node_id)
        acks = 0
        for replica in replicas:
            if self._send_write(replica, key, value, new_clock, expiry):
                acks += 1
        return acks

    def _send_write(self, node_id: str, key: str, value: bytes,
                    clock: VectorClock, expiry: Optional[float]) -> bool:
        """Write to local store if this node, else RPC to remote node."""
        if node_id == self.node_id:
            self.local_store[key] = (value, clock, expiry)
            return True
        try:
            rpc_client.write(node_id, key, value, clock.to_dict(), expiry)
            return True
        except Exception:
            return False

    def gather_read(self, key: str, replicas: List[str],
                    r: int) -> List[Tuple[bytes, VectorClock]]:
        """Fetch from R replicas, return list of (value, clock) responses."""
        responses = []
        for replica in replicas:
            if len(responses) >= r:
                break
            try:
                val, clock_dict, expiry = rpc_client.read(replica, key)
                if expiry and time.time() > expiry:
                    continue  # expired
                if val != b'__TOMBSTONE__':
                    responses.append((val, VectorClock(clock_dict)))
            except Exception:
                continue
        return responses

    def resolve_responses(self, responses: List[Tuple[bytes, VectorClock]]) -> Tuple:
        """Return the response with the highest (most recent) vector clock."""
        best = responses[0]
        for val, clock in responses[1:]:
            if clock.dominates(best[1]):
                best = (val, clock)
        return best

    def read_repair(self, key: str, best_value: bytes,
                    best_clock: VectorClock,
                    replicas: List[str],
                    responses: List[Tuple]) -> None:
        """Asynchronously repair stale replicas."""
        latest_clocks = {r[1] for r in responses}
        for replica in replicas:
            if replica == self.node_id:
                stored = self.local_store.get(key)
                if stored and best_clock.dominates(stored[1]):
                    self.local_store[key] = (best_value, best_clock, stored[2])
            # In production: send async RPC to stale replicas

    def get_current_clock(self, key: str) -> VectorClock:
        entry = self.local_store.get(key)
        return entry[1] if entry else VectorClock()

Key Design Decisions Summary

LSM-tree provides high write throughput via sequential WAL appends and deferred compaction, at the cost of read amplification.
Virtual nodes (150+ per physical node) ensure even load distribution and graceful rebalancing on node failure or addition.
W+R>N guarantees that at least one node in every read quorum has seen the latest write — the foundation of tunable consistency.
Vector clocks detect concurrent writes precisely; LWW is simpler but discards one write silently.
Merkle tree anti-entropy bounds the sync cost to O(diff) and is essential for ensuring cold keys eventually converge across replicas.