A key-value store is the simplest form of a database: store a value associated with a key, retrieve the value by key. Designing one from scratch — as asked in system design interviews — requires understanding storage engines, hashing strategies, replication, and the CAP theorem trade-offs that differentiate DynamoDB, Cassandra, Redis, and etcd.
Storage Engine Options
- Hash table: O(1) get/put, stored in memory. No range queries, and size is bounded by available memory. Used by Redis and Memcached.
- LSM-tree (Log-Structured Merge tree): writes go to an in-memory memtable (sorted by key), which is flushed to an immutable SSTable file on disk when full. Reads check the memtable first, then SSTables from newest to oldest, using Bloom filters to skip tables that cannot contain the key. Excellent write throughput (sequential disk I/O), acceptable read throughput. Used by RocksDB and Cassandra.
- B-tree: an on-disk balanced tree. Excellent read performance, supports range queries, moderate write performance. Used by PostgreSQL and MySQL.
Key-value stores typically use hash tables for in-memory stores and LSM-trees for persistent stores.
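The LSM write/read path above can be sketched in a few lines. This is a toy model, not a real engine: the "SSTables" are in-memory sorted lists rather than on-disk files, and compaction and Bloom filters are omitted.

```python
# Minimal LSM-tree sketch: an in-memory memtable flushed to immutable
# "SSTables" (here, sorted lists) when it reaches a size threshold.
# Real engines (RocksDB, Cassandra) persist SSTables to disk, compact
# them, and use Bloom filters to skip tables that cannot hold a key.

class MiniLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}       # newest writes; sorted only on flush
        self.sstables = []       # immutable sorted runs, newest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Flush the memtable as an immutable sorted run
        # (a sequential write on a real disk).
        self.sstables.insert(0, sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # Check the memtable first, then SSTables newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for table in self.sstables:
            for k, v in table:   # real SSTables use binary search
                if k == key:
                    return v
        return None
```

Because each flush is a sequential write of a sorted run, writes never seek; the cost is that a read may have to consult several tables, which is what Bloom filters and compaction mitigate in production engines.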
Consistent Hashing for Distribution
Distributing keys across nodes uses consistent hashing: map keys onto a hash ring [0, 2^32), assign nodes positions on the same ring, and route each key to the first node encountered moving clockwise. Adding a node remaps only about k/n keys (k keys, n nodes). With virtual nodes (e.g., 150 positions per physical node), load distributes evenly. The client library (or a routing layer) maintains the ring state and routes requests to the correct node. Cassandra partitions by the hash of the partition key using this mechanism.
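A minimal ring with virtual nodes can be built with a sorted list and binary search. The hash function and the 150-vnode count are illustrative choices, not a standard:

```python
import bisect
import hashlib

# Consistent-hash ring with virtual nodes. Node labels and keys are
# hashed onto the same [0, 2^32) ring; a key belongs to the first
# virtual node at or after its position, wrapping around to 0.
VNODES = 150  # illustrative; real systems tune this per deployment

def _hash(s: str) -> int:
    # Stable 32-bit position on the ring.
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2 ** 32)

class HashRing:
    def __init__(self, nodes):
        self.ring = sorted(
            (_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(VNODES)
        )
        self.positions = [pos for pos, _ in self.ring]

    def node_for(self, key: str) -> str:
        # First virtual node clockwise from the key's hash.
        i = bisect.bisect_right(self.positions, _hash(key)) % len(self.ring)
        return self.ring[i][1]
```

Adding a fourth node to a three-node ring moves only roughly a quarter of the keys, since only the ring segments the new node's virtual positions claim change owners.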
Replication
For durability and availability, replicate each key to N nodes (typically N=3). With consistent hashing, the key is stored on the primary node and the next N-1 nodes clockwise on the ring. Replication modes:
- Synchronous: the write completes only when all N replicas confirm. Strong consistency, but higher latency.
- Asynchronous: the write returns after the primary persists it. Lower latency, but reads from replicas may be stale.
- Quorum: the write completes when W of N replicas confirm, and reads query R replicas. If W + R > N, every read quorum overlaps every write quorum, so at least one queried replica holds the latest write.
Cassandra's default is W=1, R=1, N=3 (low latency, eventual consistency); for strong consistency, use W=2, R=2, N=3.
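The quorum overlap argument can be made concrete with a sketch. Replicas here are plain dicts holding (timestamp, value) versions, and a hypothetical `acks` parameter simulates how many replicas a write reaches; a real system adds RPCs, failure detection, and read repair.

```python
import time

# Quorum read/write sketch: writes succeed after W of N replicas ack;
# reads query R replicas and return the freshest version. With
# W + R > N, any read quorum must overlap any successful write quorum.

class QuorumStore:
    def __init__(self, n=3, w=2, r=2):
        self.n, self.w, self.r = n, w, r
        self.replicas = [{} for _ in range(n)]

    def put(self, key, value, acks=None):
        # `acks` simulates reachable replicas; default is all N.
        acks = self.n if acks is None else acks
        if acks < self.w:
            raise RuntimeError("write failed: fewer than W replicas acked")
        version = (time.monotonic_ns(), value)
        for replica in self.replicas[:acks]:  # unreached replicas stay stale
            replica[key] = version
        return True

    def get(self, key):
        # Read the *last* R replicas: with W + R > N this set always
        # overlaps the first-W replicas a successful write reached.
        versions = [rep[key] for rep in self.replicas[self.n - self.r:]
                    if key in rep]
        if not versions:
            return None
        return max(versions)[1]  # freshest timestamp wins
```

Note that the read deliberately targets a different replica subset than the write: the overlap guarantee, not replica choice, is what makes the read see the latest value.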
Conflict Resolution
With async replication and network partitions, different replicas can hold different values for the same key. Resolution strategies: Last-write-wins (LWW): the write with the highest timestamp wins. Simple, but it can silently lose data if clocks are not synchronized. DynamoDB uses LWW by default. Vector clocks: each write is annotated with a vector clock (a map of node → counter). On conflict, the clocks are compared: if one dominates the other (every counter greater than or equal, and at least one strictly greater), it wins; if neither dominates, the writes are concurrent, and both versions are stored and presented to the client for resolution. Amazon Dynamo (the paper) uses vector clocks; Cassandra uses LWW for simplicity.
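The dominance check is small enough to write out. A sketch of the comparison, treating a missing node entry as counter 0:

```python
# Vector-clock comparison: each write carries {node_id: counter}.
# Clock A dominates B if every counter in A is >= the corresponding
# counter in B (missing entries count as 0) and at least one is
# strictly greater. If neither dominates, the writes are concurrent.

def compare(a: dict, b: dict) -> str:
    nodes = set(a) | set(b)
    a_ge = all(a.get(n, 0) >= b.get(n, 0) for n in nodes)
    b_ge = all(b.get(n, 0) >= a.get(n, 0) for n in nodes)
    if a_ge and b_ge:
        return "equal"
    if a_ge:
        return "a_wins"
    if b_ge:
        return "b_wins"
    return "concurrent"  # keep both versions; the client reconciles
```

The "concurrent" branch is the one LWW cannot express: rather than picking a winner by timestamp, the store surfaces both siblings, as Dynamo does with shopping carts.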
CAP Theorem Application
Key-value stores make an explicit CAP trade-off. CP (consistency + partition tolerance): during a partition, refuse writes/reads to maintain consistency. etcd (Raft-based) is CP — it requires a quorum to process operations, becoming unavailable during partitions. AP (availability + partition tolerance): during a partition, continue serving reads and writes, accepting possible staleness. Cassandra is AP — all nodes accept writes regardless of partition; eventual consistency resolves divergence. For a key-value store used as a cache (Redis) or session store, AP is appropriate — stale reads are acceptable. For a key-value store used for configuration (etcd) or coordination, CP is required — stale reads could cause split-brain.