Question 1

When should you use LWW vs vector clocks for conflict resolution?

Accepted Answer

Use LWW when the data type is naturally overwrite-friendly (user profile, configuration value) and clock skew is manageable (NTP-synced nodes within a datacenter). Use vector clocks when causal history matters and you need to detect concurrent writes — for example, collaborative documents or shopping carts where merging conflicting versions is semantically meaningful. LWW is simpler; vector clocks are safer for data where silent overwrites cause correctness bugs.
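To make the "silent overwrite" risk concrete, here is a minimal sketch of an LWW merge. The `Versioned` type, the `lww_merge` function, and the node-ID tie-breaker are illustrative assumptions, not part of any particular database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp_ms: int  # wall-clock time assigned by the writing node (assumed)
    node_id: str       # tie-breaker so every replica picks the same winner


def lww_merge(a: Versioned, b: Versioned) -> Versioned:
    """Return the version with the larger (timestamp, node_id) pair."""
    if (a.timestamp_ms, a.node_id) >= (b.timestamp_ms, b.node_id):
        return a
    return b


# Hypothetical scenario: node B's clock runs behind, so a write that
# happened later in real time carries the smaller timestamp and loses.
older = Versioned("alice@old.example", timestamp_ms=1_000, node_id="A")
newer_but_skewed = Versioned("alice@new.example", timestamp_ms=900, node_id="B")
winner = lww_merge(older, newer_but_skewed)
```

The merge is deterministic and commutative, which is why LWW is simple to operate, but the skewed write is discarded without any signal to the application.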

Question 2

How does hinted handoff ensure eventual delivery?

Accepted Answer

When a target replica is unavailable, the coordinator stores the write as a hint tagged with the target node ID and an expiry TTL. The coordinator periodically checks whether the target has recovered and, when it has, delivers all pending hints for it. If a hint expires because the target was down too long, the write must instead be recovered via anti-entropy, by comparing Merkle trees with the recovered replica. Hinted handoff provides fast, best-effort delivery; anti-entropy provides the safety net.
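The hint lifecycle described above could be sketched roughly as follows. `Hint`, `HintStore`, the `is_alive` and `deliver` callbacks, and the three-hour default TTL are all hypothetical names and values chosen for illustration.

```python
from __future__ import annotations

import time
from dataclasses import dataclass, field


@dataclass
class Hint:
    target_node: str
    key: str
    value: str
    expires_at: float


@dataclass
class HintStore:
    ttl_seconds: float = 3 * 3600  # assumed window, not taken from the source
    hints: list[Hint] = field(default_factory=list)

    def add(self, target_node, key, value, now=None):
        """Record a write that could not reach target_node."""
        now = time.time() if now is None else now
        self.hints.append(Hint(target_node, key, value, now + self.ttl_seconds))

    def replay(self, target_node, is_alive, deliver, now=None):
        """Deliver unexpired hints for target_node if it is alive.

        Expired hints are dropped; those writes are left to anti-entropy.
        Returns the number of hints delivered.
        """
        now = time.time() if now is None else now
        remaining, delivered = [], 0
        for hint in self.hints:
            if hint.expires_at <= now:
                continue  # expired: no longer this store's responsibility
            if hint.target_node == target_node and is_alive(target_node):
                deliver(hint.key, hint.value)
                delivered += 1
            else:
                remaining.append(hint)
        self.hints = remaining
        return delivered
```

A coordinator would call `replay` from a periodic task. Passing `now` explicitly is only a convenience for testing the expiry path.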

Question 3

What is the overhead of read repair in an eventually consistent system?

Accepted Answer

Read repair adds a background write for each stale replica detected on a quorum read. The client read itself is not delayed — the repair is asynchronous. The overhead is proportional to the number of stale replicas and the frequency of reads on diverged keys. In practice, read repair converges frequently-read hot keys quickly while rarely-read cold keys rely on anti-entropy for eventual synchronization.

Question 4

How do tunable consistency levels work in practice?

Accepted Answer

In a 3-node cluster with replication factor 3, QUORUM requires 2 replicas (a majority: floor(3/2) + 1). A successful write at QUORUM and a subsequent read at QUORUM are guaranteed to overlap on at least one replica because the write and read replica counts sum to more than the replication factor (2 + 2 > 3), so the read sees the write. ONE offers the lowest latency but allows stale reads. ALL offers linearizable-like behavior, but requests fail if any replica is down. Most production workloads use QUORUM for writes and ONE or QUORUM for reads depending on the staleness tolerance of the use case.
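The overlap argument is just arithmetic, and a small sketch makes it checkable. The `LEVELS` table and the `overlaps` helper are illustrative names; only the level names ONE, QUORUM, and ALL come from the answer above.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


# Number of replicas that must respond at each consistency level.
LEVELS = {
    "ONE": lambda rf: 1,
    "QUORUM": quorum,
    "ALL": lambda rf: rf,
}


def overlaps(write_level: str, read_level: str, replication_factor: int) -> bool:
    """True when every read set must intersect every write set (R + W > N)."""
    w = LEVELS[write_level](replication_factor)
    r = LEVELS[read_level](replication_factor)
    return r + w > replication_factor


print(overlaps("QUORUM", "QUORUM", 3))  # 2 + 2 > 3
print(overlaps("QUORUM", "ONE", 3))     # 2 + 1 > 3 does not hold
```

With replication factor 3, QUORUM writes paired with ONE reads do not satisfy the inequality, which is exactly the case where stale reads become possible.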

Question 5

How does read repair achieve convergence?

Accepted Answer

During a read, the coordinator queries multiple replicas and compares their returned values. Any replica that returns a stale version is asynchronously sent the most recent value, so it converges to the latest state without a separate repair job. Read repair can be triggered probabilistically (e.g., on 10% of reads) to avoid adding overhead to every request, balancing convergence speed against read cost.
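A simplified, single-process sketch of that flow follows. Replicas are modeled as plain dicts mapping a key to a `(version, value)` pair, and `read_with_repair`, `repair_chance`, and `rng` are hypothetical names; a real coordinator would issue the repair writes asynchronously over the network.

```python
import random


def read_with_repair(replicas, key, repair_chance=0.1, rng=random.random):
    """Return (latest_value, replicas_repaired) for key.

    Each replica is a dict of key -> (version, value); a missing key is
    treated as version 0.
    """
    responses = [replica.get(key, (0, None)) for replica in replicas]
    latest = max(responses, key=lambda response: response[0])
    repaired = 0
    if rng() < repair_chance:  # only a sampled fraction of reads repair
        for replica, response in zip(replicas, responses):
            if response[0] < latest[0]:
                replica[key] = latest  # stand-in for an async repair write
                repaired += 1
    return latest[1], repaired
```

Injecting `rng` keeps the sampling decision testable; the default uses `random.random` so roughly `repair_chance` of reads perform repair.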

Question 6

How are conflicts detected in eventually consistent systems?

Accepted Answer

Conflicts are detected by comparing vector clocks or version vectors attached to each value: if neither version's clock dominates the other, the two versions are concurrent and represent a conflict that must be resolved. Systems like Dynamo surface sibling values to the application layer for semantic merge, while simpler systems apply last-write-wins using a wall-clock or logical timestamp.
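The dominance check can be written in a few lines. This sketch represents a vector clock as a dict from node ID to counter, with missing entries treated as 0; the `compare` function and its return labels are illustrative choices.

```python
def compare(a: dict, b: dict) -> str:
    """Classify two vector clocks as equal, ordered, or concurrent."""
    nodes = a.keys() | b.keys()
    a_ahead = any(a.get(n, 0) > b.get(n, 0) for n in nodes)
    b_ahead = any(b.get(n, 0) > a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        return "concurrent"  # neither dominates: a conflict to resolve
    if a_ahead:
        return "a_dominates"
    if b_ahead:
        return "b_dominates"
    return "equal"


# Two writers each advanced their own counter from a shared ancestor.
print(compare({"n1": 2, "n2": 1}, {"n1": 1, "n2": 2}))  # concurrent
```

Only the "concurrent" result needs a merge strategy; when one clock dominates, the dominated version can be discarded safely.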

Question 7

What is anti-entropy and how does it work?

Accepted Answer

Anti-entropy is a background process, typically a gossip or Merkle-tree exchange, in which pairs of nodes periodically compare their data sets and synchronize missing or divergent entries, so that temporary network partitions do not leave replicas permanently inconsistent. A Merkle tree hashes ranges of the key space into a tree structure: two nodes compare root hashes first and descend only into subtrees whose hashes differ, so they can locate a divergent range in O(log n) steps rather than exchanging the full data set.
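As a rough illustration of the Merkle comparison, the sketch below builds a binary hash tree over a fixed, power-of-two number of key-range buckets and walks two trees top-down. `build_tree`, `diff`, and the bucket layout are assumptions made for the example; both trees are held in one process here, whereas real nodes would exchange hashes level by level.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_tree(buckets):
    """Build hash levels, leaves first, from a power-of-two list of dicts."""
    level = [_h(repr(sorted(b.items())).encode()) for b in buckets]
    levels = [level]
    while len(level) > 1:
        # Each parent hashes the concatenation of its two children.
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def diff(tree_a, tree_b):
    """Return indices of leaf buckets that differ between two trees."""
    def walk(depth, index):
        if tree_a[depth][index] == tree_b[depth][index]:
            return []  # identical subtree: skip everything beneath it
        if depth == 0:
            return [index]
        return walk(depth - 1, 2 * index) + walk(depth - 1, 2 * index + 1)

    return walk(len(tree_a) - 1, 0)


node_a = [{"k1": "v"}, {"k2": "v"}, {}, {"k4": "v"}]
node_b = [{"k1": "v"}, {"k2": "stale"}, {}, {"k4": "v"}]
divergent = diff(build_tree(node_a), build_tree(node_b))
```

Only the buckets listed in `divergent` need their keys exchanged; matching subtrees are never visited.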

Question 8

How is eventual consistency bounded in practice?

Accepted Answer

Eventual consistency is bounded by characterizing the maximum propagation delay under normal network conditions and expressing it as a staleness SLA (e.g., all replicas converge within 500 ms under normal operation). Operators monitor replication lag histograms and anti-entropy completion times to confirm the system stays within the bound, and use session consistency (read-your-writes via sticky routing or tokens) to hide staleness from individual user sessions.
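One way to picture the token-based variant of read-your-writes is the sketch below. It assumes each write returns a monotonically increasing version that the client keeps as its session token; `read_your_writes` and the dict-based replicas are hypothetical.

```python
def read_your_writes(replicas, key, session_version):
    """Return the value from the first replica that has caught up.

    Each replica is a dict of key -> (version, value). session_version is
    the version returned by this session's most recent write (assumed).
    """
    for replica in replicas:
        version, value = replica.get(key, (0, None))
        if version >= session_version:
            return value
    raise LookupError(f"no replica has reached version {session_version}")


stale = {"cart": (3, ["book"])}
fresh = {"cart": (4, ["book", "pen"])}
print(read_your_writes([stale, fresh], "cart", session_version=4))
```

Other sessions may still observe the stale replica; the token only guarantees that this session never reads behind its own writes.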

Eventual Consistency Low-Level Design: Convergence Guarantees, Conflict Resolution, and Read Repair

What Is Eventual Consistency?

Convergence Mechanisms

Conflict Resolution Strategies

Anti-Entropy

Hinted Handoff

Read Repair

Tunable Consistency

SQL Schema

Python Implementation Sketch