What Is Eventual Consistency?
Eventual consistency is a consistency model that guarantees: if no new updates are made to a data item, all replicas will eventually converge to the same value. It makes no guarantee about when convergence happens or what value is returned between updates. It is the weakest consistency model commonly used in production distributed systems.
Eventual consistency enables high availability and low latency because replicas can serve reads and accept writes without coordinating synchronously with other nodes. The cost is that different clients may see different values for the same key at the same point in time.
Convergence Mechanisms
Three primary mechanisms drive convergence:
- Gossip propagation: nodes periodically exchange state with random peers. Updates spread through the cluster like an epidemic. Convergence time is O(log N) rounds for N nodes.
- Anti-entropy with Merkle trees: nodes build a Merkle tree over their key space. By exchanging tree hashes, two nodes can efficiently identify which key ranges differ and synchronize only the diverged data.
- Hinted handoff: if a target replica is temporarily unavailable, a coordinator node stores a hint (the write) and delivers it when the replica recovers.
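The O(log N) behavior of gossip can be seen in a toy push-gossip simulation. This is illustrative only: the single seed node, fixed fanout, and uniform random peer selection are simplifying assumptions, not how any particular system schedules gossip.

```python
import random

def simulate_gossip(n_nodes: int, fanout: int = 1, seed: int = 0) -> int:
    """Count rounds until every node holds the update (push-style gossip)."""
    rng = random.Random(seed)
    informed = {0}  # node 0 starts with the update
    rounds = 0
    while len(informed) < n_nodes:
        rounds += 1
        # Each informed node pushes the update to `fanout` random peers.
        for _sender in list(informed):
            for _ in range(fanout):
                informed.add(rng.randrange(n_nodes))
    return rounds

# Rounds grow roughly logarithmically with cluster size.
for n in (10, 100, 1000):
    print(n, simulate_gossip(n))
```

Because each round roughly doubles the informed set (minus collisions), a 1000-node cluster converges in a couple dozen rounds rather than a thousand.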
Conflict Resolution Strategies
When replicas diverge (concurrent writes to the same key on different nodes), a conflict resolution strategy determines the winner:
- Last-Write-Wins (LWW): the write with the highest timestamp wins. Simple and widely used. Risk: clock skew can cause a newer write (by wall clock) to be overwritten by an older one. NTP synchronization reduces but does not eliminate this risk.
- Vector clocks: each write carries a vector clock (a per-node logical counter). The system detects causally concurrent writes (neither happens-before the other) and surfaces them as conflicts. Application code merges the conflicting values. Used in Amazon Dynamo and Riak.
- CRDTs (Conflict-free Replicated Data Types): data structures designed to merge without conflict. Grow-only counters (G-Counter), add-only sets (G-Set), and two-phase sets (2P-Set) are common examples. No application-level merge needed.
- Application-level merge: for shopping carts, merge = union of items across conflicting versions. Semantic knowledge drives the merge.
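The happens-before comparison that drives vector-clock conflict detection can be sketched as follows. The dict-of-counters representation and return labels are illustrative assumptions, not any particular system's wire format.

```python
def compare(vc_a: dict, vc_b: dict) -> str:
    """Compare two vector clocks: 'a<b', 'a>b', 'equal', or 'concurrent'."""
    nodes = set(vc_a) | set(vc_b)
    a_le_b = all(vc_a.get(n, 0) <= vc_b.get(n, 0) for n in nodes)
    b_le_a = all(vc_b.get(n, 0) <= vc_a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return 'equal'
    if a_le_b:
        return 'a<b'          # a happens-before b; b supersedes a
    if b_le_a:
        return 'a>b'          # b happens-before a; a supersedes b
    return 'concurrent'       # neither dominates: surface as a conflict

print(compare({'n1': 2, 'n2': 1}, {'n1': 2, 'n2': 3}))  # a<b
print(compare({'n1': 3}, {'n2': 1}))                    # concurrent
```

Only the `'concurrent'` case requires an application-level merge; in the other cases one version causally supersedes the other and can safely overwrite it.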
Anti-Entropy
Anti-entropy is a background process that periodically reconciles state between pairs of replicas. Using Merkle trees:
- Each replica builds a Merkle tree over its key range. Leaf nodes are hashes of individual key-value pairs. Interior nodes are hashes of children.
- Two replicas exchange root hashes. If roots match, they are identical — no work needed.
- If roots differ, traverse the tree to find the differing subtrees. Only keys in differing subtrees are exchanged and repaired.
Anti-entropy ensures convergence even when gossip propagation fails or when a replica has been offline for an extended period.
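The tree exchange above can be sketched with a simplified model. It assumes both replicas hold the same key set so the trees have identical shape; real systems hash fixed key ranges instead, so missing keys also show up as differing ranges.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(kv: dict, keys: list):
    """Build a nested (hash, payload) Merkle tree over a sorted key range."""
    if len(keys) == 1:
        k = keys[0]
        return (h(f"{k}={kv[k]}".encode()), k)  # leaf: hash of one key-value pair
    mid = len(keys) // 2
    left = build_tree(kv, keys[:mid])
    right = build_tree(kv, keys[mid:])
    return (h(left[0] + right[0]), (left, right))  # interior: hash of children

def diff(a, b, out):
    """Collect keys whose leaves differ between two same-shaped trees."""
    if a[0] == b[0]:
        return                 # identical subtree: prune, no data exchanged
    if isinstance(a[1], str):
        out.append(a[1])       # differing leaf: this key needs repair
        return
    diff(a[1][0], b[1][0], out)
    diff(a[1][1], b[1][1], out)

replica1 = {'k1': 'v1', 'k2': 'v2', 'k3': 'v3', 'k4': 'v4'}
replica2 = {'k1': 'v1', 'k2': 'XX', 'k3': 'v3', 'k4': 'v4'}
keys = sorted(replica1)
changed = []
diff(build_tree(replica1, keys), build_tree(replica2, keys), changed)
print(changed)  # only 'k2' needs repair
```

The pruning on matching hashes is the whole point: of four keys, only the one diverged key is identified and exchanged.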
Hinted Handoff
When a write arrives and the target replica is unavailable:
- A coordinator node accepts the write and stores it as a hint (key, value, target node ID).
- The coordinator periodically checks if the target node has recovered.
- When the target recovers, the coordinator delivers all pending hints.
- Hints have a TTL — if the target is unavailable too long, the hint expires and the write must be recovered via anti-entropy instead.
Read Repair
On a quorum read, the coordinator collects responses from multiple replicas. If replicas return different values:
- Return the most recent value (by timestamp or vector clock) to the client immediately.
- Asynchronously write the correct value back to the stale replicas.
Read repair is a lightweight, on-demand reconciliation mechanism. Because it only activates for keys that are actually read, it converges frequently-read hot keys quickly; rarely-read cold keys rely on anti-entropy instead.
Tunable Consistency
Systems like Cassandra allow per-operation consistency levels:
- ONE: one replica responds. Lowest latency, highest staleness risk.
- QUORUM: majority of replicas respond. Balanced — reads and writes at QUORUM guarantee overlap (no stale reads after a QUORUM write).
- ALL: all replicas respond. Highest consistency, highest latency, unavailable if any replica is down.
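The overlap guarantee behind QUORUM is simple arithmetic: with N replicas, a write acknowledged by W nodes and a read contacting R nodes must intersect whenever R + W > N. A one-line check makes the levels concrete:

```python
def quorum_overlap(n: int, w: int, r: int) -> bool:
    """True iff every R-replica read must intersect every W-replica write."""
    return r + w > n

# Replication factor 3, QUORUM = 2:
print(quorum_overlap(3, 2, 2))  # True: a QUORUM read always sees the last QUORUM write
print(quorum_overlap(3, 1, 1))  # False: ONE reads can miss a ONE write entirely
```

This is why QUORUM/QUORUM (2 + 2 > 3) gives read-your-writes behavior while ONE/ONE (1 + 1 = 2, not > 3) permits stale reads.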
SQL Schema
CREATE TABLE ReplicaState (
node_id TEXT NOT NULL,
key TEXT NOT NULL,
value JSONB NOT NULL,
vector_clock JSONB NOT NULL DEFAULT '{}'::jsonb,
updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
PRIMARY KEY (node_id, key)
);
CREATE TABLE HintedHandoff (
hint_id BIGSERIAL PRIMARY KEY,
target_node TEXT NOT NULL,
key TEXT NOT NULL,
value JSONB NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
delivered_at TIMESTAMPTZ,
expires_at TIMESTAMPTZ NOT NULL
);
CREATE INDEX idx_hints_target ON HintedHandoff (target_node, delivered_at NULLS FIRST);
CREATE TABLE ReadRepairLog (
id BIGSERIAL PRIMARY KEY,
key TEXT NOT NULL,
stale_node_id TEXT NOT NULL,
repaired_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
Python Implementation Sketch
import time
from typing import Optional

class EventuallyConsistentStore:
    def __init__(self, nodes: list, quorum: int = 2):
        self.nodes = nodes
        self.quorum = quorum

    def write(self, key: str, value: dict, consistency_level: str = 'QUORUM') -> bool:
        required = self._required_acks(consistency_level)
        acks = 0
        for node in self.nodes:
            try:
                node.put(key, value, timestamp=time.time())
                acks += 1
                if acks >= required:
                    return True
            except Exception:
                continue
        return acks >= required

    def read(self, key: str, consistency_level: str = 'QUORUM') -> Optional[dict]:
        required = self._required_acks(consistency_level)
        responses = []
        for node in self.nodes:
            try:
                response = node.get(key)
                if response:
                    responses.append(response)
                if len(responses) >= required:
                    break
            except Exception:
                continue
        if not responses:
            return None
        return self.perform_read_repair(key, responses)

    def perform_read_repair(self, key: str, responses: list) -> dict:
        latest = max(responses, key=lambda r: r.get('timestamp', 0))
        stale = [r for r in responses
                 if r.get('timestamp', 0) < latest.get('timestamp', 0)]
        for response in stale:
            node = self._find_node(response.get('node_id'))
            if node is not None:
                # Asynchronous in practice; shown inline for clarity.
                node.put(key, latest.get('value'), timestamp=latest.get('timestamp'))
        return latest

    def _required_acks(self, level: str) -> int:
        if level == 'ONE':
            return 1
        if level == 'QUORUM':
            return self.quorum
        if level == 'ALL':
            return len(self.nodes)
        return self.quorum

    def _resolve_conflict(self, a: dict, b: dict) -> dict:
        # Last-write-wins by timestamp
        return a if a.get('timestamp', 0) >= b.get('timestamp', 0) else b

    # Internals sketched elsewhere; stubbed here for brevity:
    def _build_merkle_tree(self): pass
    def _diff_merkle_trees(self, a, b): return []
    def _find_node(self, node_id: str): return None
    def _local_get(self, key: str): return {}
    def _local_put(self, key: str, value: dict): pass
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "When should you use LWW vs vector clocks for conflict resolution?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Use LWW when the data type is naturally overwrite-friendly (user profile, configuration value) and clock skew is manageable (NTP-synced nodes within a datacenter). Use vector clocks when causal history matters and you need to detect concurrent writes — for example, collaborative documents or shopping carts where merging conflicting versions is semantically meaningful. LWW is simpler; vector clocks are safer for data where silent overwrites cause correctness bugs."
      }
    },
    {
      "@type": "Question",
      "name": "How does hinted handoff ensure eventual delivery?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "When a target replica is unavailable, a coordinator stores the write as a hint with a target node ID and an expiry TTL. The coordinator periodically checks if the target has recovered and delivers all pending hints when it does. If the hint expires (target down too long), the write must be recovered via anti-entropy by comparing Merkle trees with the recovered replica. Hinted handoff provides fast, best-effort delivery; anti-entropy provides the safety net."
      }
    },
    {
      "@type": "Question",
      "name": "What is the overhead of read repair in an eventually consistent system?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Read repair adds a background write for each stale replica detected on a quorum read. The client read itself is not delayed — the repair is asynchronous. The overhead is proportional to the number of stale replicas and the frequency of reads on diverged keys. In practice, read repair converges frequently-read hot keys quickly while rarely-read cold keys rely on anti-entropy for eventual synchronization."
      }
    },
    {
      "@type": "Question",
      "name": "How do tunable consistency levels work in practice?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "In a 3-node cluster with replication factor 3, QUORUM requires 2 replicas. A write at QUORUM and a subsequent read at QUORUM are guaranteed to overlap on at least one replica, ensuring the read sees the write. ONE offers the lowest latency but allows stale reads. ALL offers linearizable-like behavior but is unavailable if any node is down. Most production workloads use QUORUM for writes and ONE or QUORUM for reads depending on the staleness tolerance of the use case."
      }
    }
  ]
}
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "How does read repair achieve convergence?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "During a read, the coordinator queries multiple replicas and compares their returned values; any replica that returns a stale version is asynchronously sent the most recent value so that it converges to the latest state without a separate repair job. Read repair is triggered probabilistically (e.g., on 10% of reads) to avoid adding latency to every request, balancing convergence speed against read overhead."
      }
    },
    {
      "@type": "Question",
      "name": "How are conflicts detected in eventually consistent systems?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Conflicts are detected by comparing vector clocks or version vectors attached to each value: if neither version's clock dominates the other, the two versions are concurrent and represent a conflict that must be resolved. Systems like Dynamo surface sibling values to the application layer for semantic merge, while simpler systems apply last-write-wins using a wall-clock or logical timestamp."
      }
    },
    {
      "@type": "Question",
      "name": "What is anti-entropy and how does it work?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Anti-entropy is a background gossip or Merkle-tree exchange process where pairs of nodes periodically compare their data sets and synchronize missing or divergent entries, ensuring that temporary network partitions do not leave replicas permanently inconsistent. A Merkle tree hashes ranges of key space into a tree structure so that two nodes can identify the differing subtree with O(log n) round trips rather than exchanging the full data set."
      }
    },
    {
      "@type": "Question",
      "name": "How is eventual consistency bounded in practice?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Eventual consistency is bounded by characterizing the maximum propagation delay under normal network conditions and expressing it as a staleness SLA (e.g., all replicas converge within 500 ms under normal operation). Operators monitor replication lag histograms and anti-entropy completion times to confirm the system stays within the bound, and use session consistency (read-your-writes via sticky routing or tokens) to hide staleness from individual user sessions."
      }
    }
  ]
}