What Is Eventual Consistency?
Eventual consistency is a consistency model that guarantees: if no new updates are made to a data item, all replicas will eventually converge to the same value. It makes no guarantee about when convergence happens or what value is returned between updates. It is the weakest consistency model commonly used in production distributed systems.
Eventual consistency enables high availability and low latency because replicas can serve reads and accept writes without coordinating synchronously with other nodes. The cost is that different clients may see different values for the same key at the same point in time.
Convergence Mechanisms
Three primary mechanisms drive convergence:
- Gossip propagation: nodes periodically exchange state with random peers. Updates spread through the cluster like an epidemic. Convergence time is O(log N) rounds for N nodes.
- Anti-entropy with Merkle trees: nodes build a Merkle tree over their key space. By exchanging tree hashes, two nodes can efficiently identify which key ranges differ and synchronize only the diverged data.
- Hinted handoff: if a target replica is temporarily unavailable, a coordinator node stores a hint (the write) and delivers it when the replica recovers.
Conflict Resolution Strategies
When replicas diverge (concurrent writes to the same key on different nodes), a conflict resolution strategy determines the winner:
- Last-Write-Wins (LWW): the write with the highest timestamp wins. Simple and widely used. Risk: clock skew can cause a newer write (by wall clock) to be overwritten by an older one. NTP synchronization reduces but does not eliminate this risk.
- Vector clocks: each write carries a vector clock (a per-node logical counter). The system detects causally concurrent writes (neither happens-before the other) and surfaces them as conflicts. Application code merges the conflicting values. Used in Amazon Dynamo and Riak.
- CRDTs (Conflict-free Replicated Data Types): data structures designed to merge without conflict. Grow-only counters (G-Counter), add-only sets (G-Set), and two-phase sets (2P-Set) are common examples. No application-level merge needed.
- Application-level merge: for shopping carts, merge = union of items across conflicting versions. Semantic knowledge drives the merge.
Anti-Entropy
Anti-entropy is a background process that periodically reconciles state between pairs of replicas. Using Merkle trees:
- Each replica builds a Merkle tree over its key range. Leaf nodes are hashes of individual key-value pairs. Interior nodes are hashes of children.
- Two replicas exchange root hashes. If roots match, they are identical — no work needed.
- If roots differ, traverse the tree to find the differing subtrees. Only keys in differing subtrees are exchanged and repaired.
Anti-entropy ensures convergence even when gossip propagation fails or when a replica has been offline for an extended period.
Hinted Handoff
When a write arrives and the target replica is unavailable:
- A coordinator node accepts the write and stores it as a hint (key, value, target node ID).
- The coordinator periodically checks if the target node has recovered.
- When the target recovers, the coordinator delivers all pending hints.
- Hints have a TTL — if the target is unavailable too long, the hint expires and the write must be recovered via anti-entropy instead.
Read Repair
On a quorum read, the coordinator collects responses from multiple replicas. If replicas return different values:
- Return the most recent value (by timestamp or vector clock) to the client immediately.
- Asynchronously write the correct value back to the stale replicas.
Read repair is a lightweight, on-demand reconciliation mechanism. It only activates for keys that are actually read, making it efficient for hot keys.
Tunable Consistency
Systems like Cassandra allow per-operation consistency levels:
- ONE: one replica responds. Lowest latency, highest staleness risk.
- QUORUM: majority of replicas respond. Balanced — reads and writes at QUORUM guarantee overlap (no stale reads after a QUORUM write).
- ALL: all replicas respond. Highest consistency, highest latency, unavailable if any replica is down.
SQL Schema
CREATE TABLE ReplicaState (
node_id TEXT NOT NULL,
key TEXT NOT NULL,
value JSONB NOT NULL,
vector_clock JSONB NOT NULL DEFAULT '{}'::jsonb,
updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
PRIMARY KEY (node_id, key)
);
CREATE TABLE HintedHandoff (
hint_id BIGSERIAL PRIMARY KEY,
target_node TEXT NOT NULL,
key TEXT NOT NULL,
value JSONB NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
delivered_at TIMESTAMPTZ,
expires_at TIMESTAMPTZ NOT NULL
);
CREATE INDEX idx_hints_target ON HintedHandoff (target_node, delivered_at NULLS FIRST);
CREATE TABLE ReadRepairLog (
id BIGSERIAL PRIMARY KEY,
key TEXT NOT NULL,
stale_node_id TEXT NOT NULL,
repaired_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
Python Implementation Sketch
import time, random
from typing import List, Optional
class EventuallyConsistentStore:
def __init__(self, nodes: list, quorum: int = 2):
self.nodes = nodes
self.quorum = quorum
def write(self, key: str, value: dict, consistency_level: str = 'QUORUM') -> bool:
required = self._required_acks(consistency_level)
acks = 0
for node in self.nodes:
try:
node.put(key, value, timestamp=time.time())
acks += 1
if acks >= required:
return True
except Exception:
continue
return acks >= required
def read(self, key: str, consistency_level: str = 'QUORUM') -> Optional[dict]:
required = self._required_acks(consistency_level)
responses = []
for node in self.nodes:
try:
response = node.get(key)
if response:
responses.append(response)
if len(responses) >= required:
break
except Exception:
continue
if not responses:
return None
result = self.perform_read_repair(key, responses)
return result
def perform_read_repair(self, key: str, responses: list) -> dict:
latest = max(responses, key=lambda r: r.get('timestamp', 0))
stale = [r for r in responses if r.get('timestamp', 0) int:
if level == 'ONE': return 1
if level == 'QUORUM': return self.quorum
if level == 'ALL': return len(self.nodes)
return self.quorum
def _resolve_conflict(self, a: dict, b: dict) -> dict:
# Last-write-wins by timestamp
return a if a.get('timestamp', 0) >= b.get('timestamp', 0) else b
def _build_merkle_tree(self): pass
def _diff_merkle_trees(self, a, b): return []
def _find_node(self, node_id: str): return None
def _local_get(self, key: str): return {}
def _local_put(self, key: str, value: dict): pass
See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering
See also: Meta Interview Guide 2026: Facebook, Instagram, WhatsApp Engineering
See also: Databricks Interview Guide 2026: Spark Internals, Delta Lake, and Lakehouse Architecture