Conflict-Free Replication Low-Level Design: CRDTs, Operational Transforms, and Convergent Data Types

⏱ 9 min read

What Are CRDTs?

A Conflict-free Replicated Data Type (CRDT) is a data structure with a mathematically defined merge operation that is commutative (order of merges does not matter), associative (grouping of merges does not matter), and idempotent (merging the same state twice has no additional effect). These three properties together guarantee that any set of replicas that have received the same set of updates will converge to the same state, regardless of the order in which updates and merges arrived. CRDTs eliminate the need for conflict resolution because conflicts structurally cannot occur.

G-Counter (Grow-Only Counter)

Each replica maintains its own slot in a vector of counters. Increment operations apply only to the local replica's slot. The global value is the sum of all slots. Merge is component-wise max: for each slot, take the larger of the two values. Because slots are independent and only ever increase, merges always converge.

G-Counter state: {node_A: 5, node_B: 3, node_C: 2}
Value:           5 + 3 + 2 = 10
Merge({A:5,B:3,C:2}, {A:4,B:6,C:2}) = {A:5, B:6, C:2}  → value 13

PN-Counter (Increment/Decrement Counter)

A PN-Counter pairs two G-Counters: one for increments (P) and one for decrements (N). The value is P.value() – N.value(). Merge is applied independently to both G-Counters. This allows both increment and decrement while preserving convergence.

2P-Set (Two-Phase Set)

A 2P-Set maintains two G-Sets: an add set and a tombstone (remove) set. An element is a member if it is in the add set but not in the tombstone set. Once an element is removed (added to the tombstone set), it can never be re-added. Merge is set union on both the add set and tombstone set independently.

OR-Set (Observed-Remove Set)

The OR-Set improves on 2P-Set by allowing elements to be removed and re-added. Each add operation tags the element with a unique token (e.g., UUID). Remove operations remove all currently observed tokens for the element. A concurrent add and remove are resolved gracefully: the add wins because it generates a new token that was not in the remove's observed set. Merge takes the union of all (element, token) pairs and removes pairs that are in both the add and remove sets.

LWW-Register (Last-Write-Wins Register)

An LWW-Register stores a single value tagged with a timestamp. The merge operation selects the value with the higher timestamp. This requires synchronized clocks or logical timestamps (Lamport clocks, hybrid logical clocks). HLCs combine a physical timestamp with a logical counter, providing causality tracking without requiring perfectly synchronized clocks. Each write advances the HLC beyond the current physical time and any observed HLC.

LWW-Element-Set

An LWW-Element-Set extends LWW-Register to sets: each element has its own timestamp for the most recent add and remove. An element is present if its add-timestamp is greater than its remove-timestamp. Merge takes the per-element max of both timestamps independently.

RGA: Replicated Growable Array

RGA is a sequence CRDT for ordered lists (collaborative text editing). Each element is assigned a globally unique ID (timestamp + node ID). Elements are ordered by their unique ID. Insert operations add an element after a specified existing element ID. Delete operations add the element to a tombstone set (the element remains in the list but is marked invisible). Merge takes the union of all elements and tombstones, then sorts by ID. This guarantees that concurrent inserts at the same position produce a deterministic order across all replicas.

Causal Context

Causal context (also called a version vector or dot-cloud) tracks each replica's causal history. It allows a replica to distinguish genuinely new operations from re-delivered or duplicated ones. When a replica receives an update, it compares the update's causal context with its own: if the update's context is already subsumed by the local context, the update is a duplicate and is discarded without applying. This enables exactly-once semantics over an at-least-once delivery network.

SQL Schema

CREATE TABLE CRDTState (
    id           BIGSERIAL    PRIMARY KEY,
    node_id      VARCHAR(64)  NOT NULL,
    crdt_name    VARCHAR(128) NOT NULL,
    crdt_type    VARCHAR(32)  NOT NULL,
    state_blob   BYTEA        NOT NULL,
    vector_clock JSONB        NOT NULL DEFAULT '{}',
    updated_at   TIMESTAMPTZ  NOT NULL DEFAULT now(),
    UNIQUE (node_id, crdt_name)
);

CREATE TABLE CRDTMergeLog (
    id           BIGSERIAL    PRIMARY KEY,
    crdt_name    VARCHAR(128) NOT NULL,
    source_node  VARCHAR(64)  NOT NULL,
    merged_at    TIMESTAMPTZ  NOT NULL DEFAULT now(),
    state_before BYTEA        NOT NULL,
    state_after  BYTEA        NOT NULL
);

CREATE INDEX idx_cs_crdt_node ON CRDTState    (crdt_name, node_id);
CREATE INDEX idx_cml_crdt     ON CRDTMergeLog (crdt_name, merged_at DESC);

Python Implementation

import uuid
from typing import Any, Dict, Optional, Set, Tuple

# ── G-Counter ─────────────────────────────────────────────────────────────────

class GCounter:
    def __init__(self, node_id: str):
        self.node_id = node_id
        self._counts: Dict[str, int] = {node_id: 0}

    def increment(self, node_id: Optional[str] = None):
        nid = node_id or self.node_id
        self._counts[nid] = self._counts.get(nid, 0) + 1

    def value(self) -> int:
        return sum(self._counts.values())

    def merge(self, other: 'GCounter'):
        for nid, count in other._counts.items():
            self._counts[nid] = max(self._counts.get(nid, 0), count)

    def state(self) -> Dict:
        return dict(self._counts)


# ── OR-Set (Observed-Remove Set) ──────────────────────────────────────────────

class ORSet:
    def __init__(self):
        self._elements: Dict[Any, Set[str]] = {}  # element -> set of tokens
        self._tombstones: Dict[Any, Set[str]] = {}

    def add(self, element: Any):
        token = str(uuid.uuid4())
        if element not in self._elements:
            self._elements[element] = set()
        self._elements[element].add(token)

    def remove(self, element: Any):
        if element in self._elements:
            observed = set(self._elements[element])
            if element not in self._tombstones:
                self._tombstones[element] = set()
            self._tombstones[element].update(observed)

    def contains(self, element: Any) -> bool:
        live_tokens = self._elements.get(element, set()) - self._tombstones.get(element, set())
        return len(live_tokens) > 0

    def members(self) -> Set:
        return {e for e in self._elements if self.contains(e)}

    def merge(self, other: 'ORSet'):
        for element, tokens in other._elements.items():
            if element not in self._elements:
                self._elements[element] = set()
            self._elements[element].update(tokens)
        for element, tokens in other._tombstones.items():
            if element not in self._tombstones:
                self._tombstones[element] = set()
            self._tombstones[element].update(tokens)


# ── LWW-Register ──────────────────────────────────────────────────────────────

class LWWRegister:
    def __init__(self):
        self._value: Any        = None
        self._ts:    float      = 0.0

    def write(self, value: Any, timestamp: float):
        if timestamp > self._ts:
            self._value = value
            self._ts    = timestamp

    def read(self) -> Any:
        return self._value

    def merge(self, other: 'LWWRegister'):
        if other._ts > self._ts:
            self._value = other._value
            self._ts    = other._ts

CRDT Memory Overhead

CRDTs trade memory for coordination-free convergence. G-Counters grow O(n) with the number of nodes. OR-Sets grow O(elements x replicas) because every add generates a new token. LWW-Registers are O(1) but require clock synchronization. In large deployments, state-based CRDTs (which broadcast full state for merges) can be replaced by op-based CRDTs (which broadcast only the operation) to reduce network overhead — but op-based CRDTs require exactly-once delivery semantics.

Frequently Asked Questions

What is the difference between CRDTs and Operational Transforms?

Operational Transforms (OT) were developed for collaborative text editing. OT transforms concurrent operations so they can be applied in any order and produce the same result. OT requires a central server to linearize operations and is complex to implement correctly for arbitrary data types. CRDTs are decentralized — replicas merge state without coordination — and have formal mathematical convergence proofs. CRDTs are preferred for distributed systems without a central coordinator; OT is still used in some collaborative editors (Google Docs originally used OT).

Why is OR-Set preferred over 2P-Set for most use cases?

2P-Set has a fundamental limitation: once an element is removed, it can never be re-added. This is because the tombstone set is a G-Set (grow-only), and the tombstone permanently overrides the add. OR-Set avoids this by tagging each add with a unique token. A remove only removes the tokens that were observed at remove time. A subsequent add generates a new token, allowing the element to be a member again. OR-Set models real-world semantics more naturally — items can be removed from a shopping cart and then re-added.

What clock guarantees does an LWW-Register require?

An LWW-Register requires that timestamps are monotonically increasing and that two concurrent writes from different nodes produce distinct timestamps. Wall-clock time (NTP) alone is insufficient — NTP can drift and two nodes may generate the same millisecond timestamp. Hybrid Logical Clocks (HLC) solve this: each HLC timestamp is the max of physical time and the highest observed HLC, plus a logical counter. This ensures causality (a write that observes another always has a higher HLC) and provides globally unique, monotonically increasing timestamps without tight clock synchronization.

How much memory do CRDTs consume in large clusters?

G-Counters consume O(n) memory per CRDT instance where n is the number of nodes — one slot per node. OR-Sets consume O(elements x mean_add_count) because each add creates a token that persists until garbage collected. In practice, OR-Set state grows without bound unless a garbage collection protocol (requiring a distributed snapshot of all replicas) is applied to remove obsolete tokens. For deployments with many nodes or high add/remove rates, op-based CRDTs that only transmit the operation (not full state) significantly reduce network overhead, at the cost of requiring at-least-once delivery with deduplication.