Multi-Master Replication Low-Level Design: Circular Replication, Conflict Avoidance, and Write Coordination

What Is Multi-Master Replication?

In multi-master replication, all nodes are primaries — every node accepts both reads and writes simultaneously. Changes made on any node are replicated to all other nodes. This maximizes write availability: the cluster continues to accept writes even if all but one node fail. The fundamental challenge is that concurrent writes to the same row on different masters create conflicts that must be detected and resolved, and the replication topology must be carefully managed to avoid infinite replication loops.

Replication Topologies

  • Circular (ring): A replicates to B, B replicates to C, C replicates back to A. Simple and requires only one replication connection per node, but a single node failure breaks the ring unless bypass connections are added.
  • Star (hub-and-spoke): all nodes replicate through a central hub. The hub is a single point of failure; losing the hub stops replication across the cluster.
  • Mesh (all-to-all): every node replicates to every other node directly. Highest resilience — any node failure does not affect replication between remaining nodes. O(n^2) replication connections; practical up to 5-6 nodes.
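The connection-count tradeoff above is easy to quantify. A quick sketch (counting directed replication links; the helper name is illustrative):

```python
def replication_connections(n: int, topology: str) -> int:
    """Directed replication connections needed for an n-node cluster."""
    if topology == "ring":
        return n                  # each node feeds exactly one neighbor
    if topology == "star":
        return 2 * (n - 1)        # hub sends to and receives from each spoke
    if topology == "mesh":
        return n * (n - 1)        # every ordered pair of nodes
    raise ValueError(f"unknown topology: {topology}")

print(replication_connections(5, "ring"))   # 5
print(replication_connections(5, "mesh"))   # 20
```

At 5 nodes a mesh already needs 20 directed links, which is why mesh topologies are rarely pushed past 5-6 nodes.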

Preventing Replication Loops

In a circular or mesh topology, an event originating on node A and replicated to node B must not be re-replicated back to A — this would create an infinite loop. MySQL prevents this with the server_id mechanism: each server has a unique integer ID. Binlog events carry the originating server_id. When a replica receives an event, it checks whether the server_id matches its own ID; if so, it ignores the event as one it originated. This ensures each event is applied exactly once per node regardless of topology complexity.
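The server_id check can be sketched in a few lines; the function and in-memory store here are illustrative, not MySQL internals:

```python
def apply_replicated_event(my_server_id: int, event: dict, store: dict) -> bool:
    """Apply a binlog-style event unless this node originated it.

    Returns True if the event was applied, False if it was dropped
    because it travelled around the topology back to its originator.
    """
    if event["server_id"] == my_server_id:
        return False                      # our own event: drop, never re-replicate
    store[event["key"]] = event["value"]
    return True

store = {}
event = {"server_id": 1, "key": 42, "value": "x"}
apply_replicated_event(2, event, store)   # node 2 applies it
apply_replicated_event(1, event, store)   # node 1 drops its own event
```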

Write Conflicts

The same row updated on two masters simultaneously is the core problem. In deployments where a router can pin all writes for a given key to one node, conflicts can be avoided structurally; multi-master topologies are typically chosen precisely because any node should accept any write, which makes routing-based avoidance difficult. Conflicts are detected by comparing version metadata (timestamps, vector clocks, or row checksums) on replicated events that target the same primary key.

Conflict Avoidance: Primary Key Range Partitioning

The most effective conflict avoidance strategy is to partition the write space by primary key range:

  • Master A owns inserts for IDs 1 – 1,000,000
  • Master B owns inserts for IDs 1,000,001 – 2,000,000
  • Master C owns inserts for IDs 2,000,001 – 3,000,000

Each master only inserts rows in its assigned range. Updates to existing rows must still be routed to the master that owns the row's range. Cross-range foreign keys require careful schema design to avoid deadlocks. MySQL supports auto-increment offset and increment settings (auto_increment_offset, auto_increment_increment) to implement this automatically for sequential integer PKs.
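The offset/increment arithmetic can be checked directly. With auto_increment_increment set to the cluster size and a distinct auto_increment_offset per master, the ID streams interleave without ever colliding (a sketch mirroring MySQL's behaviour, not MySQL itself):

```python
def auto_increment_ids(offset: int, increment: int, count: int) -> list:
    """IDs a master generates under the given auto_increment settings:
    offset, offset + increment, offset + 2*increment, ..."""
    return [offset + i * increment for i in range(count)]

# Three masters, increment=3, offsets 1/2/3: disjoint streams.
a = auto_increment_ids(1, 3, 4)   # [1, 4, 7, 10]
b = auto_increment_ids(2, 3, 4)   # [2, 5, 8, 11]
c = auto_increment_ids(3, 3, 4)   # [3, 6, 9, 12]
assert not (set(a) & set(b) or set(a) & set(c) or set(b) & set(c))
```

The cost is sparse, non-contiguous IDs, and adding a master later means raising the increment on every existing node.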

Global Sequences: UUID and Snowflake IDs

UUID v4 (random) and Snowflake IDs (timestamp + node ID + sequence) generate globally unique identifiers without central coordination. No two masters will ever generate the same PK, eliminating insert-insert conflicts entirely. The tradeoff: UUIDs cause index fragmentation in B-tree indexes due to random ordering; Snowflake IDs are time-ordered and friendlier to clustered indexes.
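A minimal Snowflake-style generator, assuming the classic 41-bit timestamp / 10-bit node / 12-bit sequence layout (real deployments vary the field widths and epoch):

```python
import threading
import time

class SnowflakeGenerator:
    """64-bit time-ordered IDs: 41-bit ms timestamp | 10-bit node | 12-bit sequence."""
    EPOCH = 1_600_000_000_000   # custom epoch in ms, arbitrary for this sketch

    def __init__(self, node_id: int):
        assert 0 <= node_id < 1024           # must fit in 10 bits
        self.node_id = node_id
        self.seq = 0
        self.last_ms = -1
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            now = int(time.time() * 1000) - self.EPOCH
            if now == self.last_ms:
                self.seq = (self.seq + 1) & 0xFFF   # 4096 IDs per ms per node
                if self.seq == 0:                   # sequence exhausted: spin to next ms
                    while now <= self.last_ms:
                        now = int(time.time() * 1000) - self.EPOCH
            else:
                self.seq = 0
            self.last_ms = now
            return (now << 22) | (self.node_id << 12) | self.seq

gen = SnowflakeGenerator(node_id=7)
ids = [gen.next_id() for _ in range(1000)]
assert len(set(ids)) == 1000 and ids == sorted(ids)   # unique and time-ordered
```

Because the timestamp occupies the high bits, consecutive IDs land near each other in a clustered index, avoiding the random-insert fragmentation that UUID v4 causes.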

Galera Cluster: Certification-Based Multi-Master

Galera Cluster (used by MariaDB Cluster and Percona XtraDB Cluster) implements synchronous multi-master replication via certification-based conflict detection. When a transaction commits, Galera broadcasts the write set (the set of rows modified) to all nodes. Each node independently certifies the write set against its local transaction history: if no conflicting write set has been certified since the transaction began, the commit succeeds on all nodes simultaneously. If a conflict is detected, one transaction is rolled back. This guarantees:

  • No data divergence — all nodes always have identical data.
  • Synchronous replication — a commit is not acknowledged until all nodes certify it.
  • Write scalability limited by the slowest node and the certification overhead.
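The certification test itself can be modelled in a few lines. A toy version, assuming each transaction carries the global sequence number it started at plus the set of keys it wrote (a simplification of Galera's actual write-set certification):

```python
class Certifier:
    """Toy certification: runs identically and deterministically on every node."""
    def __init__(self):
        self.seqno = 0
        self.history = []   # (seqno, keys) for every certified write set

    def certify(self, start_seqno: int, write_set: set) -> bool:
        """Commit iff no write set certified after start_seqno overlaps ours."""
        for seqno, keys in self.history:
            if seqno > start_seqno and keys & write_set:
                return False                          # conflict: transaction rolls back
        self.seqno += 1
        self.history.append((self.seqno, frozenset(write_set)))
        return True

c = Certifier()
t1_start = t2_start = c.seqno        # two transactions start concurrently
assert c.certify(t1_start, {"row:1"}) is True    # t1 certifies first and commits
assert c.certify(t2_start, {"row:1"}) is False   # t2 overlaps t1: rolled back
assert c.certify(t2_start, {"row:2"}) is True    # disjoint write set: commits
```

Because every node runs the same deterministic test against the same totally ordered history, all nodes reach the same commit/abort decision without exchanging votes.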

SQL Schema

CREATE TABLE MasterNode (
    id             BIGSERIAL    PRIMARY KEY,
    address        VARCHAR(256) NOT NULL,
    pk_range_start BIGINT       NOT NULL,
    pk_range_end   BIGINT       NOT NULL,
    server_id      INT          NOT NULL UNIQUE,
    status         VARCHAR(32)  NOT NULL DEFAULT 'active'
);

CREATE TABLE ReplicationConflict (
    id                 BIGSERIAL    PRIMARY KEY,
    key                VARCHAR(256) NOT NULL,
    master_a_id        BIGINT       NOT NULL REFERENCES MasterNode(id),
    master_b_id        BIGINT       NOT NULL REFERENCES MasterNode(id),
    conflict_timestamp TIMESTAMPTZ  NOT NULL DEFAULT now(),
    resolution         VARCHAR(64)
);

-- server_id already has a unique index from its UNIQUE constraint; no separate index needed
CREATE INDEX idx_rc_key_ts ON ReplicationConflict (key, conflict_timestamp DESC);

Python Implementation

import time
import threading
from typing import Any, Dict, List, Optional, Tuple
from dataclasses import dataclass

@dataclass
class MasterNode:
    node_id:       int
    server_id:     int
    pk_range:      Tuple[int, int]   # (start, end) inclusive
    store:         Optional[Dict[int, Any]] = None

    def __post_init__(self):
        if self.store is None:
            self.store = {}

    def owns_key(self, key: int) -> bool:
        return self.pk_range[0] <= key <= self.pk_range[1]


class MultiMasterCoordinator:
    """Routes each write to the owning master and replicates it to the rest."""

    def __init__(self, nodes: List[MasterNode]):
        self.nodes = {n.server_id: n for n in nodes}   # keyed by server_id
        self._lock = threading.Lock()
        self._events: List[Dict] = []

    def write(self, key: int, value: Any) -> bool:
        """Route write to the master that owns this PK range."""
        owner = next(
            (n for n in self.nodes.values() if n.owns_key(key)), None
        )
        if owner is None:
            raise ValueError(f"No master owns key {key}")
        with self._lock:
            owner.store[key] = value
            event = {
                'server_id': owner.server_id,
                'key':       key,
                'value':     value,
                'ts':        time.time()
            }
            self._events.append(event)
        self._replicate(event, source_server_id=owner.server_id)
        return True

    def _replicate(self, event: Dict, source_server_id: int):
        """Replicate event to all nodes except the source (loop prevention via server_id)."""
        for sid, node in self.nodes.items():
            if sid == source_server_id:
                continue   # skip originating node (server_id check)
            with self._lock:
                existing = node.store.get(event['key'])
                if existing is None:
                    node.store[event['key']] = event['value']
                else:
                    conflict = self.detect_conflict(
                        event['key'],
                        {'value': existing, 'server_id': sid},
                        event
                    )
                    if conflict:
                        resolved = self.merge_conflict(
                            event['key'],
                            # stored values carry no timestamp in this sketch,
                            # so last-write-wins favours the replicated event
                            {'value': existing, 'ts': 0},
                            {'value': event['value'], 'ts': event['ts']}
                        )
                        node.store[event['key']] = resolved

    def detect_conflict(self, key: int, version_a: Dict, version_b: Dict) -> bool:
        """True if two versions represent a genuine write-write conflict."""
        return (version_a.get('server_id') != version_b.get('server_id') and
                version_a.get('value') != version_b.get('value'))

    def merge_conflict(self, key: int, version_a: Dict, version_b: Dict) -> Any:
        """
        Application merge: last-write-wins by timestamp.
        Override with domain-specific logic (union, field-level merge, etc.)
        """
        ts_a = version_a.get('ts', 0)
        ts_b = version_b.get('ts', 0)
        winner = version_a if ts_a >= ts_b else version_b
        print(f"Conflict on key {key} resolved: {winner['value']}")
        return winner['value']

Conflict Rate Monitoring

Multi-master conflict rate should be tracked as an operational metric. A conflict rate above 0.1% per minute indicates that the PK range partitioning is misconfigured, that writes are not being correctly routed to the owning master, or that the workload has an inherent write-sharing pattern that requires a different architecture (e.g., Galera certification for guaranteed convergence, or reverting to single-master for conflicted tables).
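A sliding-window monitor along these lines (the 60-second window and 0.1% threshold come from the text above; the class itself is a sketch):

```python
import time
from collections import deque

class ConflictRateMonitor:
    """Tracks conflicts as a fraction of writes over a sliding time window."""
    def __init__(self, window_s: float = 60.0, threshold: float = 0.001):
        self.window_s = window_s
        self.threshold = threshold        # 0.1% of writes per window
        self.events = deque()             # (timestamp, is_conflict)

    def record(self, is_conflict: bool, now=None):
        now = time.time() if now is None else now
        self.events.append((now, is_conflict))
        cutoff = now - self.window_s
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()         # evict samples outside the window

    def conflict_rate(self) -> float:
        if not self.events:
            return 0.0
        return sum(1 for _, c in self.events if c) / len(self.events)

    def breached(self) -> bool:
        return self.conflict_rate() > self.threshold

m = ConflictRateMonitor()
for i in range(1000):
    m.record(is_conflict=(i % 100 == 0), now=1000.0)   # 1% conflict rate
assert m.breached()   # 1% > 0.1%: check routing and PK range configuration
```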

Frequently Asked Questions

What is the difference between circular and mesh multi-master topology?

In circular topology, each node replicates to one neighbor (A-to-B-to-C-to-A). It requires only one replication connection per node but is fragile: a single node failure breaks the ring for nodes downstream. In mesh topology, every node replicates directly to every other node. Any single node failure leaves the remaining nodes fully connected. Mesh requires O(n^2) connections but is the preferred topology for clusters of 3-5 nodes where resilience is critical.

How does server_id prevent replication loops in MySQL?

Every binlog event carries the server_id of the MySQL instance that originally generated it. When a replica receives an event and finds that the server_id matches its own configured server_id, it discards the event without applying it. This ensures that no matter how many hops an event travels through the topology, each node applies it exactly once and never re-replicates its own events back into the cluster.

How does primary key range partitioning prevent write conflicts?

By assigning each master exclusive ownership of a non-overlapping PK range, insert conflicts are structurally impossible — no two masters will ever insert a row with the same PK. Updates must be routed to the master that owns the PK range for the target row, ensuring that concurrent updates to the same row always go to the same master. Cross-range updates (joining rows owned by different masters) require application-level coordination.

What is Galera certification-based replication?

Galera Cluster broadcasts each transaction's write set to all nodes before committing. Every node independently checks whether any concurrent transaction has modified the same rows (certification). If no conflict is found, all nodes commit simultaneously — achieving true synchronous multi-master replication with no data divergence. If a conflict is detected, one transaction is aborted and the client retries. This eliminates the need for conflict resolution post-commit but adds certification latency to every write.


How are write-write conflicts detected in multi-master replication?

Each write is tagged with a timestamp or vector clock. When a node receives a replicated write for a row it has also modified locally, it compares the timestamps or clocks to detect the conflict. The database engine or application then applies a resolution rule such as last-write-wins, a merge, or routing the conflict to a human review queue.

How does circular replication avoid infinite loops?

Each write is stamped with the originating server's ID; when a node receives a replicated event and recognises its own server ID in the origin chain, it discards the event instead of re-replicating it. MySQL Group Replication and Galera Cluster implement this via a global transaction ID (GTID) set that each node tracks to deduplicate already-applied transactions.

How is schema change coordination handled across masters?

Schema changes are the most dangerous operation in multi-master setups because an incompatible DDL applied on one master can break replication on the others. The safest approach is to use an online schema change tool (e.g., pt-online-schema-change or gh-ost) that applies changes in a backward-compatible sequence, or to temporarily quiesce all writes before rolling out the DDL.

What is the difference between multi-master and active-active?

Multi-master is a database replication topology in which multiple nodes accept writes and synchronise changes with each other. Active-active is a broader availability architecture in which multiple full application stacks (including databases) serve traffic simultaneously across regions or data centres. Active-active systems often use multi-master replication underneath, but the term describes the whole-system availability pattern, not just the database layer.

