Multi-region replication keeps data available and performant for users worldwide by maintaining synchronized copies of data in multiple geographic regions. The core tension: strong consistency (all regions see the same data at the same time) requires coordination that adds cross-region latency; high availability and low latency require accepting eventual consistency. Most systems choose a hybrid — strong consistency within a region, eventual consistency across regions — and design carefully around the cases where users can see stale data.
Replication Topologies
Single primary, multiple read replicas: all writes go to one primary region; other regions have read-only replicas that receive changes via async replication. Typical replication lag: 100-500ms. Reads from any region are fast; writes must route to the primary region (adding latency for users far from the primary). Best for read-heavy applications with a clear geographic concentration of writes.
Active-active (multi-primary): each region accepts writes. Changes are replicated to all other regions asynchronously. Write conflicts (two regions write to the same row simultaneously) must be resolved via last-writer-wins, vector clocks, or application-level merge. More complex but eliminates write latency for all users. Used by systems like DynamoDB Global Tables and CockroachDB.
Region-pinned data: each user's data is physically stored in the user's home region and never replicated outside it (e.g., to satisfy GDPR or other data-residency requirements). Requests from other regions are proxied to the user's home region. No replication lag and no conflict resolution, but cross-region access is slow.
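A minimal sketch of region-pinned request routing, with an in-memory dict standing in for the User.home_region lookup (the function name and the "local:"/"proxied" return values are illustrative):

```python
LOCAL_REGION = "eu-west-1"

# Hypothetical stand-in for the User table's home_region column.
USER_HOME_REGION = {1: "eu-west-1", 2: "us-east-1"}

def handle_request(user_id: int, request: str) -> str:
    """Serve locally if the user's data lives here; otherwise proxy it."""
    home = USER_HOME_REGION[user_id]
    if home == LOCAL_REGION:
        return f"local:{request}"            # data is here: no replication lag
    # Cross-region proxy: always correct, but adds one inter-region RTT
    return f"proxied-to-{home}:{request}"
```

In a real system the lookup would hit the (globally replicated) User table, and the proxy would be an HTTP or gRPC forward to the home region's application tier.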
Data Model for Region Routing
CREATE TABLE User (
    user_id     BIGINT PRIMARY KEY,
    home_region VARCHAR(20) NOT NULL,  -- 'us-east-1', 'eu-west-1', 'ap-southeast-1'
    email       VARCHAR(255) UNIQUE NOT NULL,
    created_at  TIMESTAMPTZ DEFAULT NOW()
);

-- Global routing table replicated to all regions (read-only in non-primary)
CREATE TABLE RegionConfig (
    region_id           VARCHAR(20) PRIMARY KEY,
    db_endpoint         VARCHAR(255) NOT NULL,
    is_primary          BOOLEAN NOT NULL DEFAULT FALSE,
    is_accepting_writes BOOLEAN NOT NULL DEFAULT TRUE
);

-- Replication position tracking (for lag monitoring)
CREATE TABLE ReplicationStatus (
    source_region VARCHAR(20) NOT NULL,
    target_region VARCHAR(20) NOT NULL,
    last_lsn      BIGINT NOT NULL,  -- last WAL position applied
    lag_ms        INT,              -- measured replication lag
    updated_at    TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (source_region, target_region)
);
Write Routing and Read-Your-Writes Consistency
class MultiRegionRouter:
    def __init__(self, local_region: str):
        self.local_region = local_region
        self.primary_region = self._get_primary_region()

    def write(self, query: str, params: list):
        if self.local_region == self.primary_region:
            return self.local_db.execute(query, params)
        # Route the write to the primary region synchronously.
        # Adds one cross-region RTT (~70ms US-EU, ~150ms US-APAC).
        return self.primary_db_client.execute(query, params)

    def read(self, query: str, params: list, user_id: int = None):
        # Read-your-writes: if this user wrote recently, the local replica
        # may not have the change yet -- route the read to the primary.
        if user_id and self._user_wrote_recently(user_id):
            return self.primary_db_client.execute(query, params)
        # Otherwise serve from the local replica (fast, may be slightly stale).
        return self.local_db.execute(query, params)

    def _user_wrote_recently(self, user_id: int) -> bool:
        """Check if the user has a pending write in the last 5 seconds."""
        key = f"wrote:{self.local_region}:{user_id}"
        return redis_local.exists(key)

    def mark_user_wrote(self, user_id: int):
        """Call after every write so subsequent reads route to the primary."""
        key = f"wrote:{self.local_region}:{user_id}"
        redis_local.setex(key, 5, 1)  # 5-second TTL covers typical replication lag
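The sticky-routing window behind mark_user_wrote / _user_wrote_recently can be sketched without Redis, using an in-memory map of last-write times (the names and the 5-second window are illustrative, matching the SETEX TTL above):

```python
import time

# In-memory stand-in for the per-region Redis "wrote recently" flag.
_wrote_at: dict[int, float] = {}
STICKY_WINDOW_S = 5.0  # should exceed typical cross-region replication lag

def mark_user_wrote(user_id: int) -> None:
    """Record the time of the user's last write (Redis SETEX equivalent)."""
    _wrote_at[user_id] = time.monotonic()

def should_read_from_primary(user_id: int) -> bool:
    """True while the user's last write may not have replicated yet."""
    ts = _wrote_at.get(user_id)
    return ts is not None and time.monotonic() - ts < STICKY_WINDOW_S
```

Redis is preferred in production because the flag must be shared across all application servers in the region and expire automatically; the dict version only works within a single process.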
Conflict Resolution for Active-Active
-- Last-writer-wins: use a hybrid logical clock (HLC) timestamp.
-- Each write carries a timestamp from the writing region's HLC.
CREATE TABLE UserProfile (
    user_id      BIGINT PRIMARY KEY,
    display_name VARCHAR(100),
    bio          TEXT,
    hlc_ts       BIGINT NOT NULL  -- HLC timestamp, monotonically increasing
);

-- On conflict (same user_id written in two regions simultaneously),
-- keep the row with the higher hlc_ts:
INSERT INTO UserProfile (user_id, display_name, bio, hlc_ts)
VALUES (%s, %s, %s, %s)
ON CONFLICT (user_id) DO UPDATE
SET display_name = EXCLUDED.display_name,
    bio = EXCLUDED.bio,
    hlc_ts = EXCLUDED.hlc_ts
WHERE EXCLUDED.hlc_ts > UserProfile.hlc_ts;  -- only overwrite if newer
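The hlc_ts column assumes each region runs a hybrid logical clock. A minimal sketch of one, packing (physical milliseconds, logical counter) into a single integer so timestamps compare like the BIGINT column (the 16-bit counter width and class name are assumptions, not a standard layout):

```python
import time

class HLC:
    """Minimal hybrid logical clock sketch. Timestamps are
    (physical_ms << 16) | counter, so integer comparison orders them."""

    def __init__(self):
        self.last_ms = 0
        self.counter = 0

    def now(self) -> int:
        """Generate a timestamp for a local write: always increases."""
        pt = int(time.time() * 1000)
        if pt > self.last_ms:
            self.last_ms, self.counter = pt, 0
        else:
            # Wall clock stalled or stepped backward (e.g. NTP correction):
            # keep the old physical part and bump the logical counter.
            self.counter += 1
        return (self.last_ms << 16) | (self.counter & 0xFFFF)

    def observe(self, remote_ts: int) -> None:
        """Merge a timestamp received from another region, so our next
        local timestamp is ordered after everything we have seen."""
        remote_ms, remote_c = remote_ts >> 16, remote_ts & 0xFFFF
        if remote_ms > self.last_ms:
            self.last_ms, self.counter = remote_ms, remote_c
        elif remote_ms == self.last_ms:
            self.counter = max(self.counter, remote_c)
```

The key property for LWW: timestamps never go backward, even when the wall clock does, and observing a remote timestamp pushes the local clock forward so causally later writes get larger hlc_ts values.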
-- For shopping carts (additive semantics, not last-writer-wins):
-- Use CRDTs (Conflict-free Replicated Data Types)
-- Each region maintains a G-Set (grow-only set) of cart items
-- Merge = union of all G-Sets from all regions
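The G-Set idea can be sketched in a few lines (GSetCart and its string items are illustrative):

```python
class GSetCart:
    """Grow-only set CRDT for a shopping cart (sketch). Each region adds
    items locally; replicas merge by set union, which is commutative,
    associative, and idempotent -- so replication order does not matter
    and concurrent adds in different regions never conflict."""

    def __init__(self):
        self.items = set()

    def add(self, item: str) -> None:
        self.items.add(item)

    def merge(self, other: "GSetCart") -> None:
        self.items |= other.items  # union of both regions' views
```

Note the limitation: a G-Set can only grow, so "remove from cart" needs a richer CRDT (e.g. an OR-Set, which tags each add with a unique ID so removals can be merged safely).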
Replication Lag Monitoring and Failover
def monitor_replication_lag():
    """Alert if any region falls too far behind."""
    for status in db.fetchall("SELECT * FROM ReplicationStatus"):
        lag = status['lag_ms']
        if lag > 5000:  # more than 5 seconds behind
            alert_critical(f"Replication lag to {status['target_region']}: {lag}ms")
        elif lag > 1000:
            alert_warning(f"Replication lag to {status['target_region']}: {lag}ms")
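One common way to produce the lag_ms values the monitor reads is a heartbeat table: a job on the primary writes its clock every second, and the gap between "now" and the heartbeat value the replica has applied bounds the lag. A sketch, with a FakeReplica stub standing in for a real replica connection (the Heartbeat table and all names here are assumptions):

```python
import time

class FakeReplica:
    """Stub replica whose applied heartbeat row is `lag_ms` behind real time."""
    def __init__(self, lag_ms: int):
        self.lag_ms = lag_ms

    def fetchone(self, query: str):
        return {"ts_ms": int(time.time() * 1000) - self.lag_ms}

def replica_lag_ms(replica_db) -> int:
    """Estimate lag from the last primary heartbeat the replica has applied.
    Assumes a job on the primary updates Heartbeat.ts_ms once per second,
    so (now - replicated heartbeat) bounds the replication lag to ~1s."""
    row = replica_db.fetchone("SELECT ts_ms FROM Heartbeat WHERE id = 1")
    return max(0, int(time.time() * 1000) - row["ts_ms"])
```

On PostgreSQL the same measurement can come from WAL positions (comparing the primary's current LSN with the replica's last-replayed LSN), but the heartbeat approach is engine-agnostic and directly yields milliseconds.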
def failover_to_replica(failed_region: str, replica_region: str):
    """Promote a replica to primary when the primary region fails."""
    # Step 1: Verify the replica has caught up (or accept data loss up to lag)
    lag = get_replication_lag(failed_region, replica_region)
    if lag > ACCEPTABLE_DATA_LOSS_MS:
        raise FailoverRiskTooHigh(f"Replica is {lag}ms behind")
    # Step 2: Fence the old primary BEFORE promoting, to prevent split-brain
    # (STONITH: Shoot The Other Node In The Head). If the failed region
    # comes back online after promotion, it must not accept writes.
    fence_region(failed_region)
    # Step 3: Promote the replica to accept writes
    db.execute("""
        UPDATE RegionConfig
        SET is_primary=TRUE, is_accepting_writes=TRUE
        WHERE region_id=%s
    """, [replica_region])
    # Step 4: Update global DNS to point to the new primary
    update_dns_record('db-primary.example.com', replica_region_endpoint)
Key Interview Points
- The CAP theorem in practice: multi-region systems must choose between consistency and availability during a network partition (region isolation). Most choose availability + eventual consistency (AP) over strong consistency (CP).
- Read-your-writes consistency is the most important consistency guarantee for UX — users must see their own changes immediately even on eventually-consistent replicas. Implement via session pinning or a “wrote recently” flag.
- Last-writer-wins requires monotonic timestamps (Hybrid Logical Clocks, not wall clocks — wall clocks can go backward on NTP corrections).
- RPO (Recovery Point Objective) = maximum acceptable data loss = replication lag at time of failure. RTO (Recovery Time Objective) = maximum acceptable downtime. Design failover automation to meet both SLAs.
- Fencing (STONITH) prevents split-brain: before promoting a replica to primary, ensure the old primary cannot accept writes — revoke its network access, invalidate its lease, or physically power it off.
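The lease-based variant of fencing can be sketched as a toy single-process model. In production the lease would live in a consensus store (etcd, ZooKeeper); here a plain object stands in to show the protocol shape, and LeaseFence and its method names are illustrative:

```python
import time

class LeaseFence:
    """Toy model of lease-based fencing. The region holding a live lease
    is the only one allowed to accept writes; a replica can be promoted
    only after the old primary's lease has expired."""

    def __init__(self, ttl_s: float):
        self.ttl_s = ttl_s
        self.holder = None
        self.expires_at = 0.0

    def acquire(self, region: str) -> bool:
        """Try to take the lease; fails while another region holds it."""
        now = time.monotonic()
        if self.holder is None or now >= self.expires_at:
            self.holder, self.expires_at = region, now + self.ttl_s
            return True
        return False

    def may_write(self, region: str) -> bool:
        """A region may accept writes only while its lease is live.
        (A real primary would also renew the lease periodically.)"""
        return self.holder == region and time.monotonic() < self.expires_at
```

This makes the RTO floor from the last bullet concrete: the new primary cannot acquire the lease until the old one's TTL runs out, so failover can never complete faster than the lease expiration.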
Frequently Asked Questions

What is the difference between active-active and active-passive multi-region setups?
Active-passive (single primary): one region accepts all writes; the others are read-only replicas synchronized via async replication. Writes from non-primary regions are proxied to the primary, adding cross-region latency (50-200ms RTT). Conflict resolution is trivial because there is only one writer, but failover requires promoting a replica to primary, typically 30-120 seconds. Active-active (multi-primary): every region accepts writes locally and replicates to the others asynchronously. Write latency is always local, and failover is effectively instant since other regions already accept writes, but concurrent writes to the same row require conflict resolution (last-writer-wins, CRDTs, or application-level merge), which makes the system much more complex. Choose active-passive for most applications; use active-active only when local write latency for global users is a hard requirement.

How do you implement read-your-writes consistency across regions?
After a user performs a write that goes to the primary region, their subsequent reads in another region may hit a replica that has not yet applied the change, so they see stale data despite having just written. Two solutions: (1) Session pinning: store the primary's LSN at write time in the user's session; before serving a read from a replica, check that the replica has applied at least that LSN (e.g., via pg_last_wal_replay_lsn()), and proxy the read to the primary if not. (2) Sticky routing: for 5-10 seconds after a write, route all of that user's reads to the primary region, tracked with a local Redis key (SETEX wrote:{user_id} 10 1) checked before each read. Both approaches add slight overhead for the small fraction of reads that follow a recent write; the vast majority of reads are still served from local replicas.

How do you resolve write conflicts in an active-active setup?
Last-writer-wins (LWW): keep the write with the higher timestamp. This requires monotonic timestamps, so use hybrid logical clocks (HLC), not wall clocks, which can go backward on NTP sync. Implement as ON CONFLICT DO UPDATE ... WHERE excluded.hlc_ts > current.hlc_ts. Works well for profile updates where the latest value is correct. CRDTs (conflict-free replicated data types): design data structures that merge automatically without conflicts. A shopping cart can be a G-Set (grow-only set): adding an item in two regions simultaneously is not a conflict, both additions are kept, and merge is set union; counters use PN-counters (increments and decrements replicated independently). Application-level merge: for complex objects (documents, settings), store a per-region edit log and merge on read, e.g. with operational transforms. CRDTs are the most principled approach; LWW is the simplest.

What are RPO and RTO, and how do they shape multi-region design?
RPO (Recovery Point Objective) is the maximum acceptable data loss if a region fails, measured in time: RPO=0 means zero data loss and requires synchronous cross-region replication, which adds write latency; RPO=30s means up to 30 seconds of writes may be lost, so async replication is acceptable as long as lag stays under 30 seconds. RTO (Recovery Time Objective) is the maximum acceptable downtime: RTO=1min requires automated failover with pre-warmed replicas, while RTO=15min allows manual failover. These SLAs directly determine the architecture, so define them explicitly before designing replication -- they drive cost, complexity, and latency trade-offs.

What is split-brain and how do you prevent it during regional failover?
Split-brain occurs when two regions both believe they are the primary and simultaneously accept writes: if region A fails, region B is promoted, and region A then comes back online still believing it is primary, both accept writes and the data diverges irrecoverably. Prevention is fencing (STONITH): before promoting region B, revoke region A's ability to write by updating its DNS record, revoking its database credentials, or cutting its network access; or use a distributed lease held in a Raft-based consensus system (etcd, ZooKeeper), where region B accepts promotion only after region A's lease has expired and B has acquired a new one. The lease TTL therefore sets the minimum RTO: you cannot fail over faster than the lease expiration.