Multi-Region Replication Low-Level Design: Global Data Distribution and Failover

Multi-region replication keeps data available and performant for users worldwide by maintaining synchronized copies of data in multiple geographic regions. The core tension: strong consistency (all regions see the same data at the same time) requires coordination that adds cross-region latency; high availability and low latency require accepting eventual consistency. Most systems choose a hybrid — strong consistency within a region, eventual consistency across regions — and design carefully around the cases where users can see stale data.

Replication Topologies

Single primary, multiple read replicas: all writes go to one primary region; other regions have read-only replicas that receive changes via async replication. Typical replication lag: 100-500ms. Reads from any region are fast; writes must route to the primary region (adding latency for users far from the primary). Best for read-heavy applications with a clear geographic concentration of writes.

Active-active (multi-primary): each region accepts writes. Changes are replicated to all other regions asynchronously. Write conflicts (two regions write to the same row simultaneously) must be resolved via last-writer-wins, vector clocks, or application-level merge. More complex but eliminates write latency for all users. Used by systems like DynamoDB Global Tables and CockroachDB.

Region-pinned data: user data is physically stored in the user’s home region and never replicated (GDPR compliance). Requests from other regions are proxied to the user’s home region. No replication lag, no conflict resolution, but cross-region access is slow.

Data Model for Region Routing

CREATE TABLE User (
    user_id     BIGINT PRIMARY KEY,
    home_region VARCHAR(20) NOT NULL,  -- 'us-east-1', 'eu-west-1', 'ap-southeast-1'
    email       VARCHAR(255) UNIQUE NOT NULL,
    created_at  TIMESTAMPTZ DEFAULT NOW()
);

-- Global routing table replicated to all regions (read-only in non-primary)
CREATE TABLE RegionConfig (
    region_id     VARCHAR(20) PRIMARY KEY,
    db_endpoint   VARCHAR(255) NOT NULL,
    is_primary    BOOLEAN NOT NULL DEFAULT FALSE,
    is_accepting_writes BOOLEAN NOT NULL DEFAULT TRUE
);

-- Replication position tracking (for lag monitoring)
CREATE TABLE ReplicationStatus (
    source_region  VARCHAR(20) NOT NULL,
    target_region  VARCHAR(20) NOT NULL,
    last_lsn       BIGINT NOT NULL,    -- last WAL position applied
    lag_ms         INT,                -- measured replication lag
    updated_at     TIMESTAMPTZ DEFAULT NOW(),
    PRIMARY KEY (source_region, target_region)
);

Write Routing and Read-Your-Writes Consistency

class MultiRegionRouter:
    def __init__(self, local_region: str):
        self.local_region = local_region
        self.primary_region = self._get_primary_region()

    def write(self, query: str, params: list):
        if self.local_region == self.primary_region:
            return self.local_db.execute(query, params)
        else:
            # Route write to primary region synchronously
            # Adds cross-region RTT (~70ms US-EU, ~150ms US-APAC)
            return self.primary_db_client.execute(query, params)

    def read(self, query: str, params: list, user_id: int = None):
        # Check if this user recently wrote (read-your-writes consistency)
        if user_id and self._user_wrote_recently(user_id):
            # Route to primary to ensure we read the latest data
            return self.primary_db_client.execute(query, params)
        # Serve from local replica (fast, may be slightly stale)
        return self.local_db.execute(query, params)

    def _user_wrote_recently(self, user_id: int) -> bool:
        """Check if user has a pending write in the last 5 seconds."""
        key = f"wrote:{self.local_region}:{user_id}"
        return redis_local.exists(key)

    def mark_user_wrote(self, user_id: int):
        """Set after a write to ensure subsequent reads route to primary."""
        key = f"wrote:{self.local_region}:{user_id}"
        redis_local.setex(key, 5, 1)  # 5 seconds -- covers typical replication lag

Conflict Resolution for Active-Active

-- Last-writer-wins: use a hybrid logical clock (HLC) timestamp
-- Each write includes a timestamp from the writing region's HLC
CREATE TABLE UserProfile (
    user_id      BIGINT PRIMARY KEY,
    display_name VARCHAR(100),
    bio          TEXT,
    hlc_ts       BIGINT NOT NULL  -- HLC timestamp, monotonically increasing
);

-- On conflict (same user_id written in two regions simultaneously):
-- Keep the row with the higher hlc_ts
INSERT INTO UserProfile (user_id, display_name, bio, hlc_ts)
VALUES (%s, %s, %s, %s)
ON CONFLICT (user_id) DO UPDATE
SET display_name = EXCLUDED.display_name,
    bio = EXCLUDED.bio,
    hlc_ts = EXCLUDED.hlc_ts
WHERE EXCLUDED.hlc_ts > UserProfile.hlc_ts;  -- only overwrite if newer

-- For shopping carts (additive semantics, not last-writer-wins):
-- Use CRDTs (Conflict-free Replicated Data Types)
-- Each region maintains a G-Set (grow-only set) of cart items
-- Merge = union of all G-Sets from all regions

Replication Lag Monitoring and Failover

def monitor_replication_lag():
    """Alert if any region falls too far behind."""
    for status in db.fetchall("SELECT * FROM ReplicationStatus"):
        lag = status['lag_ms']
        if lag > 5000:   # 5 seconds lag
            alert_critical(f"Replication lag to {status['target_region']}: {lag}ms")
        elif lag > 1000:
            alert_warning(f"Replication lag to {status['target_region']}: {lag}ms")

def failover_to_replica(failed_region: str, replica_region: str):
    """Promote a replica to primary when the primary region fails."""
    # Step 1: Verify replica has caught up (or accept data loss up to lag)
    lag = get_replication_lag(failed_region, replica_region)
    if lag > ACCEPTABLE_DATA_LOSS_MS:
        raise FailoverRiskTooHigh(f"Replica is {lag}ms behind")

    # Step 2: Promote replica to accept writes
    db.execute("""
        UPDATE RegionConfig
        SET is_primary=TRUE, is_accepting_writes=TRUE
        WHERE region_id=%s
    """, [replica_region])

    # Step 3: Update global DNS to point to the new primary
    update_dns_record('db-primary.example.com', replica_region_endpoint)

    # Step 4: Prevent split-brain by fencing the old primary
    # (STONITH: Shoot The Other Node In The Head)
    fence_region(failed_region)

Key Interview Points

  • The CAP theorem in practice: multi-region systems must choose between consistency and availability during a network partition (region isolation). Most choose availability + eventual consistency (AP) over strong consistency (CP).
  • Read-your-writes consistency is the most important consistency guarantee for UX — users must see their own changes immediately even on eventually-consistent replicas. Implement via session pinning or a “wrote recently” flag.
  • Last-writer-wins requires monotonic timestamps (Hybrid Logical Clocks, not wall clocks — wall clocks can go backward on NTP corrections).
  • RPO (Recovery Point Objective) = maximum acceptable data loss = replication lag at time of failure. RTO (Recovery Time Objective) = maximum acceptable downtime. Design failover automation to meet both SLAs.
  • Fencing (STONITH) prevents split-brain: before promoting a replica to primary, ensure the old primary cannot accept writes — revoke its network access, invalidate its lease, or physically power it off.

Multi-region replication and global database design is discussed in Amazon system design interview questions.

Multi-region replication and global availability design is covered in Netflix system design interview preparation.

Multi-region replication and global data distribution design is discussed in Airbnb system design interview guide.

See also: Netflix Interview Guide 2026: Streaming Architecture, Recommendation Systems, and Engineering Excellence

See also: Databricks Interview Guide 2026: Spark Internals, Delta Lake, and Lakehouse Architecture

Scroll to Top