Cross-Region Replication Low-Level Design: Async Replication, Conflict Resolution, and Failover

Why Cross-Region Replication?

Cross-region replication serves three primary goals: disaster recovery (survive a full regional outage), lower read latency for globally distributed users (reads served from nearest region), and data residency compliance (keep certain data within specific geographic boundaries).

Each goal has different requirements. DR prioritizes failover automation and RPO/RTO targets. Latency optimization prioritizes routing reads to the nearest healthy replica. Residency compliance prioritizes data filtering and audit trails.

Async Replication

In asynchronous replication, the primary region commits writes locally and returns success to the client before data is replicated to secondary regions. Replication happens in the background over a dedicated channel.

The replication lag — the time between a commit on the primary and its application on the secondary — is typically seconds to minutes depending on network bandwidth, primary write volume, and secondary apply throughput. Under heavy write load, lag can grow to hours if the secondary cannot keep up.

Replication Channel

Two common mechanisms for shipping changes to secondaries:

  • WAL streaming: the DB engine streams its write-ahead log to the secondary, which applies log records in order. Used by PostgreSQL streaming replication. Ordering is guaranteed; schema must match.
  • CDC via Kafka: a CDC connector captures row-level changes from the primary and publishes them to a Kafka topic. The secondary consumes the topic and applies changes. More flexible — secondary can have a different schema or even a different DB engine.

Replication Lag Monitoring

Lag is monitored by comparing positions:

  • WAL-based: lag_bytes = primary_lsn - secondary_apply_lsn. Convert to seconds using write rate.
  • CDC-based: lag_seconds = primary_event_timestamp - latest_applied_event_timestamp.

Alert when lag exceeds the RPO target. A replication channel record tracks current lag per region pair and triggers alerts or automated failover preparation when thresholds are crossed.

Read Routing

Reads are routed to the geographically nearest region. Staleness is acceptable for most reads (user profiles, product catalog). Writes always go to the primary region. Applications that require read-your-own-writes consistency must either route reads to the primary or implement a write-propagation wait before redirecting to the local secondary.

Failover and Conflict on Promotion

When the primary region fails, the secondary is promoted to primary. The key risk:

  • Committed but unreplicated transactions on the old primary are lost (RPO > 0).
  • If the old primary recovers later, it may have writes that the new primary does not — a split-brain scenario.

Mitigation: on promotion, the new primary records the highest applied LSN/offset. When the old primary rejoins, it is demoted to secondary and fast-forwarded past its unreplicated writes (which are discarded or reconciled). An automated fencing mechanism (revoke old primary's write credentials or update DNS before completing promotion) prevents simultaneous dual-primary writes.

Post-Failover Sync

After failover, the original primary (now rejoining as secondary) must be re-synchronized. It streams from the new primary starting at the highest common LSN. A sync_post_failover job tracks the catch-up progress and marks the rejoined region as a healthy secondary once it falls within lag thresholds.

Replication Filters

Not all tables need to be replicated. Filters exclude:

  • Temporary or session tables that are irrelevant to secondaries.
  • High-volume tables (e.g., audit logs, metrics) where data loss on failover is acceptable and replication overhead is undesirable.
  • Tables with data that must stay in the primary region for compliance reasons.

SQL Schema

CREATE TABLE ReplicationChannel (
    id               SERIAL PRIMARY KEY,
    primary_region   TEXT NOT NULL,
    secondary_region TEXT NOT NULL,
    status           TEXT NOT NULL DEFAULT 'active' CHECK (status IN ('active','lagging','failed','rebuilding')),
    apply_lsn        BIGINT NOT NULL DEFAULT 0,
    primary_lsn      BIGINT NOT NULL DEFAULT 0,
    lag_seconds      INT NOT NULL DEFAULT 0,
    last_seen_at     TIMESTAMPTZ NOT NULL DEFAULT now(),
    UNIQUE (primary_region, secondary_region)
);

CREATE TABLE FailoverEvent (
    id              SERIAL PRIMARY KEY,
    from_region     TEXT NOT NULL,
    to_region       TEXT NOT NULL,
    trigger_reason  TEXT NOT NULL,
    started_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    completed_at    TIMESTAMPTZ,
    data_loss_bytes BIGINT
);

CREATE INDEX idx_channel_status ON ReplicationChannel (status, lag_seconds);

Python Implementation Sketch

import time

class CrossRegionReplicationManager:
    def __init__(self, db, dns_provider):
        self.db = db
        self.dns = dns_provider
        self.lag_alert_threshold = 60   # seconds
        self.failover_lag_threshold = 300  # auto-failover if primary unreachable and lag  int:
        channel = self.db.fetchone("SELECT * FROM ReplicationChannel WHERE id = %s", (channel_id,))
        lag = channel['lag_seconds']
        if lag > self.lag_alert_threshold:
            self._alert(f"Replication lag {lag}s exceeds threshold on channel {channel_id}")
        if lag > self.lag_alert_threshold * 2:
            self.db.execute(
                "UPDATE ReplicationChannel SET status = 'lagging' WHERE id = %s", (channel_id,)
            )
        return lag

    def trigger_failover(self, primary_region: str, secondary_region: str) -> int:
        event_id = self.db.fetchone(
            "INSERT INTO FailoverEvent (from_region, to_region, trigger_reason) VALUES (%s, %s, %s) RETURNING id",
            (primary_region, secondary_region, 'primary_unreachable')
        )['id']
        # Fence old primary: revoke write credentials or update security group
        self._fence_primary(primary_region)
        # Promote secondary
        self._promote_secondary(secondary_region)
        # Update DNS to point to new primary
        self.dns.update_record('db.primary.example.com', secondary_region)
        # Update channel metadata
        self.db.execute(
            "UPDATE ReplicationChannel SET status = 'active', primary_region = %s, secondary_region = %s WHERE primary_region = %s AND secondary_region = %s",
            (secondary_region, primary_region, primary_region, secondary_region)
        )
        self.db.execute(
            "UPDATE FailoverEvent SET completed_at = now() WHERE id = %s", (event_id,)
        )
        return event_id

    def sync_post_failover(self, rejoining_region: str):
        # Find the channel where rejoining_region is now the secondary
        channel = self.db.fetchone(
            "SELECT * FROM ReplicationChannel WHERE secondary_region = %s", (rejoining_region,)
        )
        # Stream from new primary starting at channel apply_lsn
        # Monitor until lag falls within threshold
        while True:
            lag = self.monitor_lag(channel['id'])
            if lag < self.lag_alert_threshold:
                self.db.execute(
                    "UPDATE ReplicationChannel SET status = 'active' WHERE id = %s", (channel['id'],)
                )
                break
            time.sleep(10)

    def _fence_primary(self, region: str):
        pass  # Revoke write access: update firewall rules or rotate credentials

    def _promote_secondary(self, region: str):
        pass  # Execute promotion command on secondary DB: pg_promote() or equivalent

    def _alert(self, message: str):
        print(f"ALERT: {message}")

See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering

See also: Netflix Interview Guide 2026: Streaming Architecture, Recommendation Systems, and Engineering Excellence

See also: Anthropic Interview Guide 2026: Process, Questions, and AI Safety

Scroll to Top