Cross-Region Replication Low-Level Design: Async Replication, Conflict Resolution, and Failover

Why Cross-Region Replication?

Cross-region replication serves three primary goals: disaster recovery (survive a full regional outage), lower read latency for globally distributed users (reads served from nearest region), and data residency compliance (keep certain data within specific geographic boundaries).

Each goal has different requirements. DR prioritizes failover automation and RPO/RTO targets. Latency optimization prioritizes routing reads to the nearest healthy replica. Residency compliance prioritizes data filtering and audit trails.

Async Replication

In asynchronous replication, the primary region commits writes locally and returns success to the client before data is replicated to secondary regions. Replication happens in the background over a dedicated channel.

The replication lag — the time between a commit on the primary and its application on the secondary — is typically seconds to minutes depending on network bandwidth, primary write volume, and secondary apply throughput. Under heavy write load, lag can grow to hours if the secondary cannot keep up.

Replication Channel

Two common mechanisms for shipping changes to secondaries:

  • WAL streaming: the DB engine streams its write-ahead log to the secondary, which applies log records in order. Used by PostgreSQL streaming replication. Ordering is guaranteed; schema must match.
  • CDC via Kafka: a CDC connector captures row-level changes from the primary and publishes them to a Kafka topic. The secondary consumes the topic and applies changes. More flexible — secondary can have a different schema or even a different DB engine.
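
The CDC path can be sketched as an apply loop that consumes row-level change events in offset order. The event shape (`offset`, `op`, `key`, `value`) and the dict-backed store are illustrative assumptions, not tied to any particular CDC connector:

```python
# Minimal sketch of a CDC apply loop: consume row-level change events and
# apply them to a secondary store in offset order. Field names are
# illustrative; a real consumer would read from a Kafka topic partition.

def apply_changes(events, store):
    """Apply change events to a dict-based store, ordered by offset."""
    for event in sorted(events, key=lambda e: e["offset"]):
        key = event["key"]
        if event["op"] == "upsert":
            store[key] = event["value"]
        elif event["op"] == "delete":
            store.pop(key, None)
    return store

store = apply_changes(
    [
        {"offset": 1, "op": "upsert", "key": "u1", "value": {"name": "Ada"}},
        {"offset": 3, "op": "delete", "key": "u2"},
        {"offset": 2, "op": "upsert", "key": "u2", "value": {"name": "Bob"}},
    ],
    {},
)
# u2 ends up deleted: the delete at offset 3 is applied after the upsert at offset 2
```

Ordering by offset is what makes the apply idempotent-friendly: replaying the same event stream always converges to the same store state.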

Replication Lag Monitoring

Lag is monitored by comparing positions:

  • WAL-based: lag_bytes = primary_lsn - secondary_apply_lsn. Convert to seconds using write rate.
  • CDC-based: lag_seconds = primary_event_timestamp - latest_applied_event_timestamp.

Alert when lag exceeds the RPO target. A replication channel record tracks current lag per region pair and triggers alerts or automated failover preparation when thresholds are crossed.
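
The byte-to-seconds conversion above can be sketched as a small helper (the write-rate parameter is an assumption; in practice it would be a recent moving average of WAL bytes written per second):

```python
def lag_seconds_from_lsn(primary_lsn, secondary_apply_lsn, write_rate_bytes_per_sec):
    """Estimate replication lag in seconds from LSN positions.

    lag_bytes = primary_lsn - secondary_apply_lsn; dividing by the recent
    write rate converts bytes of unapplied WAL into wall-clock seconds.
    """
    lag_bytes = primary_lsn - secondary_apply_lsn
    if write_rate_bytes_per_sec <= 0:
        return 0.0
    return lag_bytes / write_rate_bytes_per_sec

# 50 MB of unapplied WAL at a 10 MB/s write rate is roughly 5 s of lag
lag = lag_seconds_from_lsn(150_000_000, 100_000_000, 10_000_000)
```

Note the write-rate estimate matters: the same lag_bytes represents more wall-clock lag when the primary slows down, so most systems track both the byte and time views.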

Read Routing

Reads are routed to the geographically nearest region. Staleness is acceptable for most reads (user profiles, product catalog). Writes always go to the primary region. Applications that require read-your-own-writes consistency must either route reads to the primary or implement a write-propagation wait before redirecting to the local secondary.
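
The routing decision can be sketched as a pure function. This assumes the application tracks the LSN of each user's last write (e.g. in a session token), which is one common way to implement read-your-own-writes; the function and field names are hypothetical:

```python
def route_read(user_last_write_lsn, secondary_apply_lsn, needs_rywr):
    """Pick a region for a read.

    If the caller needs read-your-own-writes and the nearest secondary has
    not yet applied the user's last write, fall back to the primary.
    Otherwise serve from the nearest (possibly stale) secondary.
    """
    if needs_rywr and secondary_apply_lsn < user_last_write_lsn:
        return "primary"
    return "nearest_secondary"
```

The alternative mentioned above, a write-propagation wait, keeps the same check but polls `secondary_apply_lsn` until it catches up instead of redirecting to the primary.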

Failover and Conflict on Promotion

When the primary region fails, the secondary is promoted to primary. The key risks are:


  • Committed but unreplicated transactions on the old primary are lost (RPO > 0).
  • If the old primary recovers later, it may have writes that the new primary does not — a split-brain scenario.

Mitigation: on promotion, the new primary records the highest applied LSN/offset. When the old primary rejoins, it is demoted to secondary and fast-forwarded past its unreplicated writes (which are discarded or reconciled). An automated fencing mechanism (revoke old primary's write credentials or update DNS before completing promotion) prevents simultaneous dual-primary writes.
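
The rejoin reconciliation above can be sketched as a pure function over LSNs. This is a simplified model: real engines compare WAL timelines rather than bare LSN lists (PostgreSQL's pg_rewind is one example of this mechanism):

```python
def reconcile_on_rejoin(old_primary_lsns, new_primary_max_applied_lsn):
    """Split the old primary's transactions into kept vs discarded on rejoin.

    Transactions at or below the new primary's highest applied LSN were
    replicated before failover and are kept; anything above was never
    shipped and is discarded (or routed to a manual reconciliation queue).
    """
    kept = [lsn for lsn in old_primary_lsns
            if lsn <= new_primary_max_applied_lsn]
    discarded = [lsn for lsn in old_primary_lsns
                 if lsn > new_primary_max_applied_lsn]
    return kept, discarded
```

The size of `discarded` is exactly the data loss the FailoverEvent record's data_loss_bytes column is meant to quantify.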

Post-Failover Sync

After failover, the original primary (now rejoining as secondary) must be re-synchronized. It streams from the new primary starting at the highest common LSN. A sync_post_failover job tracks the catch-up progress and marks the rejoined region as a healthy secondary once it falls within lag thresholds.

Replication Filters

Not all tables need to be replicated. Filters exclude:

  • Temporary or session tables that are irrelevant to secondaries.
  • High-volume tables (e.g., audit logs, metrics) where data loss on failover is acceptable and replication overhead is undesirable.
  • Tables with data that must stay in the primary region for compliance reasons.
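
These three filter rules can be sketched as a single predicate evaluated per change event before it is shipped. The table names and prefixes below are illustrative placeholders, not part of the design:

```python
# Illustrative filter configuration; names are placeholders.
REPLICATION_EXCLUDE_PREFIXES = ("tmp_", "session_")    # temporary/session tables
REPLICATION_EXCLUDE_TABLES = {"audit_log", "metrics"}  # high-volume, loss-tolerable
RESIDENCY_PINNED_TABLES = {"eu_user_pii"}              # must stay in the primary region

def should_replicate(table_name):
    """Return True if changes to this table should be shipped to secondaries."""
    if table_name.startswith(REPLICATION_EXCLUDE_PREFIXES):
        return False
    if table_name in REPLICATION_EXCLUDE_TABLES:
        return False
    if table_name in RESIDENCY_PINNED_TABLES:
        return False
    return True
```

Applying the filter at the replication channel (rather than at the secondary) also saves the cross-region bandwidth the excluded tables would otherwise consume.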

SQL Schema

CREATE TABLE ReplicationChannel (
    id               SERIAL PRIMARY KEY,
    primary_region   TEXT NOT NULL,
    secondary_region TEXT NOT NULL,
    status           TEXT NOT NULL DEFAULT 'active' CHECK (status IN ('active','lagging','failed','rebuilding')),
    apply_lsn        BIGINT NOT NULL DEFAULT 0,
    primary_lsn      BIGINT NOT NULL DEFAULT 0,
    lag_seconds      INT NOT NULL DEFAULT 0,
    last_seen_at     TIMESTAMPTZ NOT NULL DEFAULT now(),
    UNIQUE (primary_region, secondary_region)
);

CREATE TABLE FailoverEvent (
    id              SERIAL PRIMARY KEY,
    from_region     TEXT NOT NULL,
    to_region       TEXT NOT NULL,
    trigger_reason  TEXT NOT NULL,
    started_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    completed_at    TIMESTAMPTZ,
    data_loss_bytes BIGINT
);

CREATE INDEX idx_channel_status ON ReplicationChannel (status, lag_seconds);

Python Implementation Sketch

import time

class CrossRegionReplicationManager:
    def __init__(self, db, dns_provider):
        self.db = db
        self.dns = dns_provider
        self.lag_alert_threshold = 60   # seconds
        self.failover_lag_threshold = 300  # auto-failover only if primary unreachable and lag < 300s

    def monitor_lag(self, channel_id: int) -> int:
        channel = self.db.fetchone("SELECT * FROM ReplicationChannel WHERE id = %s", (channel_id,))
        lag = channel['lag_seconds']
        if lag > self.lag_alert_threshold:
            self._alert(f"Replication lag {lag}s exceeds threshold on channel {channel_id}")
        if lag > self.lag_alert_threshold * 2:
            self.db.execute(
                "UPDATE ReplicationChannel SET status = 'lagging' WHERE id = %s", (channel_id,)
            )
        return lag

    def trigger_failover(self, primary_region: str, secondary_region: str) -> int:
        event_id = self.db.fetchone(
            "INSERT INTO FailoverEvent (from_region, to_region, trigger_reason) VALUES (%s, %s, %s) RETURNING id",
            (primary_region, secondary_region, 'primary_unreachable')
        )['id']
        # Fence old primary: revoke write credentials or update security group
        self._fence_primary(primary_region)
        # Promote secondary
        self._promote_secondary(secondary_region)
        # Update DNS to point to new primary
        self.dns.update_record('db.primary.example.com', secondary_region)
        # Update channel metadata
        self.db.execute(
            "UPDATE ReplicationChannel SET status = 'active', primary_region = %s, secondary_region = %s WHERE primary_region = %s AND secondary_region = %s",
            (secondary_region, primary_region, primary_region, secondary_region)
        )
        self.db.execute(
            "UPDATE FailoverEvent SET completed_at = now() WHERE id = %s", (event_id,)
        )
        return event_id

    def sync_post_failover(self, rejoining_region: str):
        # Find the channel where rejoining_region is now the secondary
        channel = self.db.fetchone(
            "SELECT * FROM ReplicationChannel WHERE secondary_region = %s", (rejoining_region,)
        )
        # Stream from new primary starting at channel apply_lsn
        # Monitor until lag falls within threshold
        while True:
            lag = self.monitor_lag(channel['id'])
            if lag < self.lag_alert_threshold:
                self.db.execute(
                    "UPDATE ReplicationChannel SET status = 'active' WHERE id = %s", (channel['id'],)
                )
                break
            time.sleep(10)

    def _fence_primary(self, region: str):
        pass  # Revoke write access: update firewall rules or rotate credentials

    def _promote_secondary(self, region: str):
        pass  # Execute promotion command on secondary DB: pg_promote() or equivalent

    def _alert(self, message: str):
        print(f"ALERT: {message}")

Frequently Asked Questions

What is the difference between async and sync cross-region replication?

Async replication acknowledges writes to the client before they are applied to secondaries. This gives lower write latency but RPO > 0: if the primary fails, the most recent committed-but-unreplicated writes are lost. Sync replication waits for at least one secondary to confirm the write before acknowledging the client. This gives RPO = 0 but increases write latency by the round-trip time to the secondary, which is typically 10-100 ms for cross-region links.

What is the difference between RPO and RTO in cross-region replication?

RPO (Recovery Point Objective) is the maximum acceptable data loss measured in time: how old the most recent recoverable data can be. In async replication, RPO equals the replication lag at the time of failure. RTO (Recovery Time Objective) is the maximum acceptable downtime: how long the failover process takes. Automated failover with pre-warmed secondaries can achieve RTO of seconds to minutes. RPO and RTO are independent: you can have low RTO (fast failover) but high RPO (significant data loss) with async replication.

How are conflicts handled when a failed primary rejoins after failover?

The old primary is fenced (write credentials revoked, firewall updated, or DNS record removed) before the new primary accepts writes. This prevents dual-primary writes. When the old primary rejoins as a secondary, it identifies the highest LSN that was successfully applied to the new primary. Any transactions above that LSN on the old primary are discarded (they were never replicated). The old primary then replicates forward from the new primary to catch up.

What DNS TTL is appropriate for cross-region failover?

DNS TTL should be set low enough that clients pick up the updated record quickly after failover, but not so low that DNS infrastructure is overwhelmed. A TTL of 30-60 seconds is a common compromise for database endpoints. Some teams pre-lower the TTL (e.g., to 30 s) during planned maintenance windows. For emergency failover with a high TTL, clients that cached the old record will continue routing to the failed primary until the TTL expires, extending the effective RTO.

How is async replication lag measured and bounded?

Replication lag is measured as the difference between the primary's current log sequence number (LSN) and the highest LSN confirmed applied at each replica, expressed in both bytes and wall-clock time. Lag is bounded by setting an alert threshold and by using semi-synchronous replication for critical data paths, requiring at least one remote replica to acknowledge a write before the primary commits it.

How are write conflicts resolved in cross-region replication?

Conflict resolution strategies include last-write-wins (LWW) using hybrid logical clocks (HLC) to establish causal ordering, application-level merge functions (e.g., CRDTs for counters and sets), or routing all writes for a given key to a designated home region to eliminate conflicts by design. Systems that allow multi-master writes record both conflicting versions and surface them to application-layer conflict resolvers rather than silently discarding data.

How does regional failover work?

On detecting primary-region unavailability (via health checks or failure detectors), a global traffic manager (e.g., Route 53 with latency-based routing or a global load balancer) updates DNS to point traffic to a secondary region that has been promoted to primary. The promotion process involves fencing the old primary (STONITH or lease expiry) to prevent split-brain, then advancing the replica to primary state and resuming writes.

How is RPO enforced during region failure?

RPO is enforced by combining an asynchronous replication lag SLA with a synchronous commit requirement for data classified as critical: writes are only acknowledged to the client after at least one cross-region replica confirms durability, ensuring zero data loss on failover for those writes. For less critical data, RPO is bounded by continuous monitoring of replication lag, triggering alerts or throttling writes when lag exceeds the RPO target.
