Why Cross-Region Replication?
Cross-region replication serves three primary goals: disaster recovery (survive a full regional outage), lower read latency for globally distributed users (reads served from nearest region), and data residency compliance (keep certain data within specific geographic boundaries).
Each goal has different requirements. DR prioritizes failover automation and RPO/RTO targets. Latency optimization prioritizes routing reads to the nearest healthy replica. Residency compliance prioritizes data filtering and audit trails.
Async Replication
In asynchronous replication, the primary region commits writes locally and returns success to the client before data is replicated to secondary regions. Replication happens in the background over a dedicated channel.
The replication lag — the time between a commit on the primary and its application on the secondary — is typically seconds to minutes depending on network bandwidth, primary write volume, and secondary apply throughput. Under heavy write load, lag can grow to hours if the secondary cannot keep up.
Replication Channel
Two common mechanisms for shipping changes to secondaries:
- WAL streaming: the DB engine streams its write-ahead log to the secondary, which applies log records in order. Used by PostgreSQL streaming replication. Ordering is guaranteed; schema must match.
- CDC via Kafka: a CDC connector captures row-level changes from the primary and publishes them to a Kafka topic. The secondary consumes the topic and applies changes. More flexible — secondary can have a different schema or even a different DB engine.
Replication Lag Monitoring
Lag is monitored by comparing positions:
- WAL-based:
lag_bytes = primary_lsn - secondary_apply_lsn. Convert to seconds using write rate. - CDC-based:
lag_seconds = primary_event_timestamp - latest_applied_event_timestamp.
Alert when lag exceeds the RPO target. A replication channel record tracks current lag per region pair and triggers alerts or automated failover preparation when thresholds are crossed.
Read Routing
Reads are routed to the geographically nearest region. Staleness is acceptable for most reads (user profiles, product catalog). Writes always go to the primary region. Applications that require read-your-own-writes consistency must either route reads to the primary or implement a write-propagation wait before redirecting to the local secondary.
Failover and Conflict on Promotion
When the primary region fails, the secondary is promoted to primary. The key risk:
- Committed but unreplicated transactions on the old primary are lost (RPO > 0).
- If the old primary recovers later, it may have writes that the new primary does not — a split-brain scenario.
Mitigation: on promotion, the new primary records the highest applied LSN/offset. When the old primary rejoins, it is demoted to secondary and fast-forwarded past its unreplicated writes (which are discarded or reconciled). An automated fencing mechanism (revoke old primary's write credentials or update DNS before completing promotion) prevents simultaneous dual-primary writes.
Post-Failover Sync
After failover, the original primary (now rejoining as secondary) must be re-synchronized. It streams from the new primary starting at the highest common LSN. A sync_post_failover job tracks the catch-up progress and marks the rejoined region as a healthy secondary once it falls within lag thresholds.
Replication Filters
Not all tables need to be replicated. Filters exclude:
- Temporary or session tables that are irrelevant to secondaries.
- High-volume tables (e.g., audit logs, metrics) where data loss on failover is acceptable and replication overhead is undesirable.
- Tables with data that must stay in the primary region for compliance reasons.
SQL Schema
CREATE TABLE ReplicationChannel (
id SERIAL PRIMARY KEY,
primary_region TEXT NOT NULL,
secondary_region TEXT NOT NULL,
status TEXT NOT NULL DEFAULT 'active' CHECK (status IN ('active','lagging','failed','rebuilding')),
apply_lsn BIGINT NOT NULL DEFAULT 0,
primary_lsn BIGINT NOT NULL DEFAULT 0,
lag_seconds INT NOT NULL DEFAULT 0,
last_seen_at TIMESTAMPTZ NOT NULL DEFAULT now(),
UNIQUE (primary_region, secondary_region)
);
CREATE TABLE FailoverEvent (
id SERIAL PRIMARY KEY,
from_region TEXT NOT NULL,
to_region TEXT NOT NULL,
trigger_reason TEXT NOT NULL,
started_at TIMESTAMPTZ NOT NULL DEFAULT now(),
completed_at TIMESTAMPTZ,
data_loss_bytes BIGINT
);
CREATE INDEX idx_channel_status ON ReplicationChannel (status, lag_seconds);
Python Implementation Sketch
import time
class CrossRegionReplicationManager:
def __init__(self, db, dns_provider):
self.db = db
self.dns = dns_provider
self.lag_alert_threshold = 60 # seconds
self.failover_lag_threshold = 300 # auto-failover if primary unreachable and lag int:
channel = self.db.fetchone("SELECT * FROM ReplicationChannel WHERE id = %s", (channel_id,))
lag = channel['lag_seconds']
if lag > self.lag_alert_threshold:
self._alert(f"Replication lag {lag}s exceeds threshold on channel {channel_id}")
if lag > self.lag_alert_threshold * 2:
self.db.execute(
"UPDATE ReplicationChannel SET status = 'lagging' WHERE id = %s", (channel_id,)
)
return lag
def trigger_failover(self, primary_region: str, secondary_region: str) -> int:
event_id = self.db.fetchone(
"INSERT INTO FailoverEvent (from_region, to_region, trigger_reason) VALUES (%s, %s, %s) RETURNING id",
(primary_region, secondary_region, 'primary_unreachable')
)['id']
# Fence old primary: revoke write credentials or update security group
self._fence_primary(primary_region)
# Promote secondary
self._promote_secondary(secondary_region)
# Update DNS to point to new primary
self.dns.update_record('db.primary.example.com', secondary_region)
# Update channel metadata
self.db.execute(
"UPDATE ReplicationChannel SET status = 'active', primary_region = %s, secondary_region = %s WHERE primary_region = %s AND secondary_region = %s",
(secondary_region, primary_region, primary_region, secondary_region)
)
self.db.execute(
"UPDATE FailoverEvent SET completed_at = now() WHERE id = %s", (event_id,)
)
return event_id
def sync_post_failover(self, rejoining_region: str):
# Find the channel where rejoining_region is now the secondary
channel = self.db.fetchone(
"SELECT * FROM ReplicationChannel WHERE secondary_region = %s", (rejoining_region,)
)
# Stream from new primary starting at channel apply_lsn
# Monitor until lag falls within threshold
while True:
lag = self.monitor_lag(channel['id'])
if lag < self.lag_alert_threshold:
self.db.execute(
"UPDATE ReplicationChannel SET status = 'active' WHERE id = %s", (channel['id'],)
)
break
time.sleep(10)
def _fence_primary(self, region: str):
pass # Revoke write access: update firewall rules or rotate credentials
def _promote_secondary(self, region: str):
pass # Execute promotion command on secondary DB: pg_promote() or equivalent
def _alert(self, message: str):
print(f"ALERT: {message}")
See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering
See also: Anthropic Interview Guide 2026: Process, Questions, and AI Safety