Cross-Region Replication Low-Level Design: Async Replication, Conflict Resolution, and Failover

Why Cross-Region Replication?

Cross-region replication serves three primary goals: disaster recovery (survive a full regional outage), lower read latency for globally distributed users (reads served from nearest region), and data residency compliance (keep certain data within specific geographic boundaries).

Each goal has different requirements. DR prioritizes failover automation and RPO/RTO targets. Latency optimization prioritizes routing reads to the nearest healthy replica. Residency compliance prioritizes data filtering and audit trails.

Async Replication

In asynchronous replication, the primary region commits writes locally and returns success to the client before data is replicated to secondary regions. Replication happens in the background over a dedicated channel.

The replication lag — the time between a commit on the primary and its application on the secondary — is typically seconds to minutes depending on network bandwidth, primary write volume, and secondary apply throughput. Under heavy write load, lag can grow to hours if the secondary cannot keep up.

Replication Channel

Two common mechanisms for shipping changes to secondaries:

  • WAL streaming: the DB engine streams its write-ahead log to the secondary, which applies log records in order. Used by PostgreSQL streaming replication. Ordering is guaranteed; schema must match.
  • CDC via Kafka: a CDC connector captures row-level changes from the primary and publishes them to a Kafka topic. The secondary consumes the topic and applies changes. More flexible — secondary can have a different schema or even a different DB engine.
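
The CDC path can be sketched as an apply loop that consumes row-level change events in offset order. The event shape (`offset`, `op`, `key`, `value`) and the dict-backed store are illustrative assumptions, not tied to any particular CDC connector:

```python
# Minimal sketch of a CDC apply loop: consume row-level change events and
# apply them to a secondary store in offset order. Field names are
# illustrative; a real consumer would read from a Kafka topic partition.

def apply_changes(events, store):
    """Apply change events to a dict-based store, ordered by offset."""
    for event in sorted(events, key=lambda e: e["offset"]):
        key = event["key"]
        if event["op"] == "upsert":
            store[key] = event["value"]
        elif event["op"] == "delete":
            store.pop(key, None)
    return store

store = apply_changes(
    [
        {"offset": 1, "op": "upsert", "key": "u1", "value": {"name": "Ada"}},
        {"offset": 3, "op": "delete", "key": "u2"},
        {"offset": 2, "op": "upsert", "key": "u2", "value": {"name": "Bob"}},
    ],
    {},
)
# u2 ends up deleted: the delete at offset 3 is applied after the upsert at offset 2
```

Ordering by offset is what makes the apply idempotent-friendly: replaying the same event stream always converges to the same store state.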

Replication Lag Monitoring

Lag is monitored by comparing positions:

  • WAL-based: lag_bytes = primary_lsn - secondary_apply_lsn. Convert to seconds using write rate.
  • CDC-based: lag_seconds = primary_event_timestamp - latest_applied_event_timestamp.

Alert when lag exceeds the RPO target. A replication channel record tracks current lag per region pair and triggers alerts or automated failover preparation when thresholds are crossed.
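
The byte-to-seconds conversion above can be sketched as a small helper (the write-rate parameter is an assumption; in practice it would be a recent moving average of WAL bytes written per second):

```python
def lag_seconds_from_lsn(primary_lsn, secondary_apply_lsn, write_rate_bytes_per_sec):
    """Estimate replication lag in seconds from LSN positions.

    lag_bytes = primary_lsn - secondary_apply_lsn; dividing by the recent
    write rate converts bytes of unapplied WAL into wall-clock seconds.
    """
    lag_bytes = primary_lsn - secondary_apply_lsn
    if write_rate_bytes_per_sec <= 0:
        return 0.0
    return lag_bytes / write_rate_bytes_per_sec

# 50 MB of unapplied WAL at a 10 MB/s write rate is roughly 5 s of lag
lag = lag_seconds_from_lsn(150_000_000, 100_000_000, 10_000_000)
```

Note the write-rate estimate matters: the same lag_bytes represents more wall-clock lag when the primary slows down, so most systems track both the byte and time views.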

Read Routing

Reads are routed to the geographically nearest region. Staleness is acceptable for most reads (user profiles, product catalog). Writes always go to the primary region. Applications that require read-your-own-writes consistency must either route reads to the primary or implement a write-propagation wait before redirecting to the local secondary.
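
The routing decision can be sketched as a pure function. This assumes the application tracks the LSN of each user's last write (e.g. in a session token), which is one common way to implement read-your-own-writes; the function and field names are hypothetical:

```python
def route_read(user_last_write_lsn, secondary_apply_lsn, needs_rywr):
    """Pick a region for a read.

    If the caller needs read-your-own-writes and the nearest secondary has
    not yet applied the user's last write, fall back to the primary.
    Otherwise serve from the nearest (possibly stale) secondary.
    """
    if needs_rywr and secondary_apply_lsn < user_last_write_lsn:
        return "primary"
    return "nearest_secondary"
```

The alternative mentioned above, a write-propagation wait, keeps the same check but polls `secondary_apply_lsn` until it catches up instead of redirecting to the primary.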

Failover and Conflict on Promotion

When the primary region fails, the secondary is promoted to primary. The key risks are:


  • Committed but unreplicated transactions on the old primary are lost (RPO > 0).
  • If the old primary recovers later, it may have writes that the new primary does not — a split-brain scenario.

Mitigation: on promotion, the new primary records the highest applied LSN/offset. When the old primary rejoins, it is demoted to secondary and fast-forwarded past its unreplicated writes (which are discarded or reconciled). An automated fencing mechanism (revoke old primary's write credentials or update DNS before completing promotion) prevents simultaneous dual-primary writes.
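
The rejoin reconciliation above can be sketched as a pure function over LSNs. This is a simplified model: real engines compare WAL timelines rather than bare LSN lists (PostgreSQL's pg_rewind is one example of this mechanism):

```python
def reconcile_on_rejoin(old_primary_lsns, new_primary_max_applied_lsn):
    """Split the old primary's transactions into kept vs discarded on rejoin.

    Transactions at or below the new primary's highest applied LSN were
    replicated before failover and are kept; anything above was never
    shipped and is discarded (or routed to a manual reconciliation queue).
    """
    kept = [lsn for lsn in old_primary_lsns
            if lsn <= new_primary_max_applied_lsn]
    discarded = [lsn for lsn in old_primary_lsns
                 if lsn > new_primary_max_applied_lsn]
    return kept, discarded
```

The size of `discarded` is exactly the data loss the FailoverEvent record's data_loss_bytes column is meant to quantify.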

Post-Failover Sync

After failover, the original primary (now rejoining as secondary) must be re-synchronized. It streams from the new primary starting at the highest common LSN. A sync_post_failover job tracks the catch-up progress and marks the rejoined region as a healthy secondary once it falls within lag thresholds.

Replication Filters

Not all tables need to be replicated. Filters exclude:

  • Temporary or session tables that are irrelevant to secondaries.
  • High-volume tables (e.g., audit logs, metrics) where data loss on failover is acceptable and replication overhead is undesirable.
  • Tables with data that must stay in the primary region for compliance reasons.
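
These three filter rules can be sketched as a single predicate evaluated per change event before it is shipped. The table names and prefixes below are illustrative placeholders, not part of the design:

```python
# Illustrative filter configuration; names are placeholders.
REPLICATION_EXCLUDE_PREFIXES = ("tmp_", "session_")    # temporary/session tables
REPLICATION_EXCLUDE_TABLES = {"audit_log", "metrics"}  # high-volume, loss-tolerable
RESIDENCY_PINNED_TABLES = {"eu_user_pii"}              # must stay in the primary region

def should_replicate(table_name):
    """Return True if changes to this table should be shipped to secondaries."""
    if table_name.startswith(REPLICATION_EXCLUDE_PREFIXES):
        return False
    if table_name in REPLICATION_EXCLUDE_TABLES:
        return False
    if table_name in RESIDENCY_PINNED_TABLES:
        return False
    return True
```

Applying the filter at the replication channel (rather than at the secondary) also saves the cross-region bandwidth the excluded tables would otherwise consume.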

SQL Schema

CREATE TABLE ReplicationChannel (
    id               SERIAL PRIMARY KEY,
    primary_region   TEXT NOT NULL,
    secondary_region TEXT NOT NULL,
    status           TEXT NOT NULL DEFAULT 'active' CHECK (status IN ('active','lagging','failed','rebuilding')),
    apply_lsn        BIGINT NOT NULL DEFAULT 0,
    primary_lsn      BIGINT NOT NULL DEFAULT 0,
    lag_seconds      INT NOT NULL DEFAULT 0,
    last_seen_at     TIMESTAMPTZ NOT NULL DEFAULT now(),
    UNIQUE (primary_region, secondary_region)
);

CREATE TABLE FailoverEvent (
    id              SERIAL PRIMARY KEY,
    from_region     TEXT NOT NULL,
    to_region       TEXT NOT NULL,
    trigger_reason  TEXT NOT NULL,
    started_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    completed_at    TIMESTAMPTZ,
    data_loss_bytes BIGINT
);

CREATE INDEX idx_channel_status ON ReplicationChannel (status, lag_seconds);

Python Implementation Sketch

import time

class CrossRegionReplicationManager:
    def __init__(self, db, dns_provider):
        self.db = db
        self.dns = dns_provider
        self.lag_alert_threshold = 60   # seconds
        self.failover_lag_threshold = 300  # auto-failover only if primary unreachable and lag < 300s

    def monitor_lag(self, channel_id: int) -> int:
        channel = self.db.fetchone("SELECT * FROM ReplicationChannel WHERE id = %s", (channel_id,))
        lag = channel['lag_seconds']
        if lag > self.lag_alert_threshold:
            self._alert(f"Replication lag {lag}s exceeds threshold on channel {channel_id}")
        if lag > self.lag_alert_threshold * 2:
            self.db.execute(
                "UPDATE ReplicationChannel SET status = 'lagging' WHERE id = %s", (channel_id,)
            )
        return lag

    def trigger_failover(self, primary_region: str, secondary_region: str) -> int:
        event_id = self.db.fetchone(
            "INSERT INTO FailoverEvent (from_region, to_region, trigger_reason) VALUES (%s, %s, %s) RETURNING id",
            (primary_region, secondary_region, 'primary_unreachable')
        )['id']
        # Fence old primary: revoke write credentials or update security group
        self._fence_primary(primary_region)
        # Promote secondary
        self._promote_secondary(secondary_region)
        # Update DNS to point to new primary
        self.dns.update_record('db.primary.example.com', secondary_region)
        # Update channel metadata
        self.db.execute(
            "UPDATE ReplicationChannel SET status = 'active', primary_region = %s, secondary_region = %s WHERE primary_region = %s AND secondary_region = %s",
            (secondary_region, primary_region, primary_region, secondary_region)
        )
        self.db.execute(
            "UPDATE FailoverEvent SET completed_at = now() WHERE id = %s", (event_id,)
        )
        return event_id

    def sync_post_failover(self, rejoining_region: str):
        # Find the channel where rejoining_region is now the secondary
        channel = self.db.fetchone(
            "SELECT * FROM ReplicationChannel WHERE secondary_region = %s", (rejoining_region,)
        )
        # Stream from new primary starting at channel apply_lsn
        # Monitor until lag falls within threshold
        while True:
            lag = self.monitor_lag(channel['id'])
            if lag < self.lag_alert_threshold:
                self.db.execute(
                    "UPDATE ReplicationChannel SET status = 'active' WHERE id = %s", (channel['id'],)
                )
                break
            time.sleep(10)

    def _fence_primary(self, region: str):
        pass  # Revoke write access: update firewall rules or rotate credentials

    def _promote_secondary(self, region: str):
        pass  # Execute promotion command on secondary DB: pg_promote() or equivalent

    def _alert(self, message: str):
        print(f"ALERT: {message}")

Frequently Asked Questions

What is the difference between async and sync cross-region replication?

Async replication acknowledges writes to the client before they are applied to secondaries. This gives lower write latency but RPO > 0: if the primary fails, the most recent committed-but-unreplicated writes are lost. Sync replication waits for at least one secondary to confirm the write before acknowledging the client. This gives RPO = 0 but increases write latency by the round-trip time to the secondary, which is typically 10-100 ms for cross-region links.

What is the difference between RPO and RTO in cross-region replication?

RPO (Recovery Point Objective) is the maximum acceptable data loss measured in time: how old the most recent recoverable data can be. In async replication, RPO equals the replication lag at the time of failure. RTO (Recovery Time Objective) is the maximum acceptable downtime: how long the failover process takes. Automated failover with pre-warmed secondaries can achieve RTO of seconds to minutes. RPO and RTO are independent: you can have low RTO (fast failover) but high RPO (significant data loss) with async replication.

How are conflicts handled when a failed primary rejoins after failover?

The old primary is fenced (write credentials revoked, firewall updated, or DNS record removed) before the new primary accepts writes. This prevents dual-primary writes. When the old primary rejoins as a secondary, it identifies the highest LSN that was successfully applied to the new primary. Any transactions above that LSN on the old primary are discarded (they were never replicated). The old primary then replicates forward from the new primary to catch up.

What DNS TTL is appropriate for cross-region failover?

DNS TTL should be set low enough that clients pick up the updated record quickly after failover, but not so low that DNS infrastructure is overwhelmed. A TTL of 30-60 seconds is a common compromise for database endpoints. Some teams pre-lower the TTL (e.g., to 30 s) during planned maintenance windows. For emergency failover with a high TTL, clients that cached the old record will continue routing to the failed primary until the TTL expires, extending the effective RTO.

How is async replication lag measured and bounded?

Replication lag is measured as the difference between the primary's current log sequence number (LSN) and the highest LSN confirmed applied at each replica, expressed in both bytes and wall-clock time. Lag is bounded by setting an alert threshold and by using semi-synchronous replication for critical data paths, requiring at least one remote replica to acknowledge a write before the primary commits it.

How are write conflicts resolved in cross-region replication?

Conflict resolution strategies include last-write-wins (LWW) using hybrid logical clocks (HLC) to establish causal ordering, application-level merge functions (e.g., CRDTs for counters and sets), or routing all writes for a given key to a designated home region to eliminate conflicts by design. Systems that allow multi-master writes record both conflicting versions and surface them to application-layer conflict resolvers rather than silently discarding data.

How does regional failover work?

On detecting primary-region unavailability (via health checks or failure detectors), a global traffic manager (e.g., Route 53 with latency-based routing or a global load balancer) updates DNS to point traffic to a secondary region that has been promoted to primary. The promotion process involves fencing the old primary (STONITH or lease expiry) to prevent split-brain, then advancing the replica to primary state and resuming writes.

How is RPO enforced during region failure?

RPO is enforced by combining an asynchronous replication lag SLA with a synchronous commit requirement for data classified as critical: writes are only acknowledged to the client after at least one cross-region replica confirms durability, ensuring zero data loss on failover for those writes. For less critical data, RPO is bounded by continuous monitoring of replication lag, triggering alerts or throttling writes when lag exceeds the RPO target.
