Cross-region failover reroutes traffic from a failed primary region to a healthy secondary region. The failover must be fast (under 30 seconds for automated failover), complete (all traffic moves, not just a subset), and safe (no data corruption from split-brain). Designing correct failover requires addressing detection latency, data replication state, DNS propagation, and application-level failover awareness.
Failure Detection
Detect regional failure with: health check polling from multiple external vantage points (if only one vantage point declares failure, it may be a false positive due to network path issues), synthetic transaction monitoring (end-to-end tests that mimic real user flows — more accurate than server-level health checks), and metric anomaly detection (error rate spike, latency spike, drastic traffic drop). Require agreement from N of M health check locations before declaring a region failed, to avoid false positives that trigger unnecessary failovers.
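The N-of-M agreement rule can be sketched in a few lines. This is a toy illustration, not tied to any monitoring product; the vantage-point names and thresholds are made up.

```python
# Quorum check: declare the region failed only when at least `quorum`
# of the vantage points report failure, so a single vantage point with
# a bad network path cannot trigger an unnecessary failover.

def region_failed(reports: dict[str, bool], quorum: int) -> bool:
    """`reports` maps vantage-point name -> True if that vantage saw a failure."""
    failures = sum(1 for failed in reports.values() if failed)
    return failures >= quorum

reports = {
    "us-east-vp": True,    # sees failure
    "eu-west-vp": True,    # sees failure
    "ap-south-vp": False,  # still reaches the region (possible path issue)
}
region_failed(reports, quorum=2)   # True: 2 of 3 vantage points agree
region_failed(reports, quorum=3)   # False: not unanimous, keep polling
```

Tuning the quorum trades detection speed against false-positive risk: a higher quorum means fewer spurious failovers but slower reaction to a real outage.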
DNS-Based Failover
DNS-based failover updates DNS records to point to the secondary region. With AWS Route 53 health checks, if the primary endpoint fails N consecutive checks, Route 53 automatically switches the DNS response to the secondary endpoint. The DNS TTL (typically 30-60s for failover-sensitive records) determines propagation delay. Clients with cached DNS entries continue routing to the failed region until their cached TTL expires — DNS failover takes TTL seconds to complete from each client's perspective, not from the moment DNS is updated.
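The client-side view of TTL expiry can be modeled directly. This is a toy model of a single client's DNS cache; the timestamps and 60-second TTL are illustrative.

```python
# A client that resolved the record before the failover keeps routing
# to the old primary until its cached answer's TTL expires, no matter
# when the authoritative DNS record was actually flipped.

TTL = 60  # seconds, typical for failover-sensitive records

def effective_endpoint(resolved_at: float, failover_at: float, now: float) -> str:
    """Return which region this client actually reaches at time `now`."""
    cache_expires = resolved_at + TTL
    if now < cache_expires:
        # Cached answer still in effect, regardless of the DNS update.
        return "primary" if resolved_at < failover_at else "secondary"
    # Cache expired: the client re-resolves and gets the current record.
    return "secondary" if now >= failover_at else "primary"

# Client resolved at t=100, DNS flipped at t=110:
effective_endpoint(100, 110, 120)  # "primary": stale cache until t=160
effective_endpoint(100, 110, 161)  # "secondary": cache expired, re-resolved
```

This is why the completion time of DNS failover is measured per client: the last client to converge is the one that resolved just before the flip.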
Anycast for Faster Failover
Anycast routing (used by Cloudflare, AWS Global Accelerator) provides faster failover than DNS-based approaches. The same IP address is advertised from multiple regions via BGP. When a region fails, BGP withdraws the route advertisement from that region; traffic reroutes to the next nearest region within seconds (BGP convergence time). Unlike DNS, anycast failover does not depend on client-side TTL expiry — the routing change is in the network fabric, not in client DNS caches.
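The routing behavior above can be captured in a toy model: each client reaches the nearest region still advertising the shared prefix, and a BGP withdrawal simply removes that region from the candidate set. Region names and distances are illustrative.

```python
# Toy anycast model: traffic goes to the nearest region that is still
# advertising the shared IP. A failed region withdraws its advertisement,
# and traffic shifts with no client-side DNS cache involved.

def nearest_advertising_region(distances: dict[str, int],
                               advertising: set[str]) -> str:
    """Pick the closest region among those still advertising the prefix."""
    candidates = {r: d for r, d in distances.items() if r in advertising}
    return min(candidates, key=candidates.get)

distances = {"us-east": 10, "eu-west": 40, "ap-south": 90}
advertising = {"us-east", "eu-west", "ap-south"}
nearest_advertising_region(distances, advertising)   # "us-east"

# Region failure: BGP withdraws us-east's route; the next nearest
# advertising region absorbs the traffic within BGP convergence time.
advertising.discard("us-east")
nearest_advertising_region(distances, advertising)   # "eu-west"
```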
Database Failover
The hardest part of regional failover is database failover. Asynchronous replication means the secondary may be behind the primary at the time of failure. The replication lag defines the RPO (Recovery Point Objective) — the amount of data that may be lost. For zero-RPO failover, use synchronous replication: every write is acknowledged only after the secondary confirms receipt. This adds cross-region round-trip latency to every write. The tradeoff is direct: lowering RPO costs write latency, while tolerating a higher RPO keeps writes fast.
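The tradeoff can be put in numbers. Both formulas below are back-of-envelope estimates with illustrative figures, not measurements from any particular database.

```python
# Async replication: the data at risk if the primary dies now is roughly
# the write throughput multiplied by the replication lag.
def async_rpo_bytes(write_rate_bytes_per_s: float, replication_lag_s: float) -> float:
    """Worst-case unreplicated data (the effective RPO) under async replication."""
    return write_rate_bytes_per_s * replication_lag_s

# Sync replication: every write waits for the secondary's acknowledgment,
# so each commit pays the cross-region round trip.
def sync_write_latency_ms(local_commit_ms: float, cross_region_rtt_ms: float) -> float:
    return local_commit_ms + cross_region_rtt_ms

async_rpo_bytes(5_000_000, 2.0)    # 10 MB of writes at risk at 5 MB/s, 2 s lag
sync_write_latency_ms(2.0, 70.0)   # 72 ms per write instead of 2 ms
```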
Split-Brain Prevention
Split-brain occurs when both the primary and secondary regions believe they are the active region and accept writes simultaneously. Result: data divergence that is difficult or impossible to reconcile. Prevent split-brain with: fencing (a coordination service like etcd ensures only one region holds the write token at a time), primary shutdown before secondary promotion (STONITH — Shoot The Other Node In The Head), or leases (the primary must renew its lease on a coordination service; if the primary is isolated, the lease expires and the secondary promotes itself).
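The lease mechanism can be sketched as follows. In production the lease would live in a coordination service such as etcd; here a plain in-memory object stands in for it, with explicit timestamps instead of wall-clock time, so all names and TTLs are illustrative.

```python
# Minimal lease-based fencing: only the unexpired lease holder may write.
# An isolated primary cannot renew, so its lease expires and the
# secondary can safely acquire the lease and promote itself.

class Lease:
    def __init__(self, ttl_s: float):
        self.ttl_s = ttl_s
        self.holder = None
        self.expires_at = 0.0

    def acquire(self, region: str, now: float) -> bool:
        """Take or renew the lease; free or expired leases go to the caller."""
        if self.holder is None or now >= self.expires_at or self.holder == region:
            self.holder = region
            self.expires_at = now + self.ttl_s
            return True
        return False

    def may_write(self, region: str, now: float) -> bool:
        """Writes are fenced: only the current, unexpired holder may write."""
        return self.holder == region and now < self.expires_at

lease = Lease(ttl_s=10)
lease.acquire("primary", now=0)
lease.may_write("primary", now=5)     # True: lease held and unexpired
lease.may_write("secondary", now=5)   # False: fenced out, no split-brain

# The primary becomes isolated and stops renewing. After expiry (t=10)
# the secondary acquires the lease and promotes itself.
lease.acquire("secondary", now=11)
lease.may_write("primary", now=12)    # False: its lease has expired
lease.may_write("secondary", now=12)  # True
```

The key property is that at no point in time do both regions pass `may_write`: the TTL creates a window where neither may write, which is the price paid for never having both writable at once.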
Application-Level Failover
Application-level awareness of failover reduces RTO. Stateless services restart quickly in the secondary region (Kubernetes reschedules pods). Stateful services (connections, sessions, in-memory state) require reconnection after failover. Use database connection strings that resolve to the current primary (AWS RDS cluster endpoint, Route 53 CNAME that follows the database failover). Applications that hardcode IP addresses fail during failover. Session state stored in replicated Redis survives failover; in-process session state is lost.
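The connection-string point can be shown with a small sketch. A stand-in dict plays the role of DNS here; in a real application the lookup would be an actual DNS resolution (e.g. `socket.gethostbyname`) against something like an RDS cluster endpoint, and the hostname and IPs below are placeholders.

```python
# Failover-aware connection handling: resolve the stable cluster endpoint
# on every connect instead of caching an IP, so a DNS flip to the
# promoted secondary is picked up automatically on reconnect.

dns = {"db.cluster.example.internal": "10.0.1.5"}  # current primary

def connect(endpoint: str) -> str:
    """Fresh lookup on every connect -- never reuse a cached IP."""
    return f"connected to {dns[endpoint]}"

connect("db.cluster.example.internal")   # "connected to 10.0.1.5"

# Failover: DNS now points the same name at the promoted secondary.
# A hardcoded 10.0.1.5 would keep failing; the name keeps working.
dns["db.cluster.example.internal"] = "10.1.1.5"
connect("db.cluster.example.internal")   # "connected to 10.1.1.5"
```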
Failover Runbook and Testing
Document and test the failover procedure before an actual failure. Runbook: detection criteria (what metrics indicate a regional failure), decision authority (who approves failover initiation), step-by-step commands, expected behavior at each step, verification steps (how to confirm failover is complete and traffic is flowing), and rollback procedure (how to fail back to the primary once it recovers). Test failover during low-traffic periods at minimum quarterly — untested failover procedures inevitably fail when executed under stress.
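The verification step benefits from automation so the operator sees exactly which condition is still unmet. A minimal sketch, assuming a hypothetical set of checks and an illustrative 1% error-rate threshold:

```python
# Runbook verification: each check returns a (description, passed) pair,
# so a partially complete failover reports precisely what remains.

def verify_failover(dns_target: str, secondary: str,
                    health_ok: bool, error_rate: float) -> list[tuple[str, bool]]:
    return [
        ("DNS points at secondary", dns_target == secondary),
        ("secondary health checks passing", health_ok),
        ("error rate back under 1%", error_rate < 0.01),
    ]

checks = verify_failover("eu-west.example.com", "eu-west.example.com",
                         health_ok=True, error_rate=0.004)
all(ok for _, ok in checks)   # True: failover verified complete
```

Running the same checks during quarterly failover tests keeps the script itself from rotting alongside the rest of the runbook.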