Question 1

What are RTO and RPO, and how do they shape failover system design?

Accepted Answer

RTO (Recovery Time Objective): the maximum acceptable downtime, measured from the failure to service restoration. An RTO of 5 minutes means the failover system must detect the failure, promote a standby, update DNS, and confirm traffic is flowing to the new region, all within 5 minutes. This drives how fast health checks run (a check every 30s with 3 consecutive failures required means about 90s of detection time, which leaves about 210s for everything else) and whether promotion is automated (faster) or manual (safer but slower).

RPO (Recovery Point Objective): the maximum acceptable data loss, expressed as how many seconds or minutes of committed writes can be lost when the primary fails. An RPO of 0 requires synchronous replication: every write is committed on both primary and standby before it is acknowledged. An RPO of 60 seconds allows asynchronous replication, provided replication lag stays under 60s. Synchronous replication guarantees that no acknowledged write is lost, but it adds one network round-trip to the standby region to every write (typically 20–50ms between nearby regions, more across continents).

Choose: RPO = 0 means synchronous replication and an accepted write latency penalty; RPO > 0 means asynchronous replication and an accepted, bounded data loss window.
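The arithmetic above can be checked with a small budget calculator. This is a minimal sketch: the class name, field names, and the promotion, DNS, and verification timings are illustrative assumptions; only the 30s interval, the 3-failure threshold, and the 5-minute RTO come from the answer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailoverBudget:
    """Hypothetical timing inputs for one failover, all in seconds."""
    check_interval_s: float   # time between health checks
    failures_required: int    # consecutive failures before declaring an outage
    promotion_s: float        # assumed time to promote the standby
    dns_cutover_s: float      # assumed DNS update plus TTL expiry
    verification_s: float     # assumed time to confirm traffic on the new region

    def detection_s(self) -> float:
        # An outage is declared only after N consecutive failed checks.
        return self.check_interval_s * self.failures_required

    def total_s(self) -> float:
        return (self.detection_s() + self.promotion_s
                + self.dns_cutover_s + self.verification_s)

    def fits(self, rto_s: float) -> bool:
        return self.total_s() <= rto_s


def replication_mode(rpo_s: float) -> str:
    """RPO of 0 needs synchronous replication; any positive RPO tolerates async."""
    if rpo_s < 0:
        raise ValueError("RPO cannot be negative")
    return "synchronous" if rpo_s == 0 else "asynchronous"


budget = FailoverBudget(check_interval_s=30, failures_required=3,
                        promotion_s=60, dns_cutover_s=60, verification_s=30)
print(budget.detection_s())                # 90
print(budget.total_s(), budget.fits(300))  # 240 True
print(replication_mode(0), replication_mode(60))
```

If the total exceeds the RTO, the usual levers are a shorter check interval, a lower DNS TTL, or automated promotion.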

Question 2

How do you prevent split-brain when the primary region is temporarily unreachable but not actually down?

Accepted Answer

Split-brain: two regions both believe they are primary and accept writes simultaneously. When the "failed" primary becomes reachable again (say, 30 minutes later), the two regions have divergent write histories that cannot be merged without manual data reconciliation. Prevention:

(1) Quorum-based fencing: use an odd number of regions (3 or 5) and require a quorum (majority) to elect a new primary. If us-east-1 is unreachable from eu-west-1 but can still reach ap-southeast-1, it retains quorum and remains primary; eu-west-1 cannot promote itself alone.

(2) STONITH (Shoot The Other Node In The Head): before promoting the standby, forcibly fence the old primary through the cloud provider's control plane (for example, the AWS EC2 API), such as by shutting down its network interface, so it cannot accept writes even if it recovers.

(3) Conservative promotion threshold: require 3+ consecutive health check failures (90s) from multiple monitoring locations before initiating failover, not from a single location that might itself be on the wrong side of a network partition.
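The quorum rule and the multi-location threshold can be sketched as two small predicates. This is an illustration under assumptions: the function names, the vantage-point names, and the "at least 2 locations must agree" default are hypothetical, not part of the original design.

```python
def has_quorum(reachable_peers: int, cluster_size: int) -> bool:
    """A region votes for itself plus every peer it can reach.

    It may act as (or become) primary only with a strict majority.
    """
    votes = 1 + reachable_peers
    return votes > cluster_size // 2


def should_failover(failures_by_vantage: dict[str, int],
                    threshold: int = 3,
                    min_vantages: int = 2) -> bool:
    """True when enough monitoring locations each saw `threshold`
    consecutive failed health checks against the primary."""
    agreeing = sum(1 for n in failures_by_vantage.values() if n >= threshold)
    return agreeing >= min_vantages


# Three regions: us-east-1 still reaches ap-southeast-1, eu-west-1 is isolated.
print(has_quorum(reachable_peers=1, cluster_size=3))  # True
print(has_quorum(reachable_peers=0, cluster_size=3))  # False

# Only one vantage point sees failures: likely a local partition, do not fail over.
print(should_failover({"us-west-2": 5, "eu-central-1": 0, "ap-northeast-1": 0}))  # False
```

A real deployment would normally delegate the election itself to a consensus store rather than hand-rolled vote counting; the predicates only show the decision rule.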

Question 3

How do you minimize data loss when failing over with replication lag?

Accepted Answer

If failover is initiated while the standby has 30 seconds of replication lag, those 30 seconds of writes (payments, orders, user actions) are permanently lost: they were in the primary's WAL but never replicated. Minimization strategies:

(1) Synchronous replication: no acknowledged write is lost. Trade-off: every write waits for the standby to confirm, adding 20–50ms of write latency. In Postgres, synchronous_commit=remote_write (with synchronous_standby_names configured) is less strict than synchronous_commit=on but significantly faster: the primary waits until the standby has written the WAL to its operating system, not until it has flushed to disk. This still protects against data loss when the primary crashes, though not if the standby's operating system crashes at the same moment.

(2) Wait for lag to decrease: before promoting, check lag every 5 seconds and wait up to 60 seconds for it to drop below 5 seconds. If lag doesn't improve (the standby is also degraded), promote with the current lag and accept the data loss.

(3) WAL replay from the old primary: before promoting, read any unreplicated WAL from the primary (if it is still reachable but degraded) and apply it to the standby manually. This is complex but can reduce effective data loss to near-zero even with async replication.
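Strategy (2) amounts to a bounded polling loop. The following is a minimal sketch: `wait_for_lag` and its parameters are hypothetical names, and `get_lag_s` stands in for whatever query reports standby lag in seconds in your environment. The `sleep` argument is injectable so the loop can be exercised without real waiting.

```python
import time
from typing import Callable


def wait_for_lag(get_lag_s: Callable[[], float],
                 max_lag_s: float = 5.0,
                 poll_s: float = 5.0,
                 timeout_s: float = 60.0,
                 sleep: Callable[[float], None] = time.sleep) -> tuple[bool, float]:
    """Poll replication lag until it drops below max_lag_s or timeout_s elapses.

    Returns (caught_up, last_observed_lag). The caller promotes either way;
    a False result means the remaining lag is accepted as data loss.
    """
    waited = 0.0
    lag = get_lag_s()
    while lag >= max_lag_s and waited < timeout_s:
        sleep(poll_s)
        waited += poll_s
        lag = get_lag_s()
    return lag < max_lag_s, lag


# Simulated lag readings: the standby catches up on the third check.
readings = iter([30.0, 12.0, 4.0])
caught_up, lag = wait_for_lag(lambda: next(readings), sleep=lambda s: None)
print(caught_up, lag)  # True 4.0
```

Logging the final lag value at promotion time gives a direct measurement of the actual recovery point to compare against the RPO.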

Question 4

How does a multi-region active-active setup differ from active-passive failover?

Accepted Answer

Active-passive (this design): one primary region handles all writes and reads; the standby region handles no production traffic until failover. It is simple and gives strong consistency, but the standby capacity sits idle during normal operation.

Active-active: both regions handle live traffic (reads and writes), and users are routed to their nearest region. Concurrent writes in different regions must be reconciled, typically with conflict-free replicated data types (CRDTs), last-write-wins timestamps, or vector clocks that detect conflicting updates. This is much more complex to implement. Common variants:

(1) Active-active for reads only (replicate writes from the primary to all standbys; serve reads locally): reduces read latency globally without write complexity.

(2) Active-active for writes with geographic partitioning (EU users' data lives in eu-west-1, US users' in us-east-1): no cross-region write conflicts, because data is partitioned by region.

(3) Full active-active with CRDTs or CRDT-like types (for example in Riak or Cassandra): suited to counters and sets that can merge deterministically.

For most SaaS products with strong consistency requirements, active-passive with fast automated failover is the right choice.
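To show what "merge deterministically" means for a counter, here is a minimal grow-only counter (G-Counter), one of the simplest CRDTs. It is an illustrative sketch, not the implementation used by Riak or Cassandra; the class and method names are assumptions.

```python
class GCounter:
    """Grow-only counter: one slot per region, merged by per-slot maximum."""

    def __init__(self, region: str) -> None:
        self.region = region
        self.counts: dict[str, int] = {}

    def increment(self, n: int = 1) -> None:
        if n < 0:
            raise ValueError("a grow-only counter cannot be decremented")
        # Each region only ever writes to its own slot.
        self.counts[self.region] = self.counts.get(self.region, 0) + n

    def merge(self, other: "GCounter") -> None:
        # Taking the max per slot is commutative, associative, and idempotent,
        # so replicas converge regardless of merge order or repetition.
        for region, count in other.counts.items():
            self.counts[region] = max(self.counts.get(region, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())


us = GCounter("us-east-1")
eu = GCounter("eu-west-1")
us.increment(3)   # three events recorded in the US region
eu.increment(2)   # two events recorded concurrently in the EU region

us.merge(eu)
eu.merge(us)
print(us.value(), eu.value())  # 5 5
```

Data that cannot be expressed this way (account balances with overdraft rules, unique usernames) is where active-active becomes hard and active-passive stays attractive.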

Question 5

How do you test your failover system without causing a real outage?

Accepted Answer

Failover systems that are never tested fail when you need them. Testing approaches:

(1) Scheduled failover drills: once per quarter, execute a real failover to the standby region during a low-traffic window. Measure actual RTO (time from "simulate failure" to "traffic flowing to standby"). Document gaps between expected and actual RTO.

(2) Chaos engineering: randomly terminate primary region services (Netflix Chaos Monkey style) in pre-production environments to verify that failover triggers correctly.

(3) DNS cutover test without DB failover: update DNS to point to the standby region but keep the standby pointing to the primary database. This tests DNS propagation time without risking data loss.

(4) Read replica promotion test: periodically promote a read replica to primary (in a non-production environment) to practice the DB promotion procedure and measure how long it takes.

The target: your entire team can execute a failover from memory within 5 minutes. If the runbook is complex, simplify until it isn't.
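Recording drill results in a consistent shape makes the gap between expected and actual RTO easy to track over time. A minimal sketch follows; the `DrillResult` name, its fields, and the timestamps are made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class DrillResult:
    """One failover drill, timed from simulated failure to restored traffic."""
    failure_simulated_at: datetime
    traffic_on_standby_at: datetime
    target_rto: timedelta

    @property
    def actual_rto(self) -> timedelta:
        return self.traffic_on_standby_at - self.failure_simulated_at

    @property
    def gap(self) -> timedelta:
        # Positive gap means the drill overshot the target.
        return self.actual_rto - self.target_rto

    @property
    def passed(self) -> bool:
        return self.actual_rto <= self.target_rto


# Hypothetical drill: failure simulated at 02:00:00, traffic restored at 02:06:30.
drill = DrillResult(
    failure_simulated_at=datetime(2024, 1, 1, 2, 0, 0),
    traffic_on_standby_at=datetime(2024, 1, 1, 2, 6, 30),
    target_rto=timedelta(minutes=5),
)
print(drill.actual_rto, drill.gap, drill.passed)  # 0:06:30 0:01:30 False
```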

Multi-Region Failover System Low-Level Design: Health Monitoring, Lag-Gated Promotion, and DNS-Based Traffic Cutover

Core Data Model

Health Monitoring

Automated Failover Decision Engine

Key Design Decisions