Question 1

What is the difference between active-active and active-passive multi-region?

Accepted Answer

Active-passive: one region (the primary) handles all traffic. A second region (the standby) receives replicated data but serves no traffic. If the primary fails, traffic switches to the standby. Pros: simpler, because only one region accepts writes and there are no write conflicts; lower cost, because the standby runs minimal compute. Cons: failover takes minutes (failure detection plus DNS propagation); the standby may hold stale data because of replication lag; users far from the single active region see high latency.

Active-active: both regions handle traffic at the same time. Users are routed to the nearest region, and both regions accept reads and writes. Pros: low latency for global users, no regional single point of failure, and full use of the provisioned infrastructure. Cons: write conflicts must be resolved when two regions update the same record; replication is more complex; cost is higher.

Choose active-passive when minutes of downtime during failover are acceptable and most users are in one region. Choose active-active for global applications that require low latency and near-zero downtime.
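The routing difference between the two models can be sketched in a few lines. This is only an illustration: the `Region` type, its fields, and both function names are hypothetical, and a real deployment would make this decision in DNS or a global load balancer rather than in application code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Region:
    name: str
    healthy: bool
    latency_ms: float  # assumed: measured latency from the user to this region


def route_active_passive(primary: Region, standby: Region) -> Region:
    """All traffic goes to the primary; the standby is used only on failure."""
    if primary.healthy:
        return primary
    if standby.healthy:
        return standby
    raise RuntimeError("no healthy region available")


def route_active_active(regions: List[Region]) -> Region:
    """Each user is sent to the nearest region that is currently healthy."""
    healthy = [r for r in regions if r.healthy]
    if not healthy:
        raise RuntimeError("no healthy region available")
    return min(healthy, key=lambda r: r.latency_ms)
```

In the active-passive function the standby's latency is never considered, which is why distant users pay the latency cost of the single active region.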

Question 2

What are RPO and RTO and how do they affect architecture decisions?

Accepted Answer

Recovery Point Objective (RPO): the maximum acceptable data loss, measured in time. RPO = 1 hour means you can lose up to 1 hour of writes. RPO = 0 means no data loss. RPO determines the replication strategy: RPO = 0 needs synchronous replication (expensive, and it adds inter-region latency to every write). RPO = minutes needs asynchronous replication. RPO = hours can use periodic backups.

Recovery Time Objective (RTO): the maximum acceptable downtime. RTO = 0 needs active-active (no failover delay). RTO = minutes needs automated failover with a pre-provisioned standby. RTO = hours allows manual failover.

Cost increases dramatically as RPO and RTO approach zero. Most applications can tolerate RPO = minutes and RTO = minutes, which is achievable at reasonable cost with asynchronous replication and automated DNS failover. A system with RPO = 0 and RTO = 0 requires active-active with synchronous replication, the most expensive architecture.
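The mapping from objectives to architecture can be written down as a small lookup. This is a rough sketch of the rules of thumb above, not a sizing tool: the function names are hypothetical, and the one-hour boundary between "minutes" and "hours" is an assumption chosen for illustration.

```python
from datetime import timedelta

# Assumed cut-off between "minutes" and "hours"; the answer gives no exact value.
HOURS_THRESHOLD = timedelta(hours=1)


def replication_strategy(rpo: timedelta) -> str:
    """Pick a replication strategy from the acceptable data-loss window."""
    if rpo <= timedelta(0):
        return "synchronous replication"
    if rpo < HOURS_THRESHOLD:
        return "asynchronous replication"
    return "periodic backups"


def failover_strategy(rto: timedelta) -> str:
    """Pick a failover approach from the acceptable downtime."""
    if rto <= timedelta(0):
        return "active-active"
    if rto < HOURS_THRESHOLD:
        return "automated failover with pre-provisioned standby"
    return "manual failover"
```

For example, `replication_strategy(timedelta(minutes=5))` and `failover_strategy(timedelta(minutes=5))` select asynchronous replication with automated failover, the combination the answer describes as sufficient for most applications.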

Question 3

How does DNS-based failover work for multi-region systems?

Accepted Answer

DNS-based failover uses health checks to detect a regional failure and then updates DNS records to redirect traffic. Setup: (1) Configure health checks that probe your primary region endpoint every 10-30 seconds. (2) Set up a failover DNS policy: the primary region IP is returned by default, and the secondary region IP is returned if the health check fails. AWS Route 53 implements this natively with health checks and failover routing policies.

Failover time = health check detection time (30-60 seconds) + DNS TTL expiry (60 seconds typical) = approximately 90-120 seconds total.

Optimization: use a short DNS TTL (60 seconds) for faster failover, at the cost of more DNS queries. Use health checks that verify application health (HTTP 200 from /health), not just network reachability (a successful TCP connection).

Important limitation: DNS failover depends on clients and resolvers respecting TTLs, and some cache DNS records longer than the TTL. As a rough rule of thumb, expect a small share of traffic (on the order of 2-5%) to keep hitting the failed region for minutes after the DNS switch. For faster failover, use application-level routing (the client maintains a list of regional endpoints and switches on errors) or anycast-based routing.
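The failover-time arithmetic and the application-level alternative can both be sketched briefly. The names here are hypothetical, and `failure_threshold` (how many consecutive failed probes mark a region as down) is an assumed parameter used to turn a probe interval into a detection time; the `send` callable stands in for whatever HTTP client the application uses.

```python
from typing import Callable, Sequence


def estimated_failover_seconds(
    check_interval_s: float, failure_threshold: int, dns_ttl_s: float
) -> float:
    """Rough estimate: detection time plus one full TTL for caches to expire."""
    detection_s = check_interval_s * failure_threshold
    return detection_s + dns_ttl_s


def call_with_regional_failover(
    endpoints: Sequence[str], send: Callable[[str], str]
) -> str:
    """Client-side routing: try each regional endpoint in order, switching on errors."""
    last_error = None
    for endpoint in endpoints:
        try:
            return send(endpoint)
        except ConnectionError as exc:
            last_error = exc  # remember the failure and try the next region
    raise RuntimeError("all regional endpoints failed") from last_error
```

With a 10-second probe interval, three failed probes, and a 60-second TTL, the estimate is 90 seconds; with a 30-second interval and two failed probes it is 120 seconds, matching the range given above. The client-side function does not wait for DNS at all, which is why it can fail over faster.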

Question 4

How do you handle data replication conflicts in active-active multi-region?

Accepted Answer

In active-active, two regions may update the same record at nearly the same time, creating a conflict. Resolution strategies:

(1) Last-writer-wins (LWW): the write with the latest timestamp wins. Simple, but it can lose data silently (the write with the earlier timestamp is discarded). DynamoDB Global Tables uses LWW. It relies on synchronized clocks across regions; NTP typically keeps clocks within a few milliseconds to tens of milliseconds, so writes that arrive closer together than the clock skew can be ordered incorrectly.

(2) Application-level merge: define custom merge logic per data type. For a shopping cart: merge by taking the union of items from both versions. For a counter: sum the increments from both regions. Requires domain-specific knowledge.

(3) CRDTs (Conflict-free Replicated Data Types): data structures whose merge operation is mathematically guaranteed to converge without conflicts. Examples: G-Counter (grow-only counter), OR-Set (observed-remove set), LWW-Register. Used by Redis Enterprise Active-Active databases and Riak.

(4) Conflict avoidance: partition writes by region. EU users write to the EU database, US users write to the US database. Conflicts only occur if a user changes regions. This is the simplest practical approach when users are geographically stable.
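The first three strategies can be illustrated with small, self-contained merge functions. This is a minimal sketch under stated assumptions, not how DynamoDB, Redis, or Riak implement them: all names are hypothetical, and the region name is used as an assumed tie-breaker so that both regions pick the same winner when timestamps are equal.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass(frozen=True)
class Versioned:
    value: str
    timestamp_ms: int
    region: str  # assumed tie-breaker for identical timestamps


def lww_merge(a: Versioned, b: Versioned) -> Versioned:
    """Last-writer-wins: keep the later write; the other is silently dropped."""
    return max(a, b, key=lambda v: (v.timestamp_ms, v.region))


def merge_carts(a: Set[str], b: Set[str]) -> Set[str]:
    """Application-level merge for a shopping cart: union of both versions."""
    return a | b


@dataclass
class GCounter:
    """Grow-only counter CRDT: one monotonically increasing slot per region."""
    counts: Dict[str, int] = field(default_factory=dict)

    def increment(self, region: str, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a G-Counter can only grow")
        self.counts[region] = self.counts.get(region, 0) + amount

    def merge(self, other: "GCounter") -> "GCounter":
        # Per-region maximum makes merge commutative, associative, and idempotent.
        regions = self.counts.keys() | other.counts.keys()
        return GCounter(
            {r: max(self.counts.get(r, 0), other.counts.get(r, 0)) for r in regions}
        )

    @property
    def value(self) -> int:
        return sum(self.counts.values())
```

Merging the same `GCounter` state twice does not change the total, which is the property that lets replicas exchange state in any order. The cart union has a known drawback: an item removed in one region can reappear after the merge, which is the problem an OR-Set is designed to handle.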

System Design: Multi-Region Architecture — Active-Active, Active-Passive, Data Replication, Failover, RPO, RTO

Active-Passive vs Active-Active

Data Replication Strategies

RPO and RTO: Recovery Objectives

Failover Mechanisms

Data Residency and Compliance