System Design: Multi-Region Architecture — Active-Active, Active-Passive, Data Replication, Failover, RPO, RTO

Multi-region architecture distributes your application across multiple geographic regions to achieve low latency for global users, high availability during regional outages, and compliance with data residency regulations. It is one of the most complex system design topics, and one of the most frequently asked about in senior-level interviews. This guide covers active-active vs active-passive patterns, data replication strategies, failover mechanisms, and disaster recovery planning.

Active-Passive vs Active-Active

Active-passive: one region (the primary) handles all traffic. A second region (the standby) receives replicated data but serves no traffic during normal operation. If the primary fails, traffic is redirected to the standby (failover). Pros: simpler architecture (no write conflict resolution) and lower cost (the standby runs minimal compute). Cons: failover is not instant (it takes minutes to detect the failure and switch DNS), the standby may hold stale data because of replication lag, and users far from the single active region see high latency.

Active-active: multiple regions handle traffic simultaneously. Users are routed to the nearest region via GeoDNS or anycast, and every region accepts reads and writes. Pros: low latency for global users, no regional single point of failure (if one region fails, the others continue), and full utilization of the infrastructure in every region. Cons: write conflicts must be resolved (two users can update the same record in different regions at the same time), data replication is more complex, and cost is higher because each region runs full infrastructure.

Decision: use active-passive for cost-sensitive applications where minutes of downtime during failover are acceptable. Use active-active for global applications that require low latency and continuous availability.
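The routing difference between the two patterns can be sketched in a few lines. This is a minimal illustration, not how GeoDNS or anycast is implemented: the `Region` type, the function names, and the latency table are all hypothetical, and health is assumed to be known up front.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Region:
    name: str
    healthy: bool


def route_active_passive(primary: Region, standby: Region) -> Optional[str]:
    """Send all traffic to the primary; use the standby only on failover."""
    if primary.healthy:
        return primary.name
    if standby.healthy:
        return standby.name
    return None  # total outage: no region can serve traffic


def route_active_active(regions: List[Region],
                        latency_ms: Dict[str, float]) -> Optional[str]:
    """Send the user to the lowest-latency region that is currently healthy."""
    candidates = [r for r in regions if r.healthy]
    if not candidates:
        return None
    return min(candidates, key=lambda r: latency_ms[r.name]).name


if __name__ == "__main__":
    us = Region("us-east", healthy=True)
    eu = Region("eu-west", healthy=True)
    # Hypothetical latencies measured from a user in Europe.
    user_latency = {"us-east": 95.0, "eu-west": 18.0}
    print(route_active_passive(us, eu))                # primary serves everyone
    print(route_active_active([us, eu], user_latency)) # nearest region serves
```

In the active-passive case the European user still lands on the US primary, which is the latency cost described above; in the active-active case the same user is served locally, at the price of both regions accepting writes.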

Data Replication Strategies

Synchronous replication: every write is committed in all regions before the client is acknowledged. This guarantees zero data loss (RPO = 0), but it adds inter-region latency to every write (roughly 50-200 ms round trip between US East and Europe), which is unacceptable for many latency-sensitive applications. Used by: Google Spanner (which uses TrueTime to achieve global strong consistency with acceptable latency).

Asynchronous replication: writes are committed locally and replicated to other regions in the background. The client receives the acknowledgment after the local commit. Replication lag is typically 100 ms to 2 s. If the primary region fails, writes that were acknowledged but not yet replicated are lost, so RPO equals the replication lag at the moment of failure. Used by: most multi-region deployments (AWS RDS cross-region read replicas, DynamoDB Global Tables).

Semi-synchronous replication: writes are committed locally and acknowledged by at least one remote region before the client is acknowledged. This provides an RPO close to 0 while limiting the latency impact to one inter-region round trip. Used by: MySQL semi-synchronous replication.

Conflict resolution for active-active: when two regions write to the same record, resolve the conflict with one of: last-writer-wins (timestamp-based; simple, but the losing write is silently discarded), custom resolution logic (an application-specific merge), or CRDTs (Conflict-free Replicated Data Types: data structures that merge automatically without conflicts).
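To make the conflict resolution options concrete, here is a minimal sketch of a last-writer-wins register and a grow-only counter (G-Counter) CRDT. The class and method names are illustrative assumptions, not the API of any particular database, and timestamps are passed in by hand rather than read from a clock. Ties on timestamp are broken by region name so that both replicas always pick the same winner.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class LWWRegister:
    """Last-writer-wins register: keeps the value with the highest (timestamp, region) tag."""
    value: str = ""
    tag: Tuple[float, str] = (0.0, "")

    def write(self, value: str, timestamp: float, region: str) -> None:
        candidate = (timestamp, region)
        if candidate > self.tag:
            self.value, self.tag = value, candidate

    def merge(self, other: "LWWRegister") -> None:
        # The older write is discarded without any error: this is the data-loss risk.
        if other.tag > self.tag:
            self.value, self.tag = other.value, other.tag


@dataclass
class GCounter:
    """Grow-only counter CRDT: one slot per region, merged with an element-wise max."""
    counts: Dict[str, int] = field(default_factory=dict)

    def increment(self, region: str, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a G-Counter can only grow")
        self.counts[region] = self.counts.get(region, 0) + amount

    def merge(self, other: "GCounter") -> None:
        for region, count in other.counts.items():
            self.counts[region] = max(self.counts.get(region, 0), count)

    def value(self) -> int:
        return sum(self.counts.values())


if __name__ == "__main__":
    us, eu = LWWRegister(), LWWRegister()
    us.write("alice@old.example", timestamp=100.0, region="us-east")
    eu.write("alice@new.example", timestamp=100.5, region="eu-west")
    us.merge(eu)
    eu.merge(us)
    print(us.value, eu.value)  # both replicas converge on the later write

    us_likes, eu_likes = GCounter(), GCounter()
    us_likes.increment("us-east", 3)
    eu_likes.increment("eu-west", 2)
    us_likes.merge(eu_likes)
    eu_likes.merge(us_likes)
    print(us_likes.value(), eu_likes.value())  # no increment is lost
```

The contrast is the point: the register converges by throwing one write away, while the counter converges without losing any increment because each region only ever advances its own slot.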

RPO and RTO: Recovery Objectives

Recovery Point Objective (RPO): the maximum amount of data loss you can tolerate, measured in time. RPO = 1 hour means you can afford to lose up to 1 hour of data. RPO = 0 means no data loss is acceptable. RPO determines your replication strategy: RPO = 0 requires synchronous replication (expensive, high latency). RPO = minutes can be met with asynchronous replication, provided replication lag stays well under the target. RPO = hours can be met with periodic database backups.

Recovery Time Objective (RTO): the maximum time the system can be down, measured from the start of the outage to full recovery. RTO = 0 means the system must never be unavailable (requires active-active). RTO = minutes requires automated failover (pre-provisioned standby, health checks, DNS switching). RTO = hours allows manual failover (wake up the on-call engineer, run a runbook).

Cost increases dramatically as RPO and RTO approach zero. A system with RPO = 0 and RTO = 0 requires active-active with synchronous replication across regions, the most expensive architecture. Most applications can tolerate RPO = minutes and RTO = minutes, which is achievable with asynchronous replication and automated failover at a fraction of the cost.
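The mapping from recovery objectives to architecture can be written down as a small decision function. This is only a sketch of the rules of thumb above: the function name, the `RecoveryPlan` type, and the one-hour boundary between "minutes" and "hours" are assumptions chosen for illustration, and real thresholds depend on measured replication lag and failover time.

```python
from dataclasses import dataclass

ONE_HOUR_S = 3600.0  # assumed boundary between "minutes" and "hours"


@dataclass(frozen=True)
class RecoveryPlan:
    replication: str
    failover: str


def plan_for(rpo_seconds: float, rto_seconds: float) -> RecoveryPlan:
    """Pick the cheapest replication and failover approach that meets the targets."""
    if rpo_seconds <= 0:
        replication = "synchronous replication"
    elif rpo_seconds < ONE_HOUR_S:
        replication = "asynchronous replication"
    else:
        replication = "periodic backups"

    if rto_seconds <= 0:
        failover = "active-active"
    elif rto_seconds < ONE_HOUR_S:
        failover = "automated failover"
    else:
        failover = "manual failover"

    return RecoveryPlan(replication, failover)


if __name__ == "__main__":
    print(plan_for(rpo_seconds=0, rto_seconds=0))        # the most expensive corner
    print(plan_for(rpo_seconds=300, rto_seconds=300))    # the common middle ground
    print(plan_for(rpo_seconds=14400, rto_seconds=14400))
```

Treating RPO and RTO as two independent inputs mirrors the text: one drives the replication strategy, the other drives the failover mechanism.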

Failover Mechanisms

DNS-based failover: the DNS record for your service points to the primary region, and health checks monitor the primary. If the primary fails, the DNS record is updated to point to the standby region. Failover time = health check detection time + DNS TTL expiry. With a 60-second TTL and a 30-second health check interval, failover takes approximately 90 seconds if a single failed check is enough to trigger it; requiring several consecutive failures before declaring the region unhealthy lengthens detection proportionally. AWS Route 53 health checks with failover routing implement this natively.

Application-level failover: the application client (or API gateway) maintains a list of regional endpoints. If the primary endpoint returns errors, the client automatically retries against the secondary endpoint. This is faster than DNS failover (immediate, no TTL delay) but requires client cooperation.

Database failover: AWS RDS Multi-AZ provides automatic failover within a region (typically 1-2 minutes). Cross-region failover requires promoting a read replica to primary, a manual or scripted operation that typically takes 5-15 minutes. DynamoDB Global Tables accept writes in every replica region, so no promotion step is needed: traffic only has to be redirected to a healthy region.

Testing failover: regularly test failover procedures in production (chaos engineering). A failover mechanism that has never been tested is likely to fail when you need it most. Netflix, for example, regularly runs regional failover drills.
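Two of the mechanisms above are easy to sketch: the failover-time arithmetic for DNS, and a client that walks a list of regional endpoints. Everything here is a hypothetical illustration: the function names, the `failure_threshold` parameter, and the injected `send` callable (which stands in for a real HTTP call) are assumptions, and the estimate is a rough upper bound that ignores resolvers that cache past the TTL.

```python
from typing import Callable, List, Sequence


def estimate_dns_failover_seconds(check_interval_s: float,
                                  failure_threshold: int,
                                  dns_ttl_s: float) -> float:
    """Detection time (interval x consecutive failures required) plus DNS TTL expiry."""
    detection = check_interval_s * failure_threshold
    return detection + dns_ttl_s


class AllRegionsFailed(Exception):
    """Raised when every regional endpoint has been tried without success."""


def call_with_regional_failover(endpoints: Sequence[str],
                                send: Callable[[str], str]) -> str:
    """Try each regional endpoint in order and return the first successful response."""
    errors: List[str] = []
    for endpoint in endpoints:
        try:
            return send(endpoint)
        except ConnectionError as exc:
            # Record the failure and fall through to the next region.
            errors.append(f"{endpoint}: {exc}")
    raise AllRegionsFailed("; ".join(errors))


if __name__ == "__main__":
    # One failed check is enough: 30 s detection + 60 s TTL.
    print(estimate_dns_failover_seconds(30, 1, 60))
    # Three consecutive failures required: 90 s detection + 60 s TTL.
    print(estimate_dns_failover_seconds(30, 3, 60))

    def fake_send(endpoint: str) -> str:
        # Stand-in transport: pretend the primary region is down.
        if "us-east" in endpoint:
            raise ConnectionError("connection refused")
        return f"200 OK from {endpoint}"

    print(call_with_regional_failover(
        ["https://us-east.api.example", "https://eu-west.api.example"], fake_send))
```

The client-side path switches regions on the first failed request, which is why it beats DNS failover on speed; a production client would also need timeouts, retry budgets, and care around non-idempotent requests, none of which are shown here.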

Data Residency and Compliance

Regulations like GDPR (EU), LGPD (Brazil), and PIPL (China) restrict where personal data may be stored, processed, and transferred. Implementation:

(1) Regional data partitioning: shard data by user region. EU user data is stored in the EU region, US data in the US region. The application routes requests to the correct region based on the user profile.

(2) Data processing locality: in some cases data must not only be stored in the region but also processed there. A recommendation engine analyzing EU user data must then run in the EU, even if the results are served globally.

(3) Cross-border transfer restrictions: some regulations restrict transferring personal data outside the region. If your analytics pipeline aggregates data globally, EU user data cannot be sent to a US analytics cluster without an explicit legal basis (Standard Contractual Clauses, adequacy decisions).

Technical implementation: use a region field on every user record. Database queries include a region filter. The application layer routes to the correct regional database. A metadata service maps user IDs to regions for routing. Audit logging tracks all data access by region for compliance reporting.
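The technical implementation above can be sketched as a small router. This is an in-memory illustration under stated assumptions: `RegionalRouter`, `TransferNotAllowed`, and the `allowed_transfers` set (standing in for a recorded legal basis such as Standard Contractual Clauses) are hypothetical names, and plain dictionaries stand in for the metadata service and the regional databases.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


class TransferNotAllowed(Exception):
    """Raised when data would leave its home region without a recorded legal basis."""


@dataclass
class RegionalRouter:
    user_regions: Dict[str, str]             # metadata service: user ID -> home region
    databases: Dict[str, Dict[str, dict]]    # region -> stand-in for the regional database
    allowed_transfers: Set[Tuple[str, str]] = field(default_factory=set)
    audit_log: List[Tuple[str, str, str]] = field(default_factory=list)

    def region_for(self, user_id: str) -> str:
        return self.user_regions[user_id]

    def read_profile(self, user_id: str) -> dict:
        """Route the read to the user's home region and record the access."""
        region = self.region_for(user_id)
        self.audit_log.append(("read", user_id, region))
        return self.databases[region][user_id]

    def export_profile(self, user_id: str, destination: str) -> dict:
        """Allow a cross-region copy only if the (home, destination) pair is approved."""
        home = self.region_for(user_id)
        if destination != home and (home, destination) not in self.allowed_transfers:
            self.audit_log.append(("export-denied", user_id, destination))
            raise TransferNotAllowed(f"{home} -> {destination} has no recorded legal basis")
        self.audit_log.append(("export", user_id, destination))
        return self.databases[home][user_id]


if __name__ == "__main__":
    router = RegionalRouter(
        user_regions={"u1": "eu", "u2": "us"},
        databases={"eu": {"u1": {"name": "Anna"}}, "us": {"u2": {"name": "Bob"}}},
    )
    print(router.read_profile("u1"))
    try:
        router.export_profile("u1", "us")
    except TransferNotAllowed as exc:
        print("blocked:", exc)
    print(router.audit_log)
```

Denied exports are logged as well as successful ones, since compliance reporting usually needs to show attempted access, not only granted access.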

Frequently Asked Questions

What is the difference between active-active and active-passive multi-region?

Active-passive: one region handles all traffic (primary). A second region receives replicated data but serves no traffic (standby). On primary failure, traffic switches to the standby. Pros: simpler (no write conflicts), lower cost (standby uses minimal compute). Cons: failover takes minutes (detection + DNS propagation), standby may have stale data (replication lag), global users have high latency to the single active region. Active-active: both regions simultaneously handle traffic. Users are routed to the nearest region. Both accept reads and writes. Pros: low latency for global users, no single point of failure, full infrastructure utilization. Cons: write conflict resolution (two regions update the same record), complex replication, higher cost. Choose active-passive when minutes of downtime during failover are acceptable and most users are in one region. Choose active-active for global applications requiring low latency and near-zero downtime.

What are RPO and RTO and how do they affect architecture decisions?

Recovery Point Objective (RPO): maximum acceptable data loss measured in time. RPO = 1 hour means you can lose up to 1 hour of writes. RPO = 0 means no data loss. RPO determines replication strategy: RPO = 0 needs synchronous replication (expensive, adds inter-region latency to every write). RPO = minutes needs asynchronous replication. RPO = hours can use periodic backups. Recovery Time Objective (RTO): maximum acceptable downtime. RTO = 0 needs active-active (no failover delay). RTO = minutes needs automated failover with pre-provisioned standby. RTO = hours allows manual failover. Cost increases dramatically as RPO and RTO approach zero. Most applications can tolerate RPO = minutes and RTO = minutes, achievable with asynchronous replication and automated DNS failover at reasonable cost. A system with RPO = 0 and RTO = 0 requires active-active with synchronous replication, the most expensive architecture.

How does DNS-based failover work for multi-region systems?

DNS-based failover uses health checks to detect regional failures and updates DNS records to redirect traffic. Setup: (1) Configure health checks that probe your primary region endpoint every 10-30 seconds. (2) Set up a failover DNS policy: the primary region IP is returned by default. If the health check fails, return the secondary region IP. Failover time = health check detection time (30-60 seconds) + DNS TTL propagation (60 seconds typical) = approximately 90-120 seconds total. AWS Route 53 implements this natively with health checks and failover routing policies. Optimization: use a short DNS TTL (60 seconds) for faster failover at the cost of more DNS queries. Use health checks that verify application health (HTTP 200 from /health), not just network reachability (TCP ping). Important limitation: DNS failover depends on clients respecting TTLs. Some clients cache DNS records longer than the TTL. In practice, expect 2-5% of traffic to continue hitting the failed region for minutes after the DNS switch. For faster failover, use application-level routing (the client maintains a list of regional endpoints and switches on errors) or anycast-based routing.

How do you handle data replication conflicts in active-active multi-region?

In active-active, two regions may update the same record simultaneously, creating a conflict. Resolution strategies: (1) Last-writer-wins (LWW): the write with the latest timestamp wins. Simple but can lose data silently (the first write is discarded). DynamoDB Global Tables uses LWW by default. Requires synchronized clocks across regions (NTP accuracy is typically 1-10ms). (2) Application-level merge: define custom merge logic per data type. For a shopping cart: merge by taking the union of items from both versions. For a counter: sum the increments from both regions. Requires domain-specific knowledge. (3) CRDTs (Conflict-free Replicated Data Types): data structures mathematically guaranteed to merge without conflicts. G-Counter (grow-only counter), OR-Set (observed-remove set), LWW-Register. Used by Redis CRDT and Riak. (4) Conflict avoidance: partition writes by region. EU users write to the EU database, US users write to the US database. Conflicts only occur if a user changes regions. The simplest practical approach when users are geographically stable.