Multi-region architecture distributes your application across multiple geographic regions to achieve low latency for global users, high availability during regional outages, and compliance with data residency regulations. It is one of the most complex system design topics and one of the most frequently asked about in senior-level interviews. This guide covers active-active vs active-passive patterns, data replication strategies, recovery objectives (RPO and RTO), failover mechanisms, and data residency.
Active-Passive vs Active-Active
Active-passive: one region (primary) handles all traffic. A second region (standby) receives replicated data but serves no traffic during normal operation. If the primary fails, traffic is redirected to the standby (failover). Pros: simpler architecture (no write conflict resolution), lower cost (standby uses minimal compute). Cons: failover is not instant (it takes minutes to detect the failure and switch DNS), the standby region may have stale data (replication lag), and users far from the single active region see high latency.
Active-active: multiple regions handle traffic simultaneously. Users are routed to the nearest region via GeoDNS or anycast, and every active region accepts both reads and writes. Pros: low latency for global users, no single region is a point of failure (if one region fails, the others continue), full utilization of infrastructure in all regions. Cons: write conflicts must be resolved (two users can update the same record in different regions at the same time), data replication is more complex, and cost is higher (full infrastructure in each region).
Decision: use active-passive for cost-sensitive applications where minutes of downtime during failover are acceptable. Use active-active for global applications requiring low latency and continuous availability.
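The routing difference between the two patterns can be sketched in a few lines. This is a minimal illustration, not a real GeoDNS implementation: the latency table, the user labels, and both function names are hypothetical.

```python
# Hypothetical latency table (ms) from a user's location to each region.
LATENCY_MS = {
    "eu-user": {"us-east": 90, "eu-west": 15},
    "us-user": {"us-east": 10, "eu-west": 85},
}


def route_active_active(user, healthy):
    """Send the user to the lowest-latency region that is currently healthy."""
    candidates = {r: ms for r, ms in LATENCY_MS[user].items() if r in healthy}
    if not candidates:
        raise RuntimeError("no healthy region")
    return min(candidates, key=candidates.get)


def route_active_passive(primary, standby, healthy):
    """All traffic goes to the primary; the standby is used only on failover."""
    if primary in healthy:
        return primary
    if standby in healthy:
        return standby
    raise RuntimeError("no healthy region")
```

With both regions healthy, `route_active_active("eu-user", {"us-east", "eu-west"})` picks `eu-west`, while `route_active_passive("us-east", "eu-west", {"us-east", "eu-west"})` sends every user to `us-east` regardless of location.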
Data Replication Strategies
Synchronous replication: every write is committed in all regions (or a quorum of them) before the client is acknowledged. Guarantees zero data loss (RPO = 0), but adds inter-region latency to every write (50-200ms round-trip between US East and Europe), which is too slow for many latency-sensitive write paths. Used by: Google Spanner (synchronous, quorum-based replication, with TrueTime providing globally consistent timestamps at acceptable latency).
Asynchronous replication: writes are committed locally and replicated to other regions in the background. The client receives acknowledgment after the local commit. Replication lag: typically 100ms-2s. If the primary region fails, writes that were acknowledged but not yet replicated are lost, so RPO is roughly equal to the replication lag. Used by: most multi-region deployments (AWS RDS cross-region read replicas, DynamoDB Global Tables).
Semi-synchronous replication: a write is committed locally and confirmed by at least one remote replica before the client is acknowledged. Provides RPO close to 0 while limiting the latency impact to one inter-region round-trip. Used by: MySQL semi-synchronous replication, where the source waits for at least one replica to confirm receipt of the transaction.
Conflict resolution for active-active: when two regions write to the same record, resolve the conflict using last-writer-wins (timestamp-based, simple, but silently discards the losing write), custom resolution logic (application-specific merge), or CRDTs (Conflict-free Replicated Data Types: data structures that merge automatically without conflicts).
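Two of the conflict resolution options above can be shown concretely. The following is a minimal sketch, assuming roughly synchronized clocks for last-writer-wins and using a grow-only counter as the simplest CRDT; the class and function names are illustrative and do not come from any particular database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LwwValue:
    value: str
    timestamp: float  # wall-clock time of the write (assumed roughly synchronized)
    region: str       # tie-breaker so every region picks the same winner


def lww_merge(a, b):
    """Last-writer-wins: keep the write with the higher (timestamp, region)."""
    return a if (a.timestamp, a.region) >= (b.timestamp, b.region) else b


class GCounter:
    """Grow-only counter CRDT: one slot per region, merged by element-wise max."""

    def __init__(self):
        self.counts = {}

    def increment(self, region, n=1):
        # Each region only ever increments its own slot.
        self.counts[region] = self.counts.get(region, 0) + n

    def merge(self, other):
        # Taking the max per slot is commutative, associative, and idempotent,
        # so replicas converge regardless of merge order or repetition.
        for region, count in other.counts.items():
            self.counts[region] = max(self.counts.get(region, 0), count)

    def value(self):
        return sum(self.counts.values())
```

If `us-east` writes a value at timestamp 100.0 and `eu-west` writes a different value at 100.5, `lww_merge` keeps the `eu-west` write in both regions and the other write is discarded. With the counter, three increments in one region and two in another merge to 5 on both sides, with nothing lost.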
RPO and RTO: Recovery Objectives
Recovery Point Objective (RPO): the maximum amount of data loss you can tolerate, measured in time. RPO = 1 hour means you can afford to lose up to 1 hour of data. RPO = 0 means no data loss is acceptable. RPO determines your replication strategy: RPO = 0 requires synchronous replication (expensive, high latency). RPO = minutes requires asynchronous replication with frequent checkpoints. RPO = hours can use periodic database backups.
Recovery Time Objective (RTO): the maximum time the system can be down, measured from the start of the outage to full recovery. RTO = 0 means the system must never be unavailable (requires active-active). RTO = minutes requires automated failover (pre-provisioned standby, health checks, DNS switching). RTO = hours allows manual failover (wake up the on-call engineer, run a runbook).
Cost increases dramatically as RPO and RTO approach zero. A system with RPO = 0 and RTO = 0 requires active-active with synchronous replication across regions, which is the most expensive architecture. Most applications can tolerate RPO = minutes and RTO = minutes, which is achievable with asynchronous replication and automated failover at a fraction of the cost.
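The mapping from recovery objectives to architecture can be written down as a small lookup. This is a sketch only: the one-hour cutoff between "minutes" and "hours" is an assumption made for illustration, and the function names are hypothetical.

```python
ONE_HOUR = 3600  # assumed boundary between "minutes" and "hours" objectives


def replication_strategy(rpo_seconds):
    """Pick the cheapest replication approach that meets the RPO."""
    if rpo_seconds == 0:
        return "synchronous replication"
    if rpo_seconds < ONE_HOUR:
        return "asynchronous replication"
    return "periodic backups"


def failover_strategy(rto_seconds):
    """Pick the cheapest failover approach that meets the RTO."""
    if rto_seconds == 0:
        return "active-active"
    if rto_seconds < ONE_HOUR:
        return "automated failover"
    return "manual failover"
```

For example, an application with a 5-minute RPO and a 10-minute RTO maps to asynchronous replication with automated failover, the combination described above as sufficient for most systems.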
Failover Mechanisms
DNS-based failover: the DNS record for your service points to the primary region, and health checks monitor the primary. If the primary fails, the record is updated to point to the standby region. Failover time = failure detection time + DNS TTL expiry. With a 30-second health check interval and a 60-second TTL, failover takes approximately 90 seconds if a single failed check is enough to declare the primary unhealthy; requiring several consecutive failures lengthens detection accordingly, and resolvers or clients that ignore the TTL can take longer still. AWS Route 53 health checks with failover routing implement this natively.
Application-level failover: the application client (or API gateway) maintains a list of regional endpoints. If the primary endpoint returns errors, the client automatically retries against the secondary endpoint. Faster than DNS failover (no TTL delay) but requires client cooperation.
Database failover: AWS RDS Multi-AZ provides automatic failover within a region (typically 1-2 minutes). Cross-region failover requires promoting a read replica to primary, a manual or scripted operation that takes 5-15 minutes. DynamoDB Global Tables accept writes in every replica region, so no database promotion is needed; traffic only has to be redirected to a healthy region.
Testing failover: regularly test failover procedures in production (chaos engineering). A failover mechanism that has never been tested is likely to fail when you need it most. Netflix, for example, regularly rehearses regional failover in production.
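The DNS failover arithmetic and the application-level retry can both be sketched briefly. Everything here is illustrative: `dns_failover_seconds` is a rough estimate rather than a model of any specific DNS provider, `failure_threshold` is an assumed parameter for the number of consecutive failed checks required, and `send` is a hypothetical caller-supplied function that raises `ConnectionError` when a region is unreachable.

```python
def dns_failover_seconds(check_interval_s, failure_threshold, ttl_s):
    """Rough estimate: time to detect the failure plus time for cached
    DNS answers to expire."""
    return check_interval_s * failure_threshold + ttl_s


def call_with_failover(endpoints, send):
    """Try each regional endpoint in order and return the first success.

    `send(endpoint)` performs the request and raises ConnectionError on failure.
    """
    last_error = None
    for endpoint in endpoints:
        try:
            return send(endpoint)
        except ConnectionError as exc:
            last_error = exc  # remember the failure and try the next region
    raise RuntimeError("all regional endpoints failed") from last_error
```

`dns_failover_seconds(30, 1, 60)` gives the 90-second figure from the text, and raising the threshold to 3 consecutive failures gives 150 seconds. The retry loop is only safe as written for idempotent requests; retrying a write against a second region after an ambiguous failure can apply it twice unless the request carries an idempotency key.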
Data Residency and Compliance
Regulations like GDPR (EU), LGPD (Brazil), and PIPL (China) restrict where personal data may be stored and processed and under what conditions it may leave its region of origin. Some, such as PIPL, impose explicit localization requirements in certain cases; others, such as GDPR, primarily restrict cross-border transfers. Implementation:
(1) Regional data partitioning: shard data by user region. EU user data is stored in the EU region, US data in the US region. The application routes requests to the correct region based on the user profile.
(2) Data processing locality: storing data in the region is often not enough; it may also have to be processed there. A recommendation engine analyzing EU user data would then have to run in the EU, even if the results are served globally.
(3) Cross-border transfer restrictions: some regulations restrict transferring personal data outside the region. If your analytics pipeline aggregates data globally, EU user data cannot be sent to a US analytics cluster without an explicit legal basis (Standard Contractual Clauses, adequacy decisions).
Technical implementation: put a region field on every user record and include a region filter in database queries. The application layer routes each request to the correct regional database, using a metadata service that maps user IDs to regions. Audit logging tracks all data access by region for compliance reporting.
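The technical implementation described above can be sketched with in-memory stand-ins. The dictionaries below are hypothetical placeholders for a metadata service, per-region databases, and an audit log; the `allow_cross_border` flag is an assumed way to represent a transfer that has a documented legal basis, not a legal determination.

```python
# Hypothetical metadata service contents: user ID -> home region.
USER_REGION = {"u-100": "eu", "u-200": "us"}

# Hypothetical per-region databases; every record carries a region field.
REGIONAL_DB = {
    "eu": {"u-100": {"region": "eu", "email": "anna@example.com"}},
    "us": {"u-200": {"region": "us", "email": "bob@example.com"}},
}

AUDIT_LOG = []


def get_user(user_id, caller_region, allow_cross_border=False):
    """Route the read to the user's home region and audit the access."""
    data_region = USER_REGION[user_id]
    cross_border = data_region != caller_region
    # Log every attempt, including the ones that are refused below.
    AUDIT_LOG.append({
        "user_id": user_id,
        "data_region": data_region,
        "caller_region": caller_region,
        "cross_border": cross_border,
    })
    if cross_border and not allow_cross_border:
        raise PermissionError(
            f"{caller_region} may not read {data_region} data without a legal basis"
        )
    return REGIONAL_DB[data_region][user_id]
```

A service running in the EU can call `get_user("u-100", "eu")` normally, while the same call from a US service raises `PermissionError` unless the transfer has been explicitly allowed. Both attempts end up in the audit log.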