Low Level Design: Multi-Region Architecture

A multi-region architecture deploys services and data across multiple geographic regions to achieve low latency for global users, disaster recovery, and compliance with data residency requirements. The core challenges are data replication across regions, handling regional failures, and routing users to the optimal region.

Active-Active vs Active-Passive

Active-active: all regions serve traffic simultaneously. Each region handles a portion of users (geographic routing), and writes accepted in one region must be replicated to the others. Conflicts are possible when the same record is updated in two regions at nearly the same time. Active-passive: one primary region handles all writes; secondary regions serve reads and take over if the primary fails. Active-active gives better utilization and lower write latency for users far from the primary, since they write to a nearby region; active-passive is simpler to reason about for consistency because there is a single writer.

Geographic Traffic Routing

DNS-based geographic routing (Route 53 Geolocation, Cloudflare GeoDNS) directs users to the nearest region based on IP geolocation. Latency-based routing selects the region with the lowest measured RTT to the user, which may differ from the geographically closest region because of network topology. Health checks automatically remove unhealthy regions from DNS responses, providing failover without manual intervention. Clients and resolvers keep using cached records until they expire, so the DNS TTL bounds failover time to seconds or minutes.
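As a rough illustration of latency-based routing combined with health checks, the sketch below picks the healthy region with the lowest measured RTT. The `Region` type, the region names, and the RTT figures are hypothetical; a real DNS provider makes this decision internally from its own measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str
    rtt_ms: float   # measured round-trip time from the user's network (assumed input)
    healthy: bool   # outcome of the most recent health check (assumed input)


def pick_region(regions):
    """Return the healthy region with the lowest measured RTT."""
    candidates = [r for r in regions if r.healthy]
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=lambda r: r.rtt_ms)


# Hypothetical measurements for one user.
regions = [
    Region("eu-west", rtt_ms=24.0, healthy=False),   # nearest, but failing checks
    Region("us-east", rtt_ms=95.0, healthy=True),
    Region("ap-south", rtt_ms=180.0, healthy=True),
]
print(pick_region(regions).name)  # us-east
```

The nearest region is skipped while it fails health checks, which is the failover behavior described above; the user returns to it once it is healthy again.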

Cross-Region Data Replication

Replicate data asynchronously between regions to avoid adding cross-region latency to write operations. Replication lag between continents is often in the range of 100-500ms under normal load, but it can grow well beyond that during network congestion or write bursts. Because replication is asynchronous, a region failover may lose writes that had not yet been replicated (RPO > 0). For zero RPO, use synchronous multi-region writes (strong consistency), at the cost of adding at least one cross-region round-trip to every write. CockroachDB and Google Spanner provide synchronous multi-region transactions.
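The trade-off can be made concrete with some back-of-the-envelope arithmetic. The sketch below assumes a leader-based majority quorum (as in Raft- or Paxos-style replication) and ignores disk and processing time; both function names and all numbers are illustrative, not measurements from any particular database.

```python
def quorum_commit_latency_ms(follower_rtts_ms):
    """Approximate commit latency for a leader-based majority quorum.

    The leader counts as one vote, so a write commits once enough
    followers have acknowledged it to form a majority of all replicas.
    """
    replicas = len(follower_rtts_ms) + 1      # followers plus the leader
    majority = replicas // 2 + 1
    acks_needed = majority - 1                # the leader has already voted
    if acks_needed == 0:
        return 0.0
    # The commit waits for the slowest of the fastest `acks_needed` followers.
    return sorted(follower_rtts_ms)[acks_needed - 1]


def writes_at_risk(writes_per_second, replication_lag_ms):
    """Estimate writes not yet replicated when an async primary fails."""
    return writes_per_second * replication_lag_ms / 1000.0


# Three replicas: one follower in a nearby region, one on another continent.
print(quorum_commit_latency_ms([12.0, 80.0]))   # 12.0
# Three replicas, both followers on other continents.
print(quorum_commit_latency_ms([80.0, 150.0]))  # 80.0
# Asynchronous replication: 2000 writes/s with 250 ms of lag.
print(writes_at_risk(2000, 250))                # 500.0
```

Under these assumptions, placing a quorum of replicas close together keeps synchronous write latency low, while the asynchronous estimate shows roughly how many writes a failover could lose.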

Conflict Resolution

In active-active, concurrent writes to the same record in different regions create conflicts. Resolution strategies: last-write-wins (LWW) using timestamps (simple, but it silently discards the losing write, and clock skew can cause the wrong write to win), CRDTs (Conflict-free Replicated Data Types, which merge automatically and deterministically for counters, sets, and sequences), application-level resolution (present the conflict to the user), or region ownership (each record has a home region that owns its writes; writes arriving in other regions are routed to it).
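To contrast two of these strategies, here is a minimal sketch of an LWW register and a grow-only counter (G-Counter), one of the simplest CRDTs. The class names, region names, and timestamps are illustrative assumptions; production systems typically use a library or database feature rather than hand-rolled types.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LWWRegister:
    value: str
    timestamp: float   # wall-clock time of the write (subject to clock skew)
    region: str        # tie-breaker so every replica picks the same winner

    def merge(self, other):
        # Keep the write with the larger (timestamp, region); the other is dropped.
        return max(self, other, key=lambda r: (r.timestamp, r.region))


@dataclass
class GCounter:
    counts: dict = field(default_factory=dict)   # region -> local increment total

    def increment(self, region, amount=1):
        self.counts[region] = self.counts.get(region, 0) + amount

    def merge(self, other):
        # Per-region maximum: commutative, associative, and idempotent.
        regions = self.counts.keys() | other.counts.keys()
        return GCounter({r: max(self.counts.get(r, 0), other.counts.get(r, 0))
                         for r in regions})

    def value(self):
        return sum(self.counts.values())


# LWW: two regions update the same field; only one write survives.
a = LWWRegister("alice@old.example", timestamp=100.0, region="us-east")
b = LWWRegister("alice@new.example", timestamp=100.5, region="eu-west")
print(a.merge(b).value)   # alice@new.example

# G-Counter: concurrent increments in two regions are both preserved.
us, eu = GCounter(), GCounter()
us.increment("us-east", 3)
eu.increment("eu-west", 2)
print(us.merge(eu).value())   # 5
```

The LWW merge loses one of the two writes by design, whereas the counter merge keeps both contributions regardless of the order in which replicas exchange state.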

Data Residency and Compliance

GDPR and other regional regulations may require user data to remain within specific geographic boundaries. Implement region-pinned storage: EU users' data is stored only in EU regions and never replicated outside them. The user's home region is determined at account creation, based on IP address or explicit selection. Reads of EU users' data must be routed to an EU region, and replication of EU data to non-EU regions is prohibited. Audit logs record where data is stored and accessed so that compliance can be demonstrated.
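A region-pinning rule like this is often enforced as a policy check in the replication and routing layers. The sketch below is one possible shape for such a check; the zone table, region names, and function names are hypothetical, and it models only the EU pinning described above.

```python
# Hypothetical residency policy: data homed in a pinned zone stays in that zone.
PINNED_ZONES = {
    "eu": frozenset({"eu-west-1", "eu-central-1"}),
}
ALL_REGIONS = frozenset({"eu-west-1", "eu-central-1", "us-east-1", "ap-south-1"})


def allowed_regions(home_region):
    """Return the regions where data with this home region may be stored."""
    for zone_regions in PINNED_ZONES.values():
        if home_region in zone_regions:
            return zone_regions
    return ALL_REGIONS   # unpinned data may be replicated anywhere


def can_replicate(home_region, target_region):
    """True if a record homed in home_region may be copied to target_region."""
    return target_region in allowed_regions(home_region)


print(can_replicate("eu-west-1", "eu-central-1"))  # True
print(can_replicate("eu-west-1", "us-east-1"))     # False
print(can_replicate("us-east-1", "eu-west-1"))     # True
```

Keeping the policy in one table makes it easier to audit, and the same `allowed_regions` lookup can be reused to route reads for a pinned user.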

Regional Failure Handling

Design for regional isolation: failures in one region should not cascade to others. Use separate VPCs or separate cloud accounts per region (availability zones isolate failures within a region, not between regions). Circuit breakers on cross-region calls prevent a degraded region from slowing down healthy regions. On regional failure, DNS failover redirects traffic to healthy regions. Test regional failover periodically with chaos engineering experiments (for example, disabling a region during a low-traffic period).
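A circuit breaker for cross-region calls might look like the following minimal sketch. It is single-threaded and simplified; the class name, thresholds, and injectable `clock` parameter are assumptions made for illustration, and a real service would more likely rely on an established resilience library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0, clock=time.monotonic):
        self._failure_threshold = failure_threshold
        self._reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None   # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout_s:
                # Fail fast instead of waiting on a degraded region.
                raise CircuitOpenError("cross-region call skipped: circuit open")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._failure_threshold:
                self._opened_at = self._clock()   # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


def fetch_from_remote_region():
    # Stand-in for a cross-region RPC that is timing out.
    raise TimeoutError("remote region did not respond")


breaker = CircuitBreaker(failure_threshold=2, reset_timeout_s=30.0)
for _ in range(3):
    try:
        breaker.call(fetch_from_remote_region)
    except TimeoutError:
        print("call failed")
    except CircuitOpenError:
        print("call skipped")
```

With a threshold of 2, the first two calls fail and the third is skipped without contacting the remote region, which is what keeps a slow region from tying up threads in a healthy one.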

Global Load Balancing

A global load balancer (Cloudflare Load Balancing, AWS Global Accelerator, GCP Cloud Load Balancing) sits above the regional load balancers. It health-checks each region and distributes traffic by weight, latency, or failover priority. Unlike DNS-based routing, anycast-based global load balancers advertise the same IP address from many PoPs, so connections are routed to the nearest PoP. This reduces connection setup latency, and because clients keep using the same address, failover does not wait for DNS TTLs to expire; it is bounded instead by the health-check interval.
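Weight-based distribution with health checks can be sketched as a normalization over the healthy regions. The function name, region names, and weights below are hypothetical; managed load balancers expose this as configuration rather than code.

```python
def traffic_shares(weights, healthy):
    """Return each healthy region's share of traffic, proportional to its weight."""
    live = {region: w for region, w in weights.items() if region in healthy and w > 0}
    total = sum(live.values())
    if total == 0:
        raise RuntimeError("no healthy region with positive weight")
    return {region: w / total for region, w in live.items()}


# Hypothetical configured weights.
weights = {"us-east": 50, "eu-west": 30, "ap-south": 20}

# All regions healthy: shares follow the configured weights.
print(traffic_shares(weights, healthy={"us-east", "eu-west", "ap-south"}))

# eu-west fails its health check: its share is spread over the remaining regions.
print(traffic_shares(weights, healthy={"us-east", "ap-south"}))
```

When a region drops out, the remaining regions absorb its traffic in proportion to their weights, so each region needs enough headroom to take on that extra load.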

Multi-Region Deployment Pipeline

Deploy to regions sequentially (canary region first, then staged rollout) to catch region-specific issues before global rollout. Deployment order: canary region (low traffic, typically a secondary region) → primary region → remaining secondary regions. Each stage must pass monitoring validation before the next one starts. Because old and new code versions run in different regions at the same time during a rollout, schema migrations must be compatible with both.
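The staged rollout with validation gates can be sketched as a loop that halts at the first stage failing its checks. The `deploy` and `validate` callables and the region names are placeholders for whatever deployment tooling and monitoring are actually in use.

```python
def rollout(stages, deploy, validate):
    """Deploy stage by stage; stop at the first stage that fails validation.

    Returns (completed_stages, failed_stage); failed_stage is None on success.
    """
    completed = []
    for stage in stages:
        for region in stage:
            deploy(region)
        # Gate: every region in the stage must look healthy before continuing.
        if not all(validate(region) for region in stage):
            return completed, stage
        completed.append(stage)
    return completed, None


# Hypothetical order: canary region, then primary, then remaining secondaries.
stages = [["ap-south"], ["us-east"], ["eu-west", "us-west"]]
deployed = []


def deploy(region):
    deployed.append(region)        # stand-in for the real deployment step


def validate(region):
    return region != "us-east"     # simulate a failed check in the primary region


completed, failed = rollout(stages, deploy, validate)
print(completed)  # [['ap-south']]
print(failed)     # ['us-east']
print(deployed)   # ['ap-south', 'us-east']
```

In this run the remaining secondary regions never receive the new version, which limits the impact of the failure to the regions already deployed.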
