System Design Interview: Multi-Region Active-Active Architecture

Why Multi-Region Active-Active?

A single-region active-passive setup (one primary region serves traffic, one standby region for failover) has two problems: (1) users in distant regions experience high latency — a user in Tokyo reading from us-east-1 adds 150ms of network RTT to every request; (2) the primary region is a potential single point of failure. Active-active deployment runs live traffic in multiple regions simultaneously, each handling local users. This reduces latency by routing users to the nearest region and provides true HA — a region failure affects only local users, not the global service.

Key Design Challenges

  1. Write routing: where do writes go when you have multiple active primaries?
  2. Conflict resolution: what happens when two regions modify the same data concurrently?
  3. Data consistency: how stale can reads be when reading from a local replica?
  4. Regional isolation: failures in one region must not cascade to others
  5. Cross-region latency: network RTT between regions (50-200ms) limits synchronous operations

Traffic Routing: GeoDNS and Anycast

GeoDNS resolves the same domain to different IP addresses based on the user’s geographic location. AWS Route 53 latency-based routing measures actual latency from each AWS region to the user’s DNS resolver and routes to the lowest-latency region. Anycast: the same IP address is advertised from multiple data centers; BGP routing automatically sends packets to the nearest one. Cloudflare uses anycast for all its edge nodes.


# Route 53 latency-based routing:
us-east-1:  api.example.com → 34.x.x.x
eu-west-1:  api.example.com → 52.x.x.x
ap-east-1:  api.example.com → 13.x.x.x
# Route 53 measures latency from user to each region and resolves accordingly
# Health checks: if a region becomes unhealthy, DNS stops routing to it
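The routing decision above can be sketched in a few lines. This is a toy model, not how Route 53 works internally: the region names, IPs, and latency figures are illustrative, and real resolvers work from continuously updated measurements to the user's DNS resolver.

```python
# Toy GeoDNS-style resolver: pick the lowest-latency healthy region.
# Region names, IPs, and latencies below are illustrative assumptions.

REGION_IPS = {
    "us-east-1": "34.0.0.1",
    "eu-west-1": "52.0.0.1",
    "ap-east-1": "13.0.0.1",
}

def resolve(latency_ms: dict, healthy: set) -> str:
    """Return the IP of the lowest-latency healthy region."""
    candidates = {r: l for r, l in latency_ms.items() if r in healthy}
    if not candidates:
        raise RuntimeError("no healthy region available")
    best = min(candidates, key=candidates.get)
    return REGION_IPS[best]

# A Tokyo resolver sees ap-east-1 as closest:
ip = resolve({"us-east-1": 160, "eu-west-1": 220, "ap-east-1": 30},
             healthy={"us-east-1", "eu-west-1", "ap-east-1"})
# When ap-east-1's health check fails, DNS fails over to the next-best region:
ip_failover = resolve({"us-east-1": 160, "eu-west-1": 220, "ap-east-1": 30},
                      healthy={"us-east-1", "eu-west-1"})
```

The health-check set models the Route 53 behavior described above: an unhealthy region simply drops out of the candidate pool.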

Write Routing Strategies

Single Primary (Active-Passive with Regional Read Replicas)

All writes go to one region (the primary). Other regions have read replicas. Reads serve from the local replica; writes cross-region to the primary. Effective for read-heavy workloads (content, product catalogs). Write latency from distant regions is high (150ms+ cross-region RTT). Simplest to implement — no conflict resolution needed.
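The read/write split can be made concrete with a small routing sketch. The endpoint naming scheme here is hypothetical; the point is only that writes always target the primary region while reads stay local.

```python
# Sketch of single-primary routing: reads hit the local replica,
# writes always go to the primary region. Endpoint names are made up.

PRIMARY_REGION = "us-east-1"

def endpoint(operation: str, local_region: str) -> str:
    """Pick a database endpoint for a query originating in local_region."""
    if operation == "write":
        # Cross-region hop (150ms+) whenever local_region != PRIMARY_REGION
        return f"db.{PRIMARY_REGION}.internal"
    # Reads stay in-region against the local read replica
    return f"db-replica.{local_region}.internal"
```

For a read-heavy workload most traffic takes the cheap local path; only the minority of writes pay the cross-region RTT.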

Home Region Routing (Sharded by User)

Each user has a “home region” determined at signup (based on geography or load balancing). All of that user’s writes go to their home region. Other regions store a read replica for cross-region reads. The key insight: if user Alice always writes to us-east-1, there is no conflict — concurrent writes to the same user from different regions are impossible by design. Shopify uses a similar model (tenant-level region assignment).
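A minimal sketch of home-region routing follows. The assignment table is illustrative (a real system would store it in a user directory service); the invariant to notice is that writes for one user always land in one region, regardless of where the request arrived.

```python
# Sketch of home-region write routing: each user is pinned to one region
# at signup, so two regions can never accept concurrent writes for the
# same user. The assignment table below is an illustrative stand-in for
# a real user-directory lookup.

HOME_REGION = {"alice": "us-east-1", "bob": "eu-west-1"}

def route_write(user: str, local_region: str) -> str:
    """Writes go to the user's home region, even if the request arrived
    elsewhere (e.g. Alice travelling in Europe)."""
    return HOME_REGION[user]

def route_read(user: str, local_region: str) -> str:
    """Reads can be served from the local replica (possibly stale)."""
    return local_region
```

Alice's write from eu-west-1 still routes to us-east-1, which is exactly why concurrent conflicting writes cannot occur by design.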

Multi-Primary with Conflict Resolution

Any region can accept any write. Conflicts occur when two regions modify the same record in the same time window (before replication). Conflict resolution strategies:

  • Last Write Wins (LWW): the write with the higher timestamp wins. Requires synchronized clocks (Google TrueTime, NTP with bounded drift). Loses concurrent writes — one is discarded.
  • Application-defined resolution: custom merge functions per entity type. “Higher balance wins” for account credits; “union of sets” for tags.
  • CRDTs (Conflict-free Replicated Data Types): data structures whose merge operation is commutative and associative — applying updates in any order produces the same result. G-Counter (increment only), OR-Set (add/remove set with tombstones), LWW-Register. No conflicts possible — every update can be merged. Used in Riak, Cassandra counters, some Redis use cases.
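The G-Counter mentioned above is small enough to sketch in full. This is a simplified in-memory model (no networking, illustrative region names), but the merge rule, element-wise max per slot, is the real CRDT mechanism: commutative, associative, and idempotent.

```python
# Minimal G-Counter sketch: one slot per region; increments touch only
# the local slot; merge takes the element-wise max of slots.

class GCounter:
    def __init__(self, region: str, regions: list):
        self.region = region
        self.slots = {r: 0 for r in regions}

    def increment(self, n: int = 1) -> None:
        self.slots[self.region] += n  # only the local replica's slot

    def value(self) -> int:
        return sum(self.slots.values())

    def merge(self, other: "GCounter") -> None:
        # Element-wise max is commutative, associative, and idempotent,
        # so replicas converge regardless of merge order.
        for r, n in other.slots.items():
            self.slots[r] = max(self.slots[r], n)

regions = ["us-east-1", "eu-west-1"]
a = GCounter("us-east-1", regions)
b = GCounter("eu-west-1", regions)
a.increment(3)   # concurrent increments in both regions
b.increment(2)
a.merge(b)
b.merge(a)
# Both replicas converge to the same value; neither increment is lost.
```

Contrast this with LWW, where one of the two concurrent updates would have been discarded.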

Replication Topology

Data must flow between regions for cross-region reads and conflict resolution. Options:

  • Full mesh: every region replicates to every other. Highest availability; all regions have all data. Network cost grows as N². Used for small N (2-5 regions).
  • Hub-and-spoke: primary region replicates to all others; secondaries do not replicate to each other. Simpler but primary region is a bottleneck for replication.
  • CockroachDB multi-region: configures per-table region affinity. Tables with a “home region” designation replicate leaseholders to that region. Global tables (config, reference data) are fully replicated to all regions with fast local reads.

-- CockroachDB multi-region table configuration:
ALTER DATABASE app SET PRIMARY REGION "us-east1";
ALTER DATABASE app ADD REGION "europe-west1";
ALTER DATABASE app ADD REGION "asia-east1";

-- User-partitioned table (writes stay in user's home region):
ALTER TABLE user_profiles SET LOCALITY REGIONAL BY ROW;
-- CockroachDB automatically routes writes to the row's home region

-- Global reference data (config, feature flags):
ALTER TABLE feature_flags SET LOCALITY GLOBAL;
-- Reads served locally from all regions with no cross-region latency

Regional Isolation: Bulkheads and Cell Architecture

A region failure must not cascade. Design principles:

  • Regional database instances: each region has its own database cluster. Cross-region database queries are eliminated from the critical path — a region operates independently even if cross-region replication is paused.
  • Cell architecture: divide each region into independent cells (AWS availability zones, or logical shards within a region). Failures are contained within a cell. Amazon uses cell-based architecture internally — each cell handles a subset of users and has its own database, caches, and compute. A cell failure affects roughly 1/N of users (not all users in the region).
  • Circuit breakers on cross-region calls: if cross-region requests are timing out (network partition), open the circuit breaker and serve from local data (possibly stale). Degrade gracefully rather than timing out on every request.
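The circuit-breaker bullet above can be sketched as follows. This is a simplified model under assumed thresholds (3 consecutive failures to open, 30 seconds before retrying); production implementations add half-open probing, per-endpoint state, and metrics.

```python
# Sketch of a circuit breaker on cross-region calls: after a run of
# failures the circuit opens and requests fall back to local (possibly
# stale) data instead of timing out one by one. Thresholds are
# illustrative assumptions.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when circuit opened

    def call(self, remote, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                return fallback()      # circuit open: degrade gracefully
            self.opened_at = None      # cool-down elapsed: retry the remote
            self.failures = 0
        try:
            result = remote()
            self.failures = 0          # success resets the failure count
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            return fallback()
```

Usage would look like `breaker.call(lambda: fetch_cross_region(key), lambda: local_cache.get(key))`: every caller degrades to local data while the circuit is open, rather than waiting out a timeout per request.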

Disaster Recovery: RTO and RPO

Active-active is both DR and HA: a region failure triggers automatic traffic failover (Route 53 health checks stop routing to the failed region within 60 seconds). Recovery objectives:

  • RTO (Recovery Time Objective): the maximum acceptable time to restore service after a failure. With active-active, RTO is the DNS failover time — ~60 seconds.
  • RPO (Recovery Point Objective): how much data loss is acceptable. With asynchronous replication, RPO = replication lag (typically seconds). With synchronous replication across regions, RPO = 0 but write latency increases by cross-region RTT.
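These two objectives lend themselves to a back-of-envelope calculation. The figures below are illustrative assumptions, not vendor guarantees: an assumed health-check interval, failure count to trip failover, DNS TTL, and some sample replication-lag measurements.

```python
# Back-of-envelope RTO/RPO estimates for active-active failover.
# All inputs are illustrative assumptions.

def estimated_rto_s(health_check_interval_s: float,
                    failures_to_trip: int,
                    dns_ttl_s: float) -> float:
    """Worst case: detect the failure, then wait for cached DNS to expire."""
    return health_check_interval_s * failures_to_trip + dns_ttl_s

def estimated_rpo_s(replication_lag_samples_s: list) -> float:
    """With async replication, writes not yet replicated are lost on
    failover, so RPO is bounded by the observed replication lag."""
    return max(replication_lag_samples_s)

rto = estimated_rto_s(health_check_interval_s=10, failures_to_trip=3,
                      dns_ttl_s=60)              # 10*3 + 60 = 90.0 seconds
rpo = estimated_rpo_s([0.8, 1.5, 3.2, 2.1])      # worst observed lag: 3.2 s
```

Monitoring replication lag directly is what turns "RPO = seconds" from a hope into a measured number.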

Key Interview Points

  • GeoDNS routes users to the nearest region; anycast is an IP-level alternative
  • Avoid multi-primary writes by routing each user to a home region — eliminates conflicts by design
  • If multi-primary is needed: use CRDTs for conflict-free data structures, or LWW with synchronized clocks
  • Replicate with full mesh (small N) or per-table regional affinity (CockroachDB)
  • Regional isolation: independent database instances per region + circuit breakers on cross-region calls
  • Active-active RTO: 60s (DNS failover); RPO: seconds (async replication lag)

Frequently Asked Questions

How do you route users to the nearest region in a multi-region architecture?

Two primary techniques: GeoDNS and anycast. GeoDNS resolves the same domain to different IP addresses based on the geographic location of the user's DNS resolver. AWS Route 53 latency-based routing measures actual network latency from Route 53 resolvers to each AWS region and routes each DNS query to the lowest-latency region. The routing table is continuously updated based on real latency measurements. DNS TTL (typically 60 seconds) determines how quickly clients pick up routing changes — if a region fails and Route 53 stops routing to it, clients with cached DNS responses continue hitting the failed region until TTL expires. Anycast assigns the same IP prefix to multiple data centers and uses BGP routing to deliver packets to the nearest advertising node. No DNS lookup involved — routing is at the IP layer, so failover happens in BGP convergence time (seconds, not minutes). Cloudflare uses anycast for all 300+ PoPs. Combined approach: anycast at the network layer for DDoS resilience and instant failover, with application-level GeoDNS for routing specific services to specific regions.

How do CRDTs solve the conflict resolution problem in multi-region systems?

CRDTs (Conflict-free Replicated Data Types) are data structures designed so that concurrent updates from any region can always be merged into a consistent result, regardless of the order updates are applied. They achieve this by constraining operations to be commutative and associative — the merge of any two states produces the same result no matter which order they are merged. Example: a G-Counter (Grow-only Counter) assigns one counter slot per replica. Increment operations only affect the local replica's slot. The value is the sum of all slots. Two regions can increment simultaneously (both increment their local slot) — when they sync, the merge is simply summing all slots. No conflict possible. More complex types: OR-Set (observed-remove set) allows add and remove operations without conflict by tagging each element with a unique ID — removes only remove the specific tagged version, not concurrent adds. LWW-Register (last-write-wins register) uses timestamps to resolve concurrent writes — the highest timestamp wins. CRDT tradeoff: they constrain your data model. Not all business operations can be expressed as CRDTs — a "set account balance to X" is not CRDT-safe, but "add X to balance" is (a G-Counter handles increments only; a PN-Counter pairs two G-Counters to support debits as well). Design data models around CRDT-friendly operations from the start.
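The OR-Set mechanism described above — tag each add with a unique ID, and let removes delete only the tagged versions they have observed — can be sketched as follows. This is a simplified in-memory model for illustration (real implementations garbage-collect tombstones).

```python
# Minimal OR-Set sketch: adds carry unique tags; a remove covers only the
# tags this replica has observed, so a concurrent add survives the merge.
import uuid

class ORSet:
    def __init__(self):
        self.adds = set()     # set of (element, unique_tag)
        self.removes = set()  # observed-removed (element, unique_tag) pairs

    def add(self, element: str) -> None:
        self.adds.add((element, uuid.uuid4().hex))

    def remove(self, element: str) -> None:
        # Tombstone only the tagged versions this replica has seen.
        self.removes |= {(e, t) for (e, t) in self.adds if e == element}

    def contains(self, element: str) -> bool:
        return any(e == element for (e, t) in self.adds - self.removes)

    def merge(self, other: "ORSet") -> None:
        # Set union is commutative, associative, and idempotent.
        self.adds |= other.adds
        self.removes |= other.removes

a, b = ORSet(), ORSet()
a.add("urgent")          # region A adds the tag
b.merge(a)
b.remove("urgent")       # region B removes the version it observed...
a.add("urgent")          # ...while region A concurrently re-adds it
a.merge(b)
b.merge(a)
# Concurrent add wins: both replicas still contain "urgent".
```

Note the bias this encodes: when an add and a remove race, the add wins, which is usually the less surprising outcome for users.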

How do you design for regional isolation to prevent cascading failures?

Regional isolation ensures a failure in one region does not take down other regions. Key design principles:

  1. Regional database autonomy: each region has its own database cluster that can operate independently. Cross-region replication is asynchronous — if the replication link fails, each region continues serving its local users from its local database. No synchronous cross-region database calls in the critical path.
  2. Circuit breakers on cross-region dependencies: if service A in us-east-1 calls service B in eu-west-1 (anti-pattern — avoid this), a circuit breaker opens when cross-region calls start timing out, falling back to cached or default data.
  3. Cell architecture within regions: divide each region into independent cells (typically availability zones, or logical shards). A cell failure is contained — the load balancer stops routing to that cell, and other cells absorb the traffic. With 3 cells per region, each cell handles ~50% more than its normal share during a cell failure — size cells with this headroom.
  4. Stateless application servers: application servers hold no local state; all state is in the database. Replacing a failed AZ means just launching new EC2 instances — no data recovery needed.
  5. Separate DNS health checks per region: Route 53 removes a region from rotation when health checks fail consistently (e.g., 3 consecutive failures over 30 seconds). This triggers failover to healthy regions automatically.

