Multi-Region Architecture Low-Level Design

Why Multi-Region?

A single-region deployment goes down entirely during a region-wide cloud outage; spreading across availability zones protects against an AZ failure, but not a regional one. Multi-region provides: (1) High availability: survive a full region outage with minimal downtime. (2) Latency: serve users from a geographically close region (for example, ~50ms instead of ~200ms). (3) Compliance: meet data residency requirements (for example, keeping EU personal data in EU regions to simplify GDPR compliance). (4) Disaster recovery: an RTO (Recovery Time Objective) of minutes, not hours.

Architecture Patterns

Active-Passive: one primary region handles all writes; standby regions serve reads and can be promoted on failure. Consistency is simpler because there is a single write source. Failover takes 1-5 minutes (health-check detection plus DNS propagation). Writes need no cross-region coordination, but users far from the primary pay the round trip to it on every write.

Active-Active: all regions accept reads and writes simultaneously. Lowest latency for users globally. Requires either distributed consensus or eventual consistency with conflict resolution: concurrent writes to the same record in different regions must be reconciled (last-write-wins, vector clocks, or CRDTs). Used by DynamoDB Global Tables and Cassandra (eventual consistency) and by CockroachDB (consensus).

Active-Active with region affinity: each user is pinned to a home region by a hash of their user_id. A user's writes always go to their home region, so cross-region write conflicts can occur only on shared resources. This avoids most conflict scenarios while still serving reads globally.
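As a rough illustration of region affinity, the sketch below hashes a user_id to a home region and decides whether a write arriving in some region should be applied locally or forwarded. The region list and the names home_region and route_write are assumptions made for this example, not part of any particular system.

```python
import hashlib
from typing import Sequence, Tuple

# Hypothetical region list; the article's example only names these two.
REGIONS = ("us-east-1", "eu-west-1")


def home_region(user_id: str, regions: Sequence[str] = REGIONS) -> str:
    """Map a user_id to a stable home region via a hash of the id."""
    # sha256 is stable across processes, unlike Python's built-in hash().
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return regions[int.from_bytes(digest[:8], "big") % len(regions)]


def route_write(user_id: str, receiving_region: str) -> Tuple[str, bool]:
    """Return (target_region, must_forward) for a write received in a region."""
    target = home_region(user_id)
    return target, target != receiving_region
```

One caveat with this simple form: hashing modulo the number of regions reassigns many users whenever a region is added or removed. Storing each user's home region in a lookup table after the first assignment is one way to avoid that.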

Data Replication

Primary Region (us-east-1):
  PostgreSQL Primary → Kafka (CDC via Debezium)
                     → Kafka MirrorMaker 2 → Secondary Region Kafka
                                           → PostgreSQL Read Replica (eu-west-1)

Secondary Region (eu-west-1):
  Read Replica: serves reads with ~100ms replication lag
  Promoted to primary on failover

CDC (Change Data Capture) via Debezium captures each committed row-level change as a Kafka event. Replication lag is typically under 1 second at normal write volume. Monitoring: track a replication_lag metric and alert if it exceeds 30 seconds.
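The lag check can be sketched as follows. This is a minimal example that assumes each replicated change carries its commit time from the primary and that the secondary records when it applied the change. The function names and the "three samples in a row" rule are assumptions; only the 30-second threshold comes from the text.

```python
from typing import Sequence

ALERT_THRESHOLD_SECONDS = 30.0  # alert threshold stated in the text


def replication_lag_seconds(committed_at: float, applied_at: float) -> float:
    """Lag for one change: apply time in the secondary minus commit time on the primary."""
    # Clamp at zero: a small negative value only means the two clocks disagree.
    return max(0.0, applied_at - committed_at)


def should_alert(lags: Sequence[float],
                 threshold: float = ALERT_THRESHOLD_SECONDS,
                 sustained: int = 3) -> bool:
    """Alert when the last `sustained` samples all exceed the threshold.

    Requiring several samples is an assumption, added to avoid paging on one spike.
    """
    recent = list(lags)[-sustained:]
    return len(recent) == sustained and all(lag > threshold for lag in recent)
```

Because the two timestamps come from machines in different regions, the measured lag is only as accurate as their clock synchronization.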

Global Load Balancing

Route traffic to the nearest healthy region: (1) GeoDNS: DNS returns different IPs based on the client’s geographic location. With TTL=60s, clients can keep using a failed region for up to 60s, and longer behind resolvers that ignore the TTL. (2) Anycast routing: announce the same IP from multiple regions; BGP routes each client to the nearest announcement. Used by Cloudflare. (3) Global load balancer (AWS Global Accelerator, GCP Global LB): layer-4/7 routing with health checks; failover in <30 seconds. Production recommendation: use a global LB with health checks for API traffic, and GeoDNS for static assets served via a CDN.
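Whatever the mechanism, the routing decision reduces to "nearest region that is still healthy". A minimal sketch, assuming per-region latency estimates for a client and a set of regions currently passing health checks (both inputs and the name pick_region are hypothetical):

```python
from typing import Mapping, Optional, Set


def pick_region(latency_ms: Mapping[str, float], healthy: Set[str]) -> Optional[str]:
    """Return the lowest-latency healthy region, or None if no region is healthy."""
    candidates = {region: ms for region, ms in latency_ms.items() if region in healthy}
    if not candidates:
        return None
    return min(candidates, key=candidates.__getitem__)
```

For example, a client that sees 50ms to eu-west-1 and 200ms to us-east-1 is sent to eu-west-1 while it is healthy and falls back to us-east-1 when it is not.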

Failover Procedure

  1. Health checks detect primary region failure (3 consecutive failures over 30s)
  2. Global LB stops routing new traffic to the failed region
  3. Verify the replica's lag is below the acceptable threshold before promotion (on the replica, check pg_last_wal_receive_lsn() and pg_last_wal_replay_lsn(), plus the last recorded replication_lag)
  4. Promote the read replica in the secondary region to primary (pg_promote() in PostgreSQL)
  5. Update application config to point writes to the new primary
  6. Alert on-call team; begin RCA
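The procedure above can be sketched as a small orchestration routine, with the lag check run before promotion. Every callable passed in is a hypothetical hook that a real system would implement against its own load balancer, database, and paging tools. Pausing for a human when lag is too high is an assumption of this sketch, not something the procedure prescribes.

```python
from typing import Callable


class FailureDetector:
    """Marks a region as failed after N consecutive failed health checks."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.consecutive_failures = 0

    def record(self, check_passed: bool) -> bool:
        """Record one health check; return True once the region counts as failed."""
        self.consecutive_failures = 0 if check_passed else self.consecutive_failures + 1
        return self.consecutive_failures >= self.threshold


def run_failover(stop_traffic: Callable[[], None],
                 last_known_lag_seconds: Callable[[], float],
                 promote_replica: Callable[[], None],
                 repoint_writes: Callable[[], None],
                 alert_on_call: Callable[[str], None],
                 max_lag_seconds: float = 30.0) -> bool:
    """Run the failover steps in order; return True if the replica was promoted."""
    stop_traffic()
    lag = last_known_lag_seconds()
    if lag >= max_lag_seconds:
        # Assumption: too much unreplicated data means a human decides, not automation.
        alert_on_call(f"failover paused: replica lag {lag:.1f}s is not below the limit")
        return False
    promote_replica()  # e.g. running SELECT pg_promote() on the replica
    repoint_writes()
    alert_on_call(f"failover complete: replica promoted with lag {lag:.1f}s")
    return True
```

Passing the steps in as callables keeps the ordering logic testable without touching real infrastructure.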

RTO: 1-5 minutes. RPO (Recovery Point Objective): writes made within the replication lag at the moment of failure can be lost (typically <1s; up to about 30s if lag was sitting just under the alert threshold).

Conflict Resolution for Active-Active

When two regions accept writes for the same record concurrently: (1) Last-Write-Wins (LWW): compare timestamps; the most recent update wins. Risk: clock skew between regions can cause an earlier write to win incorrectly. Hybrid logical clocks (HLC) reduce this risk by preserving causal order even when wall clocks drift. (2) Application-level merging: for counters, use CRDTs (Conflict-free Replicated Data Types). A grow-only CRDT counter keeps one count per node, merges by taking the per-node maximum, and reports the sum of those counts. (3) Conflict detection + manual resolution: detect conflicts (the same record modified in two regions during the same window), store both versions, and let the application resolve them on the next read. Used by Amazon’s shopping cart (Dynamo paper).
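To make the first two strategies concrete, here is a minimal sketch of a hybrid logical clock, an LWW merge keyed on its timestamps, and a grow-only counter CRDT. The class and function names are illustrative, the region name used as a tie-breaker is an assumption, and production implementations also handle persistence and bounds on clock drift.

```python
import time
from typing import Callable, Dict, Tuple

Timestamp = Tuple[int, int]  # (physical milliseconds, logical counter)


class HybridLogicalClock:
    """Issues timestamps that never go backwards, even if the wall clock lags."""

    def __init__(self, now_ms: Callable[[], int] = lambda: int(time.time() * 1000)):
        self._now_ms = now_ms
        self.l = 0  # highest physical time seen so far
        self.c = 0  # logical counter, breaks ties within the same l

    def tick(self) -> Timestamp:
        """Timestamp a local write."""
        pt = self._now_ms()
        if pt > self.l:
            self.l, self.c = pt, 0
        else:
            self.c += 1
        return (self.l, self.c)

    def observe(self, remote: Timestamp) -> Timestamp:
        """Fold in a timestamp received from another region."""
        pt = self._now_ms()
        rl, rc = remote
        new_l = max(self.l, rl, pt)
        if new_l == self.l and new_l == rl:
            self.c = max(self.c, rc) + 1
        elif new_l == self.l:
            self.c += 1
        elif new_l == rl:
            self.c = rc + 1
        else:
            self.c = 0
        self.l = new_l
        return (self.l, self.c)


def lww_merge(a: tuple, b: tuple) -> tuple:
    """Each version is (timestamp, region, value); highest (timestamp, region) wins."""
    return a if (a[0], a[1]) >= (b[0], b[1]) else b


class GCounter:
    """Grow-only counter: one count per region, merged by per-region max."""

    def __init__(self, region: str):
        self.region = region
        self.counts: Dict[str, int] = {}

    def increment(self, n: int = 1) -> None:
        self.counts[self.region] = self.counts.get(self.region, 0) + n

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        for region, n in other.counts.items():
            self.counts[region] = max(self.counts.get(region, 0), n)
```

With these clocks, a region whose wall clock runs behind still stamps a write made after seeing a remote update with a later timestamp than that update, so LWW does not discard the newer write. The counter merge is idempotent and order-independent, so replicas can exchange state in any order and as often as needed.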

Key Design Decisions

  • Active-passive for most services — simpler consistency, acceptable 1-5min failover
  • Active-active only for user-specific data where region affinity eliminates most conflicts
  • CDC via Kafka for cross-region replication — replayable, auditable, decoupled
  • Global LB with health checks — faster failover than GeoDNS alone
  • Monitor replication lag continuously — stale replica is worse than no replica

