Multi-Region Architecture Low-Level Design

Why Multi-Region?

Single-region deployments fail entirely during a full region outage (multi-AZ redundancy covers an AZ failure, but not the region itself). Multi-region provides: (1) High availability — survive a full region outage without downtime. (2) Latency — serve users from geographically close regions (e.g., 50ms instead of 200ms round trips). (3) Compliance — data residency requirements (e.g., keeping EU user data stored in the EU). (4) Disaster recovery — an RTO (Recovery Time Objective) of minutes, not hours.

Architecture Patterns

Active-Passive: one primary region handles all writes; standby regions serve reads and can be promoted on failure. Simpler consistency (single write source). Failover takes 1-5 minutes (health-check detection + DNS propagation + replica promotion). All writes go to the single primary, so distant users see no write-latency benefit over a single-region deployment.

Active-Active: all regions accept reads and writes simultaneously. Lowest latency for users globally. Requires distributed consensus or eventual consistency — concurrent writes to different regions for the same record must be resolved (last-write-wins, vector clocks, or CRDT). Used by DynamoDB Global Tables, Cassandra, CockroachDB.

Active-Active with region affinity: users are pinned to a primary region by their user_id hash. Writes from user always go to their home region. Cross-region writes happen only for shared resources. Avoids most conflict scenarios while still serving reads globally.
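The region-affinity pinning above can be sketched as a stable hash of the user_id onto a region list (the region names and function are illustrative assumptions, not part of the original design):

```python
import hashlib

# Illustrative region list -- an assumption for this sketch.
REGIONS = ["us-east-1", "eu-west-1", "ap-southeast-1"]

def home_region(user_id: str) -> str:
    """Pin a user to a home region by hashing their user_id.

    Uses a stable cryptographic hash (not Python's process-randomized
    hash()) so the mapping is identical across processes and deploys.
    """
    digest = hashlib.sha256(user_id.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(REGIONS)
    return REGIONS[index]
```

Writes for a user always route to `home_region(user_id)`; reads can be served from any region. Note that plain modulo remaps many users if a region is added or removed — a consistent-hashing ring would limit that churn.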

Data Replication

Primary Region (us-east-1):
  PostgreSQL Primary → Kafka (CDC via Debezium)
                     → Kafka MirrorMaker 2 → Secondary Region Kafka
                                           → PostgreSQL Read Replica (eu-west-1)

Secondary Region (eu-west-1):
  Read Replica: serves reads with ~100ms replication lag
  Promoted to primary on failover

CDC (Change Data Capture) via Debezium captures every DB change as a Kafka event. Replication lag is typically <1 second at normal write volume. Monitoring: track a replication_lag metric and alert if it exceeds 30 seconds.
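PostgreSQL replication lag can be measured in bytes by diffing WAL positions (LSNs) — in production you would compare `pg_current_wal_lsn()` on the primary against `pg_last_wal_receive_lsn()` on the replica; this sketch only shows the LSN arithmetic, with the 30-second-analogous byte threshold as an illustrative assumption:

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN like '0/16B3748' to an absolute byte offset.

    The LSN format is 'high/low': two hex numbers, the high part shifted
    left 32 bits.
    """
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def replication_lag_bytes(primary_lsn: str, replica_lsn: str) -> int:
    """Bytes of WAL the replica has not yet received from the primary."""
    return lsn_to_bytes(primary_lsn) - lsn_to_bytes(replica_lsn)

# Illustrative alert threshold in bytes (assumption, not from the design).
LAG_ALERT_BYTES = 64 * 1024 * 1024

def lag_alert(primary_lsn: str, replica_lsn: str) -> bool:
    return replication_lag_bytes(primary_lsn, replica_lsn) > LAG_ALERT_BYTES
```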

Global Load Balancing

Route traffic to the nearest healthy region: (1) GeoDNS: DNS returns different IPs based on the client’s geographic location. TTL=60s — failover takes up to 60s. (2) Anycast routing: announce the same IP from multiple regions; BGP routes to the nearest. Used by Cloudflare. (3) Global load balancer (AWS Global Accelerator, GCP Global LB): layer-4/7 routing with health checks; failover in <30 seconds. Production recommendation: use a global LB with health checks for API traffic; GeoDNS for static assets via CDN.
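The health-checked routing decision above reduces to: walk the client's proximity-ordered region list and pick the first healthy one. A minimal sketch, assuming health state is supplied by the load balancer's probes (the function name and inputs are illustrative):

```python
def pick_region(preferred_order: list[str], healthy: set[str]) -> str:
    """Return the nearest healthy region.

    preferred_order: regions sorted by proximity to the client.
    healthy: regions currently passing health checks.
    """
    for region in preferred_order:
        if region in healthy:
            return region
    raise RuntimeError("no healthy region available")
```

When the nearest region fails its health checks, the next one in proximity order absorbs its traffic automatically — no DNS TTL to wait out.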

Failover Procedure

  1. Health checks detect primary region failure (3 consecutive failures over 30s)
  2. Global LB stops routing new traffic to the failed region
  3. Check the replica's received WAL position (pg_last_wal_receive_lsn() in PostgreSQL) to confirm replication lag is within the acceptable threshold
  4. Promote the read replica in the secondary region to primary (pg_promote() in PostgreSQL)
  5. Update application config to point writes to the new primary
  6. Alert the on-call team; begin RCA
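The "3 consecutive failures" detection in step 1 can be sketched as a small counter that trips only after an unbroken run of failed probes and resets on any success (class and threshold names are illustrative):

```python
FAILURE_THRESHOLD = 3  # consecutive failed probes before declaring the region down

class HealthChecker:
    """Tracks consecutive health-check failures for one region."""

    def __init__(self) -> None:
        self.consecutive_failures = 0

    def record(self, ok: bool) -> bool:
        """Record one probe result; return True when the region is declared failed."""
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1
        return self.consecutive_failures >= FAILURE_THRESHOLD
```

Requiring consecutive failures (rather than any single failure) avoids triggering a multi-minute failover on one dropped probe.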

RTO: 1-5 minutes. RPO (Recovery Point Objective): data up to the replication lag at the moment of failure — typically <1 second, worst case ~30 seconds (the monitoring alert threshold).

Conflict Resolution for Active-Active

When two regions accept writes for the same record concurrently: (1) Last-Write-Wins (LWW): compare timestamps; most recent update wins. Risk: clock skew between regions can cause earlier writes to incorrectly win. Use hybrid logical clocks (HLC) instead of wall clocks. (2) Application-level merging: for counters, use CRDTs (Conflict-free Replicated Data Types) — a CRDT counter merges by taking the max of each node’s value. (3) Conflict detection + manual resolution: detect conflicts (same record modified in two regions during the same window), store both versions, application resolves on next read. Used by Amazon’s shopping cart (Dynamo paper).
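The G-Counter merge rule described in (2) — each region owns a slot, and merging takes the per-region max — fits in a few lines. A minimal sketch (class name illustrative):

```python
class GCounter:
    """Grow-only CRDT counter: one slot per region, merge by per-slot max."""

    def __init__(self) -> None:
        self.counts: dict[str, int] = {}

    def increment(self, region: str, n: int = 1) -> None:
        # Each region only ever increments its own slot.
        self.counts[region] = self.counts.get(region, 0) + n

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        # Taking the max per region is commutative, associative, and
        # idempotent, so replicas converge regardless of merge order.
        for region, n in other.counts.items():
            self.counts[region] = max(self.counts.get(region, 0), n)
```

Concurrent increments in different regions never conflict: each touches only its own slot, and repeated or out-of-order merges always converge to the same total.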

Key Design Decisions

  • Active-passive for most services — simpler consistency, acceptable 1-5min failover
  • Active-active only for user-specific data where region affinity eliminates most conflicts
  • CDC via Kafka for cross-region replication — replayable, auditable, decoupled
  • Global LB with health checks — faster failover than GeoDNS alone
  • Monitor replication lag continuously — stale replica is worse than no replica


