Multi-Region Failover System: Low-Level Design
A multi-region failover system routes traffic away from a failing region to a healthy standby, minimizing downtime for customers when an entire AWS region or data center becomes unavailable. The design covers health monitoring across regions, DNS-based failover, database replication state at failover time, and the automated vs. manual promotion decision. The central tension is between fast automatic failover (lower downtime) and safe failover (lower risk of split-brain or premature promotion).
Core Data Model
CREATE TABLE Region (
region_id SERIAL PRIMARY KEY,
region_key VARCHAR(30) UNIQUE NOT NULL, -- 'us-east-1', 'eu-west-1', 'ap-southeast-1'
role VARCHAR(20) NOT NULL, -- primary, standby, disabled
priority SMALLINT NOT NULL DEFAULT 0, -- failover order: higher = preferred standby
dns_weight SMALLINT NOT NULL DEFAULT 100,
health_endpoint VARCHAR(500) NOT NULL, -- URL for health checks
last_check_at TIMESTAMPTZ,
is_healthy BOOLEAN NOT NULL DEFAULT TRUE,
db_lag_seconds NUMERIC(8,2), -- replication lag from primary
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE TABLE FailoverEvent (
event_id BIGSERIAL PRIMARY KEY,
from_region VARCHAR(30) NOT NULL,
to_region VARCHAR(30) NOT NULL,
trigger VARCHAR(30) NOT NULL, -- automatic, manual, scheduled_test
triggered_by BIGINT, -- user_id for manual; NULL for automatic
status VARCHAR(20) NOT NULL DEFAULT 'initiated',
-- initiated, promoting_db, updating_dns, verifying, complete, failed
db_lag_at_failover NUMERIC(8,2),
started_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
completed_at TIMESTAMPTZ,
dns_updated_at TIMESTAMPTZ,
notes TEXT
);
CREATE TABLE HealthCheckResult (
check_id BIGSERIAL,
region_key VARCHAR(30) NOT NULL,
is_healthy BOOLEAN NOT NULL,
latency_ms INT,
error_message TEXT,
checked_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
PRIMARY KEY (check_id, checked_at) -- Postgres requires the partition key in the PK of a partitioned table
) PARTITION BY RANGE (checked_at);
CREATE INDEX ON HealthCheckResult(region_key, checked_at DESC);
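Because HealthCheckResult is range-partitioned, partitions must exist before rows arrive. A minimal sketch of a partition-maintenance helper (the function name is ours; in practice a cron job or pg_partman would run this ahead of each month boundary):

```python
from datetime import date

def monthly_partition_ddl(year: int, month: int) -> str:
    """Generate DDL for one monthly partition of HealthCheckResult.

    The partition covers [first day of month, first day of next month).
    """
    start = date(year, month, 1)
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    name = f"healthcheckresult_{start:%Y_%m}"
    return (
        f"CREATE TABLE IF NOT EXISTS {name} PARTITION OF HealthCheckResult "
        f"FOR VALUES FROM ('{start}') TO ('{end}');"
    )
```

The generated statement would be passed to db.execute by a scheduled job; old partitions can simply be dropped to expire health-check history cheaply.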
Health Monitoring
import time
import requests
HEALTH_CHECK_INTERVAL_SECONDS = 30
CONSECUTIVE_FAILURES_THRESHOLD = 3 # require 3 consecutive failures before marking unhealthy
AUTO_FAILOVER_DB_LAG_LIMIT = 30 # don't auto-failover if standby lag > 30s
def check_region_health(region_key: str) -> bool:
region = db.fetchone("SELECT * FROM Region WHERE region_key=%s", (region_key,))
if not region:
return False
try:
start = time.time()
resp = requests.get(region['health_endpoint'], timeout=10)
latency_ms = int((time.time() - start) * 1000)
is_healthy = resp.status_code == 200
error_msg = None if is_healthy else f"HTTP {resp.status_code}"
except Exception as e:
is_healthy = False
latency_ms = None
error_msg = str(e)[:200]
db.execute("""
INSERT INTO HealthCheckResult (region_key, is_healthy, latency_ms, error_message)
VALUES (%s,%s,%s,%s)
""", (region_key, is_healthy, latency_ms, error_msg))
# Update region status — require 3 consecutive failures to mark unhealthy
recent_checks = db.fetchall("""
SELECT is_healthy FROM HealthCheckResult
WHERE region_key=%s ORDER BY checked_at DESC LIMIT 3
""", (region_key,))
consecutive_failures = sum(1 for c in recent_checks if not c['is_healthy'])
    region_healthy = consecutive_failures < CONSECUTIVE_FAILURES_THRESHOLD
    db.execute("""
        UPDATE Region SET is_healthy=%s, last_check_at=NOW() WHERE region_key=%s
    """, (region_healthy, region_key))
    return region_healthy

def get_db_replication_lag(standby_region: str) -> float:
"""Query the standby database for its replication lag from the primary."""
standby_conn = _get_db_conn(standby_region)
row = standby_conn.fetchone("""
SELECT EXTRACT(EPOCH FROM (NOW() - pg_last_xact_replay_timestamp()))
AS lag_seconds
""")
# NULL means no WAL has been replayed yet; treat unknown lag as effectively infinite
return float(row['lag_seconds']) if row and row['lag_seconds'] is not None else 999.0
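Nothing above ties the checks to the 30-second interval. A sketch of the driver loop, with the check and decision functions injected so the per-cycle logic stays testable without a live database (function names here are ours, not part of the original design):

```python
import time

def monitoring_cycle(regions, check_fn, evaluate_fn):
    """One monitoring pass: health-check every region, then ask the
    decision engine what to do. In the real system check_fn would be
    check_region_health and evaluate_fn would be evaluate_failover_need."""
    for region_key in regions:
        check_fn(region_key)
    return evaluate_fn()

def run_monitor(regions, check_fn, evaluate_fn, interval_s=30):
    """Run forever at the health-check interval; surface any non-'none'
    decision to alerting / the failover executor."""
    while True:
        decision = monitoring_cycle(regions, check_fn, evaluate_fn)
        if decision['action'] != 'none':
            print(decision)  # real system: page on-call or call execute_failover
        time.sleep(interval_s)
```

In production this loop itself should run in at least two locations so the monitor is not a single point of failure.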
Automated Failover Decision Engine
def evaluate_failover_need() -> dict:
"""
Run every health check cycle. Returns action: 'none', 'alert', or 'failover'.
Conservative: only auto-failover if primary is clearly down AND standby is ready.
"""
primary = db.fetchone(
"SELECT * FROM Region WHERE role='primary'"
)
if not primary:
return {'action': 'none', 'reason': 'no primary configured'}
if primary['is_healthy']:
return {'action': 'none', 'reason': 'primary healthy'}
# Primary is unhealthy — find best standby
standbys = db.fetchall("""
SELECT * FROM Region WHERE role='standby' AND is_healthy=TRUE
ORDER BY priority DESC
""")
if not standbys:
return {'action': 'alert',
'reason': 'primary down and no healthy standbys — manual intervention required'}
best_standby = standbys[0]
# Check replication lag — don't auto-failover if standby is too far behind
lag = get_db_replication_lag(best_standby['region_key'])
db.execute("""
UPDATE Region SET db_lag_seconds=%s WHERE region_key=%s
""", (lag, best_standby['region_key']))
if lag > AUTO_FAILOVER_DB_LAG_LIMIT:
return {
'action': 'alert',
'reason': f"Standby lag {lag:.1f}s exceeds limit {AUTO_FAILOVER_DB_LAG_LIMIT}s — manual required",
'lag_seconds': lag,
}
return {
'action': 'failover',
'from_region': primary['region_key'],
'to_region': best_standby['region_key'],
'lag_seconds': lag,
}
def execute_failover(from_region: str, to_region: str, trigger: str = 'automatic',
triggered_by: int = None) -> int:
"""
Execute the failover sequence:
1. Promote standby DB to primary
2. Update DNS weights
3. Verify traffic flowing to new region
"""
lag = get_db_replication_lag(to_region)
event_id = db.fetchone("""
INSERT INTO FailoverEvent (from_region, to_region, trigger, triggered_by, db_lag_at_failover)
VALUES (%s,%s,%s,%s,%s) RETURNING event_id
""", (from_region, to_region, trigger, triggered_by, lag))['event_id']
try:
# Step 1: Promote standby database
db.execute("""
UPDATE FailoverEvent SET status='promoting_db' WHERE event_id=%s
""", (event_id,))
_promote_standby_db(to_region)
# Step 2: Update region roles
db.execute("""
UPDATE Region SET role='disabled', updated_at=NOW() WHERE region_key=%s;
UPDATE Region SET role='primary', updated_at=NOW() WHERE region_key=%s;
""", (from_region, to_region))
# Step 3: Update DNS (shift all traffic to new primary)
db.execute("""
UPDATE FailoverEvent SET status='updating_dns', dns_updated_at=NOW()
WHERE event_id=%s
""", (event_id,))
_update_dns_weights(to_region, weight=100)
db.execute("""
UPDATE FailoverEvent SET status='complete', completed_at=NOW()
WHERE event_id=%s
""", (event_id,))
_send_failover_notification(from_region, to_region, lag)
except Exception as e:
db.execute("""
UPDATE FailoverEvent SET status='failed', notes=%s WHERE event_id=%s
""", (str(e)[:500], event_id))
raise
return event_id
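The status enum on FailoverEvent includes 'verifying', but the sequence above moves straight from DNS update to 'complete'. One way to fill that gap is a polling verification step; this is a sketch with the probe and sleep injected for testability (the function name and thresholds are illustrative, not from the original):

```python
import time

def verify_traffic(probe_fn, required_ok=3, timeout_s=120, poll_s=5,
                   sleep_fn=time.sleep):
    """Poll the new primary's health endpoint until we observe
    required_ok consecutive successes, or give up after timeout_s.
    probe_fn() returns True/False for one request against the new region."""
    deadline = time.monotonic() + timeout_s
    ok_streak = 0
    while time.monotonic() < deadline:
        ok_streak = ok_streak + 1 if probe_fn() else 0
        if ok_streak >= required_ok:
            return True
        sleep_fn(poll_s)
    return False
```

execute_failover could set status='verifying' before calling this and only mark the event 'complete' when it returns True, otherwise escalate to on-call.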
def _promote_standby_db(region_key: str):
"""
Tell the standby Postgres to become primary.
In Patroni: call the patroni REST API. In AWS RDS: call promote_read_replica().
"""
pass # Implementation depends on HA manager
def _update_dns_weights(primary_region: str, weight: int = 100):
"""
Update DNS (Route53, Cloudflare) to route traffic to new primary.
TTL must be low (60s) during failover preparation.
"""
pass # Call DNS provider API
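For Route53, one possible shape for _update_dns_weights is a weighted-record UPSERT batch. A sketch under assumptions: the record name and region-to-IP mapping are illustrative, and the batch is built as a pure function so it can be inspected without AWS credentials:

```python
def build_weight_changes(record_name, region_ips, primary_region,
                         primary_weight=100, ttl=60):
    """Build a Route53 ChangeBatch that shifts all traffic to the new
    primary: its weighted record gets primary_weight, all others get 0.
    region_ips maps region_key -> endpoint IP (illustrative)."""
    changes = []
    for region_key, ip in region_ips.items():
        weight = primary_weight if region_key == primary_region else 0
        changes.append({
            'Action': 'UPSERT',
            'ResourceRecordSet': {
                'Name': record_name,
                'Type': 'A',
                'SetIdentifier': region_key,  # one weighted record per region
                'Weight': weight,
                'TTL': ttl,                   # keep low so the cutover propagates fast
                'ResourceRecords': [{'Value': ip}],
            },
        })
    return {'Comment': 'failover traffic shift', 'Changes': changes}

# The batch would then be submitted via boto3 (exact call shape per AWS docs):
# boto3.client('route53').change_resource_record_sets(
#     HostedZoneId=zone_id, ChangeBatch=batch)
```

Setting the losing region's weight to 0 rather than deleting its record makes failback a weight flip instead of a record re-creation.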
Key Design Decisions
- 3 consecutive failures before marking unhealthy: a single failed health check might be a transient network glitch, a brief GC pause, or a health check timeout. Three consecutive failures spanning 90 seconds (3 × 30s) provide confidence that the region is genuinely down before triggering alerting or failover. Too few failures = false positives; too many = slow detection.
- Replication lag gate on auto-failover: if the standby has 60 seconds of lag, promoting it means losing 60 seconds of writes — accepted payments, placed orders, sent messages that will disappear. Auto-failover is blocked when lag exceeds 30 seconds; instead, alert for human judgment. Humans can decide whether to accept the data loss or wait for the lag to reduce. For synchronous replication (zero lag), auto-failover is always safe but primary write latency increases.
- Low DNS TTL during normal operations: set DNS TTL to 60 seconds (not 300 or 3600) for the primary endpoint. This means DNS changes propagate to clients within 60 seconds after a failover. The cost: higher DNS query rate during normal operations. Pre-set low TTL — you cannot lower it instantly during an incident; clients have already cached the 300s TTL.
- Failover event audit trail: FailoverEvent records every state transition with timestamps. Post-incident review requires knowing exactly when the failover was initiated, how long DB promotion took, when DNS was updated, and what the replication lag was. This data drives RTO (recovery time objective) improvements.
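The replication-lag gate can be softened by briefly waiting for the standby to catch up before falling back to an alert. A minimal sketch with the lag probe and sleep injected for testability (function name and thresholds are illustrative):

```python
import time

def wait_for_lag(lag_fn, target_s=5.0, max_wait_s=60.0, poll_s=5.0,
                 sleep_fn=time.sleep, clock=time.monotonic):
    """Poll lag_fn() until lag drops below target_s or max_wait_s elapses.
    Returns the final observed lag either way; the caller decides whether
    that residual lag is an acceptable data-loss window."""
    deadline = clock() + max_wait_s
    lag = lag_fn()
    while lag >= target_s and clock() < deadline:
        sleep_fn(poll_s)
        lag = lag_fn()
    return lag
```

In evaluate_failover_need, lag_fn would be a closure over get_db_replication_lag for the chosen standby; if the returned lag is still above the limit, the engine alerts for human judgment as before.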
Frequently Asked Questions

What is RTO and RPO, and how do they shape the failover system design?
RTO (Recovery Time Objective) is the maximum acceptable downtime between failure detection and service restoration. An RTO of 5 minutes means the failover system must detect the failure, promote a standby, update DNS, and confirm traffic is flowing to the new region, all within 5 minutes. This drives how fast health checks run (every 30s with 3 consecutive failures = 90s detection) and whether promotion is automated (faster) or manual (safer but slower). RPO (Recovery Point Objective) is the maximum acceptable data loss: how many seconds or minutes of committed writes can be lost when the primary fails. An RPO of 0 requires synchronous replication (every write is committed on both primary and standby before acknowledging). An RPO of 60 seconds allows asynchronous replication with up to 60s lag. Synchronous replication guarantees zero data loss but increases write latency by one network round-trip to the standby region (typically 20–50ms inter-region). Choose: 0s RPO means synchronous replication plus the write latency penalty; >0s RPO means async replication plus a potential data loss window.

How do you prevent split-brain when the primary region is temporarily unreachable but not actually down?
Split-brain: two regions both believe they are primary and accept writes simultaneously. After 30 minutes the "failed" primary recovers and the two regions have diverged write histories, unresolvable without manual data reconciliation. Prevention: (1) quorum-based fencing: use an odd number of regions (3 or 5) and require a majority to elect a new primary. If us-east-1 is unreachable by eu-west-1 but can still reach ap-southeast-1, it retains quorum and remains primary; eu-west-1 cannot promote alone. (2) STONITH (Shoot The Other Node In The Head): before promoting the standby, forcibly fence the old primary by shutting down its network interface (via the AWS EC2 API or cloud provider control plane) so it cannot accept writes even if it recovers. (3) Conservative promotion threshold: require 3+ consecutive health check failures (90s) from multiple monitoring locations before initiating failover, not just from one location that might itself be in a network partition.

How do you minimize data loss when failing over with replication lag?
If failover is initiated while the standby has 30 seconds of replication lag, those 30 seconds of writes (payments, orders, user actions) are permanently lost: they were in the primary's WAL but never replicated. Minimization strategies: (1) synchronous replication: zero lag, zero data loss; the trade-off is that every write waits for the standby to confirm, adding 20–50ms write latency. Use synchronous_commit=remote_write in Postgres, which is less strict than synchronous_commit=on but significantly faster while still protecting against data loss on a primary crash. (2) Wait for the lag to decrease: before promoting, check lag every 5 seconds and wait up to 60 seconds for it to drop below 5 seconds. If lag doesn't improve (the standby is also degraded), promote with the current lag and accept the data loss. (3) Application-level replay: before promoting, read any unreplicated WAL from the primary (if it is still reachable but degraded) and apply it to the standby manually. This is complex but reduces effective data loss to near zero even with async replication.

How does a multi-region active-active setup differ from active-passive failover?
Active-passive (this design): one primary region handles all writes and reads; the standby handles no production traffic until failover. Simple and strongly consistent, but the standby capacity is wasted during normal operation. Active-active: both regions handle live traffic (reads and writes), with users routed to their nearest region. Global writes must be reconciled across regions, typically via conflict-free replicated data types (CRDTs) or last-write-wins with vector clocks, and this is much more complex to implement. Use cases: (1) active-active for reads only (replicate writes from primary to all standbys; serve reads locally), which reduces read latency globally without write complexity; (2) active-active for writes with geographic partitioning (EU users' data lives in eu-west-1, US users' in us-east-1), so there are no cross-region write conflicts because data is partitioned by region; (3) full active-active with CRDTs (Riak, Cassandra) for counters and sets that merge deterministically. For most SaaS products with strong consistency requirements, active-passive with fast automated failover is the right choice.

How do you test your failover system without causing a real outage?
Failover systems that are never tested fail when you need them. Testing approaches: (1) scheduled failover drills: once per quarter, execute a real failover to the standby region during a low-traffic window; measure actual RTO (time from "simulate failure" to "traffic flowing to standby") and document gaps between expected and actual RTO. (2) Chaos engineering: randomly terminate primary-region services (Netflix Chaos Monkey style) in pre-production environments to verify the failover triggers correctly. (3) DNS cutover test without DB failover: update DNS to point to the standby region but keep the standby pointing at the primary database; this tests DNS propagation time without risking data loss. (4) Read replica promotion test: periodically promote a read replica to primary in a non-production environment to practice the DB promotion procedure and measure how long it takes. The target: your entire team can execute a failover from memory within 5 minutes. If the runbook is complex, simplify until it isn't.
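The quorum rule for preventing split-brain can be made concrete as a tiny vote-counting helper; a minimal sketch (the function name is ours) where each monitoring region reports whether it can reach the primary:

```python
def may_promote(primary_unreachable_votes: dict) -> bool:
    """Each monitoring region votes on whether the primary is unreachable
    (region_key -> True if unreachable from that vantage point).
    Promotion is allowed only when a strict majority agree, so a single
    partitioned observer cannot trigger a split-brain by itself."""
    votes = list(primary_unreachable_votes.values())
    if not votes:
        return False
    return sum(votes) > len(votes) / 2
```

With three monitoring regions, one isolated observer (1 of 3 votes) can never force a promotion; it takes agreement from at least two vantage points.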