Multi-Region Failover System Low-Level Design: Health Monitoring, Lag-Gated Promotion, and DNS-Based Traffic Cutover

A multi-region failover system routes traffic away from a failing region to a healthy standby, minimizing downtime for customers when an entire AWS region or data center becomes unavailable. The design covers health monitoring across regions, DNS-based failover, database replication state at failover time, and the automated vs. manual promotion decision. The central tension is between fast automatic failover (lower downtime) and safe failover (lower risk of split-brain or premature promotion).

Core Data Model

CREATE TABLE Region (
    region_id      SERIAL PRIMARY KEY,
    region_key     VARCHAR(30) UNIQUE NOT NULL,  -- 'us-east-1', 'eu-west-1', 'ap-southeast-1'
    role           VARCHAR(20) NOT NULL,         -- primary, standby, disabled
    priority       SMALLINT NOT NULL DEFAULT 0, -- failover order: higher = preferred standby
    dns_weight     SMALLINT NOT NULL DEFAULT 100,
    health_endpoint VARCHAR(500) NOT NULL,       -- URL for health checks
    last_check_at  TIMESTAMPTZ,
    is_healthy     BOOLEAN NOT NULL DEFAULT TRUE,
    db_lag_seconds NUMERIC(8,2),                -- replication lag from primary
    updated_at     TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

CREATE TABLE FailoverEvent (
    event_id       BIGSERIAL PRIMARY KEY,
    from_region    VARCHAR(30) NOT NULL,
    to_region      VARCHAR(30) NOT NULL,
    trigger        VARCHAR(30) NOT NULL,  -- automatic, manual, scheduled_test
    triggered_by   BIGINT,               -- user_id for manual; NULL for automatic
    status         VARCHAR(20) NOT NULL DEFAULT 'initiated',
    -- initiated, promoting_db, updating_dns, verifying, complete, failed
    db_lag_at_failover NUMERIC(8,2),
    started_at     TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    completed_at   TIMESTAMPTZ,
    dns_updated_at TIMESTAMPTZ,
    notes          TEXT
);

CREATE TABLE HealthCheckResult (
    check_id       BIGSERIAL,
    region_key     VARCHAR(30) NOT NULL,
    is_healthy     BOOLEAN NOT NULL,
    latency_ms     INT,
    error_message  TEXT,
    checked_at     TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    PRIMARY KEY (check_id, checked_at)  -- a partitioned table's primary key must include the partition key
) PARTITION BY RANGE (checked_at);

CREATE INDEX ON HealthCheckResult(region_key, checked_at DESC);
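
A range-partitioned table only accepts rows once a partition covering their checked_at value exists, so partitions have to be created ahead of time. The following is a minimal sketch of a helper that generates the DDL, assuming monthly partitions; the function name and the healthcheckresult_YYYY_MM naming scheme are illustrative and not part of the design above.

```python
from datetime import date

def monthly_partition_ddl(month_start: date, parent: str = "HealthCheckResult") -> str:
    """Return the CREATE TABLE statement for the partition covering that month."""
    first = month_start.replace(day=1)
    # First day of the following month, used as the exclusive upper bound.
    following = date(first.year + (first.month == 12), first.month % 12 + 1, 1)
    name = f"{parent.lower()}_{first:%Y_%m}"  # hypothetical naming scheme
    return (
        f"CREATE TABLE IF NOT EXISTS {name} PARTITION OF {parent} "
        f"FOR VALUES FROM ('{first.isoformat()}') TO ('{following.isoformat()}');"
    )
```

A scheduled job could run this for the next month or two and execute the result, so that health check inserts never hit a missing partition.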

Health Monitoring

import time

import requests

HEALTH_CHECK_INTERVAL_SECONDS = 30
CONSECUTIVE_FAILURES_THRESHOLD = 3   # require 3 consecutive failures before marking unhealthy
AUTO_FAILOVER_DB_LAG_LIMIT = 30      # don't auto-failover if standby lag > 30s

def check_region_health(region_key: str) -> bool:
    region = db.fetchone("SELECT * FROM Region WHERE region_key=%s", (region_key,))
    if not region:
        return False

    try:
        start = time.time()
        resp = requests.get(region['health_endpoint'], timeout=10)
        latency_ms = int((time.time() - start) * 1000)
        is_healthy = resp.status_code == 200
        error_msg = None if is_healthy else f"HTTP {resp.status_code}"
    except Exception as e:
        is_healthy = False
        latency_ms = None
        error_msg = str(e)[:200]

    db.execute("""
        INSERT INTO HealthCheckResult (region_key, is_healthy, latency_ms, error_message)
        VALUES (%s,%s,%s,%s)
    """, (region_key, is_healthy, latency_ms, error_msg))

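    # Update region status: mark unhealthy only after 3 consecutive failures
    recent_checks = db.fetchall("""
        SELECT is_healthy FROM HealthCheckResult
        WHERE region_key=%s ORDER BY checked_at DESC LIMIT 3
    """, (region_key,))

    consecutive_failures = sum(1 for c in recent_checks if not c['is_healthy'])
    region_healthy = consecutive_failures < CONSECUTIVE_FAILURES_THRESHOLD

    db.execute("""
        UPDATE Region SET is_healthy=%s, last_check_at=NOW(), updated_at=NOW()
        WHERE region_key=%s
    """, (region_healthy, region_key))
    return region_healthy

def get_db_replication_lag(standby_region: str) -> float: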
    """Query the standby database for its replication lag from the primary."""
    standby_conn = _get_db_conn(standby_region)
    row = standby_conn.fetchone("""
        SELECT EXTRACT(EPOCH FROM (NOW() - pg_last_xact_replay_timestamp()))
        AS lag_seconds
    """)
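    # Only a missing value means "unknown"; a lag of exactly 0.0 is valid and must be returned as-is
    if row is None or row['lag_seconds'] is None:
        return 999.0  # sentinel: lag unknown, so the lag gate blocks auto-failover
    return float(row['lag_seconds'])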
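
The consecutive-failure rule used in check_region_health can also be written as a pure function, which makes it easy to test without a database. This is a sketch under the assumption that results are passed newest-first; the function name is hypothetical.

```python
from typing import Sequence

CONSECUTIVE_FAILURES_THRESHOLD = 3

def is_region_healthy(recent_results: Sequence[bool],
                      threshold: int = CONSECUTIVE_FAILURES_THRESHOLD) -> bool:
    """recent_results is ordered newest-first; True means that check passed."""
    latest = recent_results[:threshold]
    if len(latest) < threshold:
        return True  # not enough history yet to declare an outage
    # Unhealthy only when every one of the latest `threshold` checks failed.
    return any(latest)
```

A single success among the latest three checks keeps the region marked healthy, which is what filters out transient glitches.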

Automated Failover Decision Engine

def evaluate_failover_need() -> dict:
    """
    Run every health check cycle. Returns action: 'none', 'alert', or 'failover'.
    Conservative: only auto-failover if primary is clearly down AND standby is ready.
    """
    primary = db.fetchone(
        "SELECT * FROM Region WHERE role='primary'"
    )
    if not primary:
        return {'action': 'none', 'reason': 'no primary configured'}

    if primary['is_healthy']:
        return {'action': 'none', 'reason': 'primary healthy'}

    # Primary is unhealthy — find best standby
    standbys = db.fetchall("""
        SELECT * FROM Region WHERE role='standby' AND is_healthy=TRUE
        ORDER BY priority DESC
    """)

    if not standbys:
        return {'action': 'alert',
                'reason': 'primary down and no healthy standbys — manual intervention required'}

    best_standby = standbys[0]

    # Check replication lag — don't auto-failover if standby is too far behind
    lag = get_db_replication_lag(best_standby['region_key'])
    db.execute("""
        UPDATE Region SET db_lag_seconds=%s WHERE region_key=%s
    """, (lag, best_standby['region_key']))

    if lag > AUTO_FAILOVER_DB_LAG_LIMIT:
        return {
            'action': 'alert',
            'reason': f"Standby lag {lag:.1f}s exceeds limit {AUTO_FAILOVER_DB_LAG_LIMIT}s — manual required",
            'lag_seconds': lag,
        }

    return {
        'action': 'failover',
        'from_region': primary['region_key'],
        'to_region': best_standby['region_key'],
        'lag_seconds': lag,
    }
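
The decision returned by evaluate_failover_need still has to be acted on. One possible way to wire it to execute_failover is sketched below; run_cycle and the alert callable are hypothetical names, and the three dependencies are passed in as arguments so the wiring can be exercised without a database.

```python
from typing import Callable, Optional

def run_cycle(evaluate: Callable[[], dict],
              execute: Callable[..., int],
              alert: Callable[[str], None]) -> Optional[int]:
    """Run one evaluation cycle; return the failover event id if a failover ran."""
    decision = evaluate()
    if decision['action'] == 'failover':
        return execute(decision['from_region'], decision['to_region'],
                       trigger='automatic')
    if decision['action'] == 'alert':
        alert(decision['reason'])  # page a human; no automatic action is taken
    return None
```

In production this would be called once per health check interval, for example as run_cycle(evaluate_failover_need, execute_failover, page_on_call), where page_on_call stands in for whatever alerting hook is available.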

def execute_failover(from_region: str, to_region: str, trigger: str = 'automatic',
                     triggered_by: int = None) -> int:
    """
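    Execute the failover sequence:
    1. Promote standby DB to primary
    2. Swap region roles (old primary disabled, standby becomes primary)
    3. Update DNS weights
    Traffic verification (the 'verifying' status) is not performed in this function.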
    """
    lag = get_db_replication_lag(to_region)
    event_id = db.fetchone("""
        INSERT INTO FailoverEvent (from_region, to_region, trigger, triggered_by, db_lag_at_failover)
        VALUES (%s,%s,%s,%s,%s) RETURNING event_id
    """, (from_region, to_region, trigger, triggered_by, lag))['event_id']

    try:
        # Step 1: Promote standby database
        db.execute("""
            UPDATE FailoverEvent SET status='promoting_db' WHERE event_id=%s
        """, (event_id,))
        _promote_standby_db(to_region)

        # Step 2: Update region roles
        db.execute("""
            UPDATE Region SET role='disabled', updated_at=NOW() WHERE region_key=%s;
            UPDATE Region SET role='primary', updated_at=NOW() WHERE region_key=%s;
        """, (from_region, to_region))

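        # Step 3: Update DNS (shift all traffic to new primary)
        db.execute("""
            UPDATE FailoverEvent SET status='updating_dns' WHERE event_id=%s
        """, (event_id,))
        _update_dns_weights(to_region, weight=100)
        # Record the timestamp only after the DNS change has actually been made
        db.execute("""
            UPDATE FailoverEvent SET dns_updated_at=NOW() WHERE event_id=%s
        """, (event_id,))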

        db.execute("""
            UPDATE FailoverEvent SET status='complete', completed_at=NOW()
            WHERE event_id=%s
        """, (event_id,))
        _send_failover_notification(from_region, to_region, lag)

    except Exception as e:
        db.execute("""
            UPDATE FailoverEvent SET status='failed', notes=%s WHERE event_id=%s
        """, (str(e)[:500], event_id))
        raise

    return event_id

def _promote_standby_db(region_key: str):
    """
    Tell the standby Postgres to become primary.
    In Patroni: call the patroni REST API. In AWS RDS: call promote_read_replica().
    """
    pass  # Implementation depends on HA manager
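
For the AWS RDS case mentioned in the docstring, the promotion call could look roughly like the sketch below. It assumes a boto3 RDS client is passed in and that the caller already knows the replica's instance identifier; the function name and the mapping from region to instance identifier are not specified by this design and are illustrative only.

```python
def promote_rds_replica(rds_client, replica_instance_id: str) -> str:
    """Ask RDS to promote a read replica and return the status RDS reports.

    rds_client is expected to be a boto3 RDS client created for the standby
    region, e.g. boto3.client('rds', region_name='eu-west-1').
    """
    response = rds_client.promote_read_replica(
        DBInstanceIdentifier=replica_instance_id
    )
    return response['DBInstance']['DBInstanceStatus']
```

Promotion is asynchronous: the call returns before the instance has finished becoming a standalone primary, so a real implementation would need to wait for the instance to become available before shifting traffic.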

def _update_dns_weights(primary_region: str, weight: int = 100):
    """
    Update DNS (Route53, Cloudflare) to route traffic to new primary.
    The record TTL must already be low (60s) before an incident; it cannot be lowered retroactively.
    """
    pass  # Call DNS provider API
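
With Route53 weighted records, the cutover amounts to setting the new primary's weight to 100 and every other region's weight to 0. The sketch below only builds the change batch; the record name, the region-to-target mapping, and the function name are assumptions made for illustration. The resulting dictionary is in the shape expected by the ChangeBatch parameter of boto3's change_resource_record_sets.

```python
def build_weight_change_batch(record_name: str, region_targets: dict,
                              primary_region: str, ttl: int = 60) -> dict:
    """Build a Route53 ChangeBatch sending all weighted traffic to primary_region.

    region_targets maps region_key -> DNS target (hypothetical hostnames).
    """
    if primary_region not in region_targets:
        raise ValueError(f"unknown region: {primary_region}")
    changes = []
    for region_key, target in sorted(region_targets.items()):
        changes.append({
            'Action': 'UPSERT',
            'ResourceRecordSet': {
                'Name': record_name,
                'Type': 'CNAME',
                'SetIdentifier': region_key,  # distinguishes weighted records
                'Weight': 100 if region_key == primary_region else 0,
                'TTL': ttl,
                'ResourceRecords': [{'Value': target}],
            },
        })
    return {'Comment': f'failover to {primary_region}', 'Changes': changes}

# Applying it (requires AWS credentials and a real hosted zone id):
#   boto3.client('route53').change_resource_record_sets(
#       HostedZoneId=zone_id, ChangeBatch=batch)
```

Building the batch separately from sending it keeps the weight logic testable and lets an operator review the exact change during a manual failover.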

Key Design Decisions

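  • 3 consecutive failures before marking unhealthy: a single failed health check might be a transient network glitch, a brief GC pause, or a health check timeout. With checks every 30 seconds, the third consecutive failure arrives 60 seconds after the first, so an outage is detected roughly 60 to 100 seconds after it begins (allowing for the wait until the next check and the 10-second request timeout). That is long enough to be confident the region is genuinely down before triggering alerting or failover. Too few failures = false positives; too many = slow detection.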
  • Replication lag gate on auto-failover: if the standby has 60 seconds of lag, promoting it means losing 60 seconds of writes: accepted payments, placed orders, and sent messages that will disappear. Auto-failover is blocked when lag exceeds 30 seconds; the system alerts instead, and a human decides whether to accept the data loss or wait for the lag to shrink. With synchronous replication, acknowledged writes are already on the standby, so promotion loses no committed data, at the cost of higher write latency on the primary.
  • Low DNS TTL during normal operations: set DNS TTL to 60 seconds (not 300 or 3600) for the primary endpoint, so that resolvers honoring the TTL pick up a failover change within about 60 seconds. The cost is a higher DNS query rate during normal operations. The low TTL must be in place before an incident: lowering it mid-incident does not help clients and resolvers that have already cached the record under the old 300s TTL.
  • Failover event audit trail: FailoverEvent records every state transition with timestamps. Post-incident review requires knowing exactly when the failover was initiated, how long DB promotion took, when DNS was updated, and what the replication lag was. This data drives RTO (recovery time objective) improvements.
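
The timing constants above can be combined into a rough downtime budget. The sketch below is back-of-the-envelope arithmetic, assuming health checks start on a fixed 30-second schedule and that promotion time is measured separately (for example from past FailoverEvent rows); the function names are illustrative.

```python
def detection_window_seconds(interval: int = 30, threshold: int = 3,
                             timeout: int = 10) -> tuple:
    """Return (best, worst) seconds from outage start to 'unhealthy'."""
    # Best case: the outage begins just before a check, and checks fail instantly.
    best = interval * (threshold - 1)
    # Worst case: the outage begins just after a check, and the final failing
    # check runs the full request timeout before giving up.
    worst = interval * threshold + timeout
    return best, worst

def worst_case_downtime_seconds(promotion_seconds: int, dns_ttl: int = 60) -> int:
    """Worst-case detection, plus DB promotion, plus one full DNS TTL."""
    _, worst_detection = detection_window_seconds()
    return worst_detection + promotion_seconds + dns_ttl
```

With the defaults, detection takes between 60 and 100 seconds, and a promotion that takes 120 seconds gives a worst-case client-visible outage of 280 seconds under these assumptions.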
