What Is Active-Passive Architecture?
In an active-passive architecture, a single primary node handles all incoming traffic — reads and writes. One or more passive standbys continuously replicate from the primary but do not serve any application traffic. On primary failure, one standby is promoted to primary and begins accepting traffic. Active-passive trades the complexity of conflict resolution (required in active-active) for the simplicity of a single authoritative writer, at the cost of standby capacity sitting idle and a failover gap when the primary fails.
Standby Types
- Hot standby: fully synchronized with the primary in real time; ready to accept connections immediately on promotion; highest cost (identical hardware running idle).
- Warm standby: continuously replicating but requires a brief startup or configuration step before it can serve traffic (e.g., recovering from a partially applied WAL segment); trades some recovery speed for reduced cost.
- Cold standby: restored from periodic backups; no continuous replication; cheapest to operate but slowest to promote — recovery time can be hours depending on backup frequency and data volume.
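The trade-off between the three types is essentially the amount of work deferred until promotion time. As a rough illustration, that can be sketched as a lookup of promotion steps; the step names and durations below are assumptions chosen for illustration, not benchmarks:

```python
# Rough promotion-cost model for each standby type.
# All step names and durations are illustrative assumptions.
PROMOTION_STEPS = {
    "hot":  [("accept_connections", 0)],                       # already in sync
    "warm": [("finish_wal_replay", 30), ("accept_connections", 0)],
    "cold": [("restore_backup", 3600), ("replay_logs", 600),
             ("accept_connections", 0)],
}

def estimated_promotion_seconds(standby_type: str) -> int:
    """Sum the illustrative step durations for a standby type."""
    return sum(seconds for _, seconds in PROMOTION_STEPS[standby_type])
```

The cheaper the standby, the longer this sum, which is the core of the hot/warm/cold RTO trade-off discussed below.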
Replication
PostgreSQL uses streaming WAL replication: the standby connects to the primary's WAL sender process and continuously receives WAL segments as they are generated. The standby applies WAL in recovery mode, maintaining its own replay LSN. MySQL uses binlog replication: the replica connects to the primary's binlog dump thread and applies relay log events. Both expose lag metrics (replay lag in PG, Seconds_Behind_Master in MySQL) that are critical for estimating RPO at the moment of failover.
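For PostgreSQL, byte-level lag can be derived from two LSNs of the form `X/Y`, where `X` and `Y` are the hex high and low 32-bit words of a 64-bit WAL byte position. The helpers below mirror what the server-side `pg_wal_lsn_diff()` function computes; in practice you would read the two positions from `pg_current_wal_lsn()` on the primary and `pg_last_wal_replay_lsn()` on the standby:

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN like '16/B374D848' to an absolute
    byte offset: high word shifted left 32 bits, OR'd with the low word."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def replication_lag_bytes(primary_lsn: str, standby_replay_lsn: str) -> int:
    """Byte lag between the primary's current WAL write position and the
    standby's replay position (what pg_wal_lsn_diff() returns server-side)."""
    return lsn_to_bytes(primary_lsn) - lsn_to_bytes(standby_replay_lsn)
```

This byte lag, divided by the observed WAL generation rate, gives a rough time-based RPO estimate comparable to MySQL's Seconds_Behind_Master.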
Health Monitoring
A health monitor (HAProxy, Patroni, MHA, or a custom agent) continuously checks primary liveness via TCP connection attempts, SQL ping queries (SELECT 1), or heartbeat table writes. The virtual IP (VIP) or DNS record for the cluster points to the primary. The monitor checks every N seconds (typically 1-5 seconds) and declares failure after M consecutive missed checks to avoid flapping on transient network glitches. Dead-man's-switch patterns invert this: the primary must actively write a heartbeat; silence implies failure.
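The dead-man's-switch pattern can be sketched in a few lines. This in-process version is a deliberate simplification: a real deployment would persist heartbeats in shared storage (a heartbeat table, or an etcd key with a TTL) so that the monitor survives independently of the primary:

```python
import time

class DeadMansSwitch:
    """Inverted health check: the primary must record a heartbeat at least
    every `timeout` seconds; silence implies failure. Minimal sketch."""

    def __init__(self, timeout: float):
        self.timeout = timeout
        self.last_beat = time.monotonic()

    def beat(self) -> None:
        """Called by the primary on each successful heartbeat write."""
        self.last_beat = time.monotonic()

    def primary_presumed_dead(self) -> bool:
        """True once the heartbeat has been silent longer than the timeout."""
        return time.monotonic() - self.last_beat > self.timeout
```

Note the asymmetry with polling: here a dead monitor fails safe (no promotion), whereas a dead poller can silently stop detecting failures.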
Failover Automation
Automated failover sequence:
- Monitor confirms primary is unreachable (M consecutive missed checks).
- Fence the old primary (STONITH) to guarantee it cannot accept writes before promotion.
- Select the standby with the highest apply_lsn — minimizes data loss.
- Promote the selected standby: signal it to exit recovery mode and begin accepting writes.
- Update the VIP or DNS record to point to the new primary.
- Redirect remaining standbys to replicate from the new primary.
- Alert on-call with data loss estimate (lag at promotion time).
Split-Brain Prevention: STONITH Fencing
Split-brain is the catastrophic scenario where two nodes simultaneously believe they are the primary — both accept writes, data diverges, and there is no clean merge path. The standard prevention mechanism is STONITH: Shoot The Other Node In The Head.
Before promoting a standby, the fencing agent sends a kill command to the old primary via an out-of-band channel (IPMI, iLO, AWS EC2 force-stop API, PDU power cut). The old primary is powered off or network-isolated before the standby begins accepting writes. This guarantees that at most one node can accept writes at any instant, even if the network partition is partial.
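A fencing agent typically has several kill channels and must get positive confirmation from one of them before promotion proceeds. A minimal sketch, with placeholder channel names and callables standing in for real IPMI or cloud API calls (none of these names come from a specific tool):

```python
from typing import Callable, List, Tuple

def fence_node(channels: List[Tuple[str, Callable[[], bool]]]) -> str:
    """Try out-of-band kill channels in order (e.g. IPMI power-off, then a
    cloud force-stop API, then a PDU cut). Each callable must return True
    only after verifying the node is actually down, not merely after
    requesting shutdown. Returns the name of the channel that confirmed.
    If every channel fails, promotion must NOT proceed: raising here is
    what prevents split-brain."""
    for name, kill in channels:
        if kill():
            return name
    raise RuntimeError("all fencing channels failed; aborting promotion")
```

The fail-closed behavior is the important part: an unfenceable primary means the failover is aborted, trading availability for consistency.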
Quorum-Based Split-Brain Prevention
In a 3+ node cluster, quorum provides a software alternative to STONITH. Promotion requires a majority vote. If a node cannot reach a majority of peers, it refuses to promote and continues in standby mode. A minority-side node that incorrectly suspects the primary is down cannot unilaterally promote itself — it needs quorum to do so. Patroni (PostgreSQL) and Orchestrator (MySQL) implement quorum-based promotion using etcd or ZooKeeper as the distributed consensus store.
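The promotion rule reduces to a strict-majority check over the full cluster membership, not just over the nodes a candidate can currently reach. A minimal sketch:

```python
def may_promote(votes_for_me: int, cluster_size: int) -> bool:
    """A standby may promote only with a strict majority of the whole
    cluster. In a 3-node cluster a partitioned minority node sees at most
    one vote (its own) and must remain in standby; two partitions can
    never both hold a majority, so at most one primary exists."""
    return votes_for_me > cluster_size // 2
```

Counting against the full membership is what makes this safe: a node that counted only reachable peers could see "2 of 2" on the minority side of a partition and wrongly promote.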
SQL Schema
CREATE TABLE ClusterNode (
id BIGSERIAL PRIMARY KEY,
role VARCHAR(16) NOT NULL CHECK (role IN ('primary', 'standby')),
address VARCHAR(256) NOT NULL,
apply_lsn BIGINT NOT NULL DEFAULT 0,
last_heartbeat_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE TABLE FailoverRecord (
id BIGSERIAL PRIMARY KEY,
old_primary_id BIGINT NOT NULL REFERENCES ClusterNode(id),
new_primary_id BIGINT NOT NULL REFERENCES ClusterNode(id),
trigger_type VARCHAR(64) NOT NULL,
data_loss_lsn BIGINT,
promoted_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE TABLE FencingAction (
id BIGSERIAL PRIMARY KEY,
node_id BIGINT NOT NULL REFERENCES ClusterNode(id),
action_type VARCHAR(64) NOT NULL,
executed_at TIMESTAMPTZ NOT NULL DEFAULT now(),
result VARCHAR(32) NOT NULL
);
CREATE INDEX idx_cn_role ON ClusterNode (role);
CREATE INDEX idx_cn_heartbeat ON ClusterNode (last_heartbeat_at DESC);
CREATE INDEX idx_fr_promoted ON FailoverRecord (promoted_at DESC);
Python Implementation
import time
import threading
from typing import Optional, List
from dataclasses import dataclass


@dataclass
class ClusterNode:
    node_id: int
    address: str
    role: str  # "primary" or "standby"
    apply_lsn: int = 0
    last_heartbeat: float = 0.0


class ActivePassiveManager:
    CHECK_INTERVAL = 2.0     # seconds between health checks
    FAILURE_THRESHOLD = 3    # consecutive missed checks before declaring failure

    def __init__(self, nodes: List[ClusterNode]):
        self.nodes = {n.node_id: n for n in nodes}
        self._lock = threading.Lock()
        self._miss = {}  # node_id -> consecutive miss count

    def _primary(self) -> Optional[ClusterNode]:
        return next((n for n in self.nodes.values() if n.role == 'primary'), None)

    def _best_standby(self) -> Optional[ClusterNode]:
        standbys = [n for n in self.nodes.values() if n.role == 'standby']
        return max(standbys, key=lambda n: n.apply_lsn, default=None)

    def monitor_primary(self, cluster_id: int):
        """Poll primary health; trigger failover after threshold misses."""
        while True:
            primary = self._primary()
            if primary is None:
                time.sleep(self.CHECK_INTERVAL)
                continue
            alive = self._ping(primary)
            with self._lock:
                if alive:
                    self._miss[primary.node_id] = 0
                else:
                    self._miss[primary.node_id] = (
                        self._miss.get(primary.node_id, 0) + 1
                    )
                    if self._miss[primary.node_id] >= self.FAILURE_THRESHOLD:
                        print(f"Primary {primary.node_id} declared dead.")
                        standby = self._best_standby()
                        if standby:
                            self.trigger_failover(primary.node_id, standby.node_id)
            time.sleep(self.CHECK_INTERVAL)

    def _ping(self, node: ClusterNode) -> bool:
        """Simulate health check — replace with real TCP/SQL probe."""
        import random
        return random.random() > 0.05

    def fence_old_primary(self, primary_id: int) -> bool:
        """
        Send STONITH command to old primary before promoting standby.
        In production: IPMI power-off, EC2 force-stop, PDU cut.
        """
        print(f"STONITH: fencing node {primary_id}...")
        # Simulate out-of-band kill
        node = self.nodes.get(primary_id)
        if node:
            node.role = 'fenced'
            print(f"Node {primary_id} fenced successfully.")
        return True

    def trigger_failover(self, primary_id: int, standby_id: int):
        lag = self.nodes[primary_id].apply_lsn - self.nodes[standby_id].apply_lsn
        print(f"Failover initiated. Estimated data loss: {lag} LSN units.")
        if not self.fence_old_primary(primary_id):
            raise RuntimeError("Fencing failed — aborting failover to prevent split-brain.")
        self.nodes[standby_id].role = 'primary'
        print(f"Node {standby_id} promoted to primary.")
        self.update_routing(standby_id)

    def update_routing(self, new_primary_id: int):
        """Update VIP or DNS to point to new primary."""
        node = self.nodes[new_primary_id]
        print(f"Routing updated: VIP now points to {node.address}.")
RPO and RTO Considerations
- RPO (Recovery Point Objective): with synchronous replication to a hot standby, RPO = 0. With asynchronous replication, RPO = replication lag at failure time. Monitor lag continuously and set alerts when lag approaches the RPO budget.
- RTO (Recovery Time Objective): hot standby with automated failover achieves RTO of 10-30 seconds (detection time + fencing + promotion + routing update). Warm standby adds startup time. Cold standby RTO is bounded by backup restore time — potentially hours.
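The hot-standby RTO figure is simple arithmetic over the failover steps: the detection window (check interval times the failure threshold) plus fencing, promotion, and routing-update time. A small estimator, where every step duration is an operator-supplied assumption:

```python
def estimated_rto_seconds(check_interval: float, failure_threshold: int,
                          fencing: float, promotion: float, routing: float) -> float:
    """Worst-case automated-failover RTO: the failure-detection window
    plus the fence -> promote -> reroute steps. All durations in seconds;
    fencing/promotion/routing are estimates for your environment."""
    detection = check_interval * failure_threshold
    return detection + fencing + promotion + routing
```

With a 2-second check interval and a threshold of 3 misses (the defaults in the implementation above), plus a few seconds each for fencing, promotion, and routing, the total lands in the 10-30 second range quoted for hot standbys.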
Frequently Asked Questions
What is the difference between hot, warm, and cold standby?
A hot standby is fully synchronized and accepts connections immediately on promotion — zero startup delay. A warm standby continuously replicates but requires a brief step (completing WAL replay, updating configuration) before serving traffic, typically adding 10-60 seconds to RTO. A cold standby is restored from periodic backups with no continuous replication; promotion time is bounded by the backup restore duration, which can be hours for large databases.
What is STONITH and why is it required for safe failover?
STONITH (Shoot The Other Node In The Head) is a fencing mechanism that physically isolates or powers off the old primary before promoting a standby. Without fencing, the old primary might recover from a transient partition and resume accepting writes simultaneously with the newly promoted primary — split-brain. STONITH guarantees that at most one node can accept writes at any moment by ensuring the old primary is definitively offline before the new one starts.
How is split-brain detected without STONITH?
In quorum-based systems (Patroni with etcd, Orchestrator with ZooKeeper), a node can only become primary if it holds a distributed lock or receives a majority vote. A node that cannot reach a quorum of peers refuses to promote. If two partitions each have a node trying to become primary, only the partition with a quorum majority succeeds. The minority-side node remains in standby mode until it can rejoin the majority partition.
What is the RPO when using async replication in active-passive?
RPO with async replication equals the replication lag at the exact moment the primary fails. If the standby was 5 seconds behind, the last 5 seconds of committed writes are lost. RPO is non-deterministic and depends on network conditions, write rate, and standby load. To bound RPO, use synchronous replication to at least one standby (semi-sync), monitor lag continuously, and alert when lag exceeds the RPO budget.
How is the passive standby promoted to active during failover?
A health-check system or consensus mechanism detects the active node's failure and triggers promotion by updating routing (DNS, load balancer, or virtual IP) to point to the standby, which is then allowed to accept writes. The standby must first finish replaying any buffered replication logs to avoid serving stale data.
How is replication lag monitored to assess failover readiness?
Replication lag is measured as the difference between the active's latest write position (LSN in PostgreSQL, binlog position in MySQL) and the position the standby has applied; monitoring systems alert when lag exceeds a threshold. Operators use this metric to decide whether a failover would result in acceptable data loss before initiating promotion.
What is split-brain and how is it prevented in active-passive?
Split-brain occurs when both the active and the passive believe they are the sole writer simultaneously — typically due to a network partition — leading to divergent data. It is prevented using a STONITH (Shoot The Other Node In The Head) fencing mechanism or requiring a quorum witness before the standby is allowed to promote.
How does DNS-based failover work?
DNS-based failover works by health-checking the active endpoint and, upon failure, updating the DNS record to point to the standby's IP address. The recovery time is bounded by the DNS TTL, so short TTLs (30–60 seconds) are used to minimize the window during which clients resolve the old address.