Leader-Follower Replication Low-Level Design: Leader Election, Log Replication, Read Routing, and Failover

Single-Leader Model

In leader-follower (primary-replica) replication, all writes are routed to a single designated leader. The leader appends each write to its local write-ahead log (WAL) and then propagates log entries to one or more followers. Followers apply log entries in strict order, each maintaining an apply_lsn (log sequence number) that tracks how far it has caught up. This model provides strong write consistency, since there is one source of truth for the write order, while enabling read scale-out across followers.

Replication Modes

Synchronous (Semi-Synchronous)

The leader waits for at least one follower to acknowledge receipt and persistence of each WAL entry before acknowledging the write to the client. This guarantees zero data loss on leader failure as long as at least one synchronous follower had the entry. The tradeoff: write latency increases by at least one round-trip to the nearest synchronous follower. MySQL calls this semi-synchronous replication; PostgreSQL implements it with synchronous_commit = on and at least one standby listed in synchronous_standby_names.

Asynchronous

The leader acknowledges the write after persisting it locally, before any follower confirms receipt. Writes are replicated to followers in the background. This minimizes write latency but creates a durability window: if the leader crashes before replication, the acknowledged write is lost. This is PostgreSQL's default replication behavior: with synchronous_standby_names empty, commits never wait for any standby (synchronous_commit = off goes further still, skipping even the local WAL flush at commit time).

Follower Application and Read Serving

Followers apply WAL entries in strict order. The gap between the leader's current LSN and a follower's apply_lsn is the replication lag. A follower can serve reads safely but with staleness proportional to its lag. Applications that can tolerate eventual consistency route reads to followers; applications requiring read-your-writes must route reads back to the leader or use a session-level LSN tracking mechanism to wait until the follower catches up.

Leader Lease

A leader holds a time-bounded lease, a commitment from the quorum that no new leader will be elected until the lease expires. The leader renews the lease by sending periodic heartbeats to a majority of followers. As long as a valid lease exists, followers refuse to initiate an election. This prevents split-brain during brief network partitions: a follower that cannot reach the leader cannot immediately declare itself leader if the leader's lease is still valid. Two timing constraints keep this correct: the heartbeat interval must be shorter than the lease duration, so a healthy leader always renews before expiry, and the lease must have expired by the time a follower's election timeout fires, so a deposed leader has stopped accepting writes before a new leader can be chosen.

Leader Election

When a follower detects leader failure via heartbeat timeout, it transitions to candidate state. Raft-style election: a candidate requests votes from all peers. Each peer votes for the candidate only if the candidate's log is at least as up-to-date as the peer's own log (last log term and index comparison). A candidate that receives votes from a majority of nodes wins the election and becomes the new leader. The follower with the highest apply_lsn among candidates is most likely to win because it has the most complete log.

Failover Process

Automated failover sequence on leader failure:

  1. Health monitor detects no heartbeat within the timeout window.
  2. Identify the follower with the highest apply_lsn — this minimizes potential data loss.
  3. Promote that follower to leader (update its role, allow it to accept writes).
  4. Update the routing layer (DNS, VIP, or load balancer) to point write traffic at the new leader.
  5. Notify remaining followers to replicate from the new leader.
  6. If the old leader recovers, it must rejoin as a follower and sync from the new leader.

Replica Lag Monitoring

Continuous lag monitoring is essential for:

  • SLA alerting: page if any follower lag exceeds the RPO budget.
  • Read routing: exclude followers whose lag exceeds the application's staleness tolerance.
  • Failover planning: the highest-LSN follower has the least potential data loss.

In PostgreSQL, pg_stat_replication exposes write_lag, flush_lag, and replay_lag per standby. These should be scraped into a time-series store (Prometheus, InfluxDB) and alerted on thresholds.

SQL Schema

CREATE TABLE ReplicationNode (
    id                BIGSERIAL   PRIMARY KEY,
    role              VARCHAR(16) NOT NULL CHECK (role IN ('leader', 'follower')),
    address           VARCHAR(256) NOT NULL,
    apply_lsn         BIGINT       NOT NULL DEFAULT 0,
    last_heartbeat_at TIMESTAMPTZ  NOT NULL DEFAULT now()
);

CREATE TABLE LeaderLease (
    id               BIGSERIAL   PRIMARY KEY,
    leader_id        BIGINT      NOT NULL REFERENCES ReplicationNode(id),
    lease_granted_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    lease_expires_at TIMESTAMPTZ NOT NULL
);

CREATE TABLE FailoverLog (
    id               BIGSERIAL    PRIMARY KEY,
    old_leader_id    BIGINT       NOT NULL REFERENCES ReplicationNode(id),
    new_leader_id    BIGINT       NOT NULL REFERENCES ReplicationNode(id),
    trigger          VARCHAR(128) NOT NULL,
    promoted_at      TIMESTAMPTZ  NOT NULL DEFAULT now(),
    lag_at_promotion BIGINT
);

CREATE INDEX idx_rn_role      ON ReplicationNode (role);
CREATE INDEX idx_rn_heartbeat ON ReplicationNode (last_heartbeat_at DESC);
CREATE INDEX idx_lease_leader ON LeaderLease (leader_id, lease_expires_at DESC);

Python Implementation

import time
import threading
from dataclasses import dataclass, field
from typing import Optional, List

@dataclass
class Node:
    node_id:        int
    address:        str
    role:           str   # "leader" or "follower"
    apply_lsn:      int   = 0
    last_heartbeat: float = field(default_factory=time.time)

class LeaderElector:
    HEARTBEAT_INTERVAL = 1.0
    LEASE_DURATION     = 5.0
    ELECTION_TIMEOUT   = 3.0

    def __init__(self, nodes: List[Node], self_id: int):
        self.nodes   = {n.node_id: n for n in nodes}
        self.self_id = self_id
        self._lease_expiry: Optional[float] = None
        self._lock = threading.Lock()

    @property
    def self_node(self):
        return self.nodes[self.self_id]

    def check_lease(self) -> bool:
        with self._lock:
            return (self._lease_expiry is not None and
                    time.time() < self._lease_expiry)

    def elect_leader(self) -> Optional[int]:
        # Followers may not start an election while a valid lease exists.
        if self.check_lease():
            return None
        candidates = sorted(
            self.nodes.values(),
            key=lambda n: n.apply_lsn,
            reverse=True
        )
        best = candidates[0]
        # A peer grants its vote only if the candidate's log is at least
        # as up-to-date as its own (apply_lsn comparison).
        votes  = sum(1 for n in self.nodes.values() if n.apply_lsn <= best.apply_lsn)
        quorum = len(self.nodes) // 2 + 1
        if votes >= quorum:
            return best.node_id
        return None

    def promote_to_leader(self, node_id: int):
        with self._lock:
            for n in self.nodes.values():
                n.role = 'follower'
            self.nodes[node_id].role = 'leader'
            self._lease_expiry = time.time() + self.LEASE_DURATION
        print(f"Node {node_id} promoted to leader.")

    def monitor_lag(self, threshold_lsn: int = 1000) -> List[int]:
        leader = next((n for n in self.nodes.values() if n.role == 'leader'), None)
        if not leader:
            return []
        lagging = [
            n.node_id for n in self.nodes.values()
            if n.role == 'follower' and (leader.apply_lsn - n.apply_lsn) > threshold_lsn
        ]
        if lagging:
            print(f"WARNING: lagging followers: {lagging}")
        return lagging

Read-Your-Writes Consistency

When a client writes to the leader and immediately reads from a follower, the follower may not yet have applied the write. Techniques to ensure read-your-writes:

  • Route reads to leader: simplest but defeats the purpose of followers.
  • Session LSN tracking: the leader returns the write LSN in its response; the client sends this LSN with subsequent reads; the router waits until the selected follower's apply_lsn reaches the session LSN before routing the read.
  • Sticky session: route all requests from a given client to the same follower; the follower's lag is consistent for that client.

Frequently Asked Questions

What is the difference between semi-sync and async replication?

In semi-sync replication, the leader waits for at least one follower to acknowledge the write before confirming to the client. This guarantees no data loss on leader failure as long as one follower was in sync. In async replication, the leader confirms the write after local persistence only, without waiting for any follower. Async replication has lower write latency but risks losing the most recent writes if the leader crashes before replication completes.

How does a leader lease prevent split-brain?

A leader lease is a time-bounded guarantee from a quorum that no new leader will be elected until the lease expires. If the leader is partitioned from followers but its lease is still valid, followers cannot legally elect a new leader. This prevents two nodes from simultaneously believing they are the leader. The lease must be shorter than the heartbeat timeout to ensure the leader can always renew it while healthy.

Can reads always be served from followers?

Reads can be served from followers when the application tolerates stale data (eventual consistency). For read-your-writes consistency — where a client must see its own recent writes — reads must either go to the leader, or the router must wait until the target follower's apply_lsn has caught up to the LSN of the client's most recent write. The latter approach (session LSN tracking) allows follower reads while preserving read-your-writes guarantees.

How much data is lost during failover with async replication?

The data loss window equals the replication lag of the promoted follower at the moment of leader failure. If the leader had advanced its LSN by 500 since the last replication to the best follower, those 500 log entries are lost. Minimizing this window requires monitoring lag continuously, keeping followers close to the leader (low latency replication path), and using semi-sync for critical data where any loss is unacceptable.



