Single-Leader Model
In leader-follower (primary-replica) replication, all writes are routed to a single designated leader. The leader applies writes to its local write-ahead log (WAL) and then propagates log entries to one or more followers. Followers apply log entries in order, maintaining an apply_lsn (log sequence number) tracking how far they have caught up. This model provides strong write consistency — there is one source of truth for the write order — while enabling read scale-out across followers.
Replication Modes
Synchronous (Semi-Synchronous)
The leader waits for at least one follower to acknowledge receipt and persistence of each WAL entry before acknowledging the write to the client. This guarantees zero data loss on leader failure if at least one follower was synchronous. The tradeoff: write latency increases by at least one round-trip to the nearest synchronous follower. MySQL calls this semi-sync; PostgreSQL implements it via synchronous_commit = on with a named synchronous standby.
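As a concrete illustration, semi-synchronous behavior in PostgreSQL is configured on the leader; the standby names below are placeholders:

```
# postgresql.conf on the leader: wait for any one of the named standbys
# to confirm WAL flush before acknowledging a commit to the client.
synchronous_commit = on
synchronous_standby_names = 'ANY 1 (standby_a, standby_b)'
```

With this configuration, losing the leader cannot lose an acknowledged commit as long as one of the listed standbys received it.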
Asynchronous
The leader acknowledges the write after persisting it locally, before any follower confirms receipt. Writes are replicated to followers in the background. This minimizes write latency but creates a durability window: if the leader crashes before replication, the acknowledged write is lost. Asynchronous replication is the PostgreSQL default: synchronous_standby_names is empty out of the box, so commits wait only for the local WAL flush (setting synchronous_commit = off additionally skips even that local wait).
Follower Application and Read Serving
Followers apply WAL entries in strict order. The gap between the leader's current LSN and a follower's apply_lsn is the replication lag. A follower can serve reads safely but with staleness proportional to its lag. Applications that can tolerate eventual consistency route reads to followers; applications requiring read-your-writes must route reads back to the leader or use a session-level LSN tracking mechanism to wait until the follower catches up.
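Lag-aware read routing can be sketched as follows. This is a minimal illustration; the function name and the LSN-unit staleness budget are assumptions, not a real router API:

```python
def pick_read_replica(leader_lsn: int, replicas: dict, staleness_budget: int):
    """Return the id of a replica whose lag is within budget, else None
    (meaning the read should fall back to the leader).

    replicas: dict mapping replica_id -> apply_lsn
    staleness_budget: maximum tolerated lag, in LSN units
    """
    eligible = {rid: lsn for rid, lsn in replicas.items()
                if leader_lsn - lsn <= staleness_budget}
    if not eligible:
        return None  # every follower is too stale; read from the leader
    # prefer the most caught-up eligible replica
    return max(eligible, key=eligible.get)

# example: leader at LSN 1000, budget of 100 LSN units
print(pick_read_replica(1000, {"f1": 950, "f2": 800}, 100))  # -> f1
```

A real router would refresh `apply_lsn` values from heartbeats or a monitoring feed rather than receive them as arguments.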
Leader Lease
A leader holds a time-bounded lease: a commitment from the quorum that no new leader will be elected until the lease expires. The leader renews the lease by sending periodic heartbeats to a majority of followers. As long as a valid lease exists, followers refuse to initiate an election. This prevents split-brain during brief network partitions: a follower that cannot reach the leader cannot immediately declare itself leader while the leader's lease is still valid. The heartbeat interval must be shorter than the lease duration so that a healthy leader always renews the lease before it expires.
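The lease mechanics above can be sketched in a few lines. This is illustrative only; a real implementation must also bound clock skew between nodes:

```python
class Lease:
    """Time-bounded leader lease; followers refuse elections while it is valid."""

    def __init__(self, duration_s: float):
        self.duration_s = duration_s
        self.expires_at = 0.0  # lapsed until first renewal

    def renew(self, now: float) -> None:
        # Called on the follower when a heartbeat from the leader is observed.
        self.expires_at = now + self.duration_s

    def election_allowed(self, now: float) -> bool:
        # A follower may start an election only after the lease has lapsed.
        return now >= self.expires_at

lease = Lease(duration_s=5.0)
lease.renew(now=100.0)
print(lease.election_allowed(now=103.0))  # False: lease valid until t=105
print(lease.election_allowed(now=106.0))  # True: lease expired
```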
Leader Election
When a follower detects leader failure via heartbeat timeout, it transitions to candidate state. Raft-style election: a candidate requests votes from all peers. Each peer votes for the candidate only if the candidate's log is at least as up-to-date as the peer's own log (last log term and index comparison). A candidate that receives votes from a majority of nodes wins the election and becomes the new leader. The follower with the highest apply_lsn among candidates is most likely to win because it has the most complete log.
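The up-to-date comparison in the voting rule can be written directly from the description above (a sketch; the function name is illustrative):

```python
def log_at_least_as_up_to_date(cand_last_term: int, cand_last_index: int,
                               own_last_term: int, own_last_index: int) -> bool:
    """Raft voting rule: grant a vote only if the candidate's log is at
    least as up-to-date as the voter's (compare last term, then last index)."""
    if cand_last_term != own_last_term:
        return cand_last_term > own_last_term
    return cand_last_index >= own_last_index

print(log_at_least_as_up_to_date(3, 10, 2, 50))  # True: higher last term wins
print(log_at_least_as_up_to_date(3, 10, 3, 12))  # False: same term, shorter log
```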
Failover Process
Automated failover sequence on leader failure:
- Health monitor detects no heartbeat within the timeout window.
- Identify the follower with the highest apply_lsn; this minimizes potential data loss.
- Promote that follower to leader (update its role, allow it to accept writes).
- Update the routing layer (DNS, VIP, or load balancer) to point write traffic at the new leader.
- Notify remaining followers to replicate from the new leader.
- If the old leader recovers, it must rejoin as a follower and sync from the new leader.
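The sequence above can be sketched as a single orchestration step. The node structure and the routing callback are illustrative assumptions standing in for real cluster state and the DNS/VIP update:

```python
def fail_over(nodes, route_writes_to):
    """Promote the most caught-up follower after leader failure.

    nodes: list of dicts with 'id', 'role', and 'apply_lsn'
    route_writes_to: callback standing in for the DNS/VIP/load-balancer update
    """
    followers = [n for n in nodes if n["role"] == "follower"]
    if not followers:
        raise RuntimeError("no follower available for promotion")
    # Pick the follower with the highest apply_lsn to minimize data loss.
    new_leader = max(followers, key=lambda n: n["apply_lsn"])
    new_leader["role"] = "leader"
    route_writes_to(new_leader["id"])
    # Remaining followers (and a recovered old leader) resync from the new leader.
    return new_leader["id"]

nodes = [
    {"id": 1, "role": "down", "apply_lsn": 1000},      # crashed old leader
    {"id": 2, "role": "follower", "apply_lsn": 990},
    {"id": 3, "role": "follower", "apply_lsn": 700},
]
print(fail_over(nodes, route_writes_to=lambda nid: None))  # -> 2
```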
Replica Lag Monitoring
Continuous lag monitoring is essential for:
- SLA alerting: page if any follower lag exceeds the RPO budget.
- Read routing: exclude followers whose lag exceeds the application's staleness tolerance.
- Failover planning: the highest-LSN follower has the least potential data loss.
In PostgreSQL, pg_stat_replication exposes write_lag, flush_lag, and replay_lag per standby. These should be scraped into a time-series store (Prometheus, InfluxDB) and alerted on thresholds.
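For reference, a minimal lag query against pg_stat_replication looks like this (these columns exist in PostgreSQL 10 and later):

```sql
SELECT application_name,
       write_lag,
       flush_lag,
       replay_lag
FROM pg_stat_replication;
```

The three intervals measure how far behind each standby is at the write, flush, and replay stages respectively; replay_lag is the figure that matters for read staleness.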
SQL Schema
CREATE TABLE ReplicationNode (
id BIGSERIAL PRIMARY KEY,
role VARCHAR(16) NOT NULL CHECK (role IN ('leader', 'follower')),
address VARCHAR(256) NOT NULL,
apply_lsn BIGINT NOT NULL DEFAULT 0,
last_heartbeat_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE TABLE LeaderLease (
id BIGSERIAL PRIMARY KEY,
leader_id BIGINT NOT NULL REFERENCES ReplicationNode(id),
lease_granted_at TIMESTAMPTZ NOT NULL DEFAULT now(),
lease_expires_at TIMESTAMPTZ NOT NULL
);
CREATE TABLE FailoverLog (
id BIGSERIAL PRIMARY KEY,
old_leader_id BIGINT NOT NULL REFERENCES ReplicationNode(id),
new_leader_id BIGINT NOT NULL REFERENCES ReplicationNode(id),
trigger VARCHAR(128) NOT NULL,
promoted_at TIMESTAMPTZ NOT NULL DEFAULT now(),
lag_at_promotion BIGINT
);
CREATE INDEX idx_rn_role ON ReplicationNode (role);
CREATE INDEX idx_rn_heartbeat ON ReplicationNode (last_heartbeat_at DESC);
CREATE INDEX idx_lease_leader ON LeaderLease (leader_id, lease_expires_at DESC);
Python Implementation
import time
import threading
from dataclasses import dataclass, field
from typing import Optional, List

@dataclass
class Node:
    node_id: int
    address: str
    role: str  # "leader" or "follower"
    apply_lsn: int = 0
    last_heartbeat: float = field(default_factory=time.time)

class LeaderElector:
    HEARTBEAT_INTERVAL = 1.0
    LEASE_DURATION = 5.0
    ELECTION_TIMEOUT = 3.0

    def __init__(self, nodes: List[Node], self_id: int):
        self.nodes = {n.node_id: n for n in nodes}
        self.self_id = self_id
        self._lease_expiry: Optional[float] = None
        self._lock = threading.Lock()

    @property
    def self_node(self) -> Node:
        return self.nodes[self.self_id]

    def check_lease(self) -> bool:
        with self._lock:
            return (self._lease_expiry is not None and
                    time.time() < self._lease_expiry)

    def elect_leader(self, quorum: int) -> Optional[int]:
        # No election while the current leader's lease is still valid.
        if self.check_lease():
            return None
        candidates = sorted(
            self.nodes.values(),
            key=lambda n: n.apply_lsn,
            reverse=True
        )
        best = candidates[0]
        # A node votes for the candidate only if the candidate's log is at
        # least as up-to-date as its own (simplified to an LSN comparison).
        votes = sum(1 for n in self.nodes.values() if n.apply_lsn <= best.apply_lsn)
        if votes >= quorum:
            return best.node_id
        return None

    def promote_to_leader(self, node_id: int) -> None:
        with self._lock:
            for n in self.nodes.values():
                n.role = 'follower'
            self.nodes[node_id].role = 'leader'
            self._lease_expiry = time.time() + self.LEASE_DURATION
        print(f"Node {node_id} promoted to leader.")

    def monitor_lag(self, threshold_lsn: int = 1000) -> List[int]:
        leader = next((n for n in self.nodes.values() if n.role == 'leader'), None)
        if not leader:
            return []
        lagging = [
            n.node_id for n in self.nodes.values()
            if n.role == 'follower' and (leader.apply_lsn - n.apply_lsn) > threshold_lsn
        ]
        if lagging:
            print(f"WARNING: lagging followers: {lagging}")
        return lagging
Read-Your-Writes Consistency
When a client writes to the leader and immediately reads from a follower, the follower may not yet have applied the write. Techniques to ensure read-your-writes:
- Route reads to leader: simplest but defeats the purpose of followers.
- Session LSN tracking: the leader returns the write LSN in its response; the client sends this LSN with subsequent reads; the router waits until the selected follower's apply_lsn reaches the session LSN before routing the read.
- Sticky session: route all requests from a given client to the same follower; that follower's lag is consistent for the client.
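Session LSN tracking from the list above can be sketched as a wait-then-read helper. This is a simplified illustration; a production router would subscribe to apply_lsn updates rather than poll:

```python
import time

def read_with_session_lsn(follower, session_lsn, timeout_s=1.0, poll_s=0.01):
    """Wait until follower.apply_lsn reaches the client's last write LSN,
    then serve the read from the follower; fall back to the leader on timeout.

    follower: any object exposing an apply_lsn attribute (illustrative)
    session_lsn: LSN of the client's most recent write
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if follower.apply_lsn >= session_lsn:
            return "read_from_follower"
        time.sleep(poll_s)
    # Follower too far behind: preserve read-your-writes via the leader.
    return "read_from_leader"
```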
Frequently Asked Questions
What is the difference between semi-sync and async replication?
In semi-sync replication, the leader waits for at least one follower to acknowledge the write before confirming to the client. This guarantees no data loss on leader failure as long as one follower was in sync. In async replication, the leader confirms the write after local persistence only, without waiting for any follower. Async replication has lower write latency but risks losing the most recent writes if the leader crashes before replication completes.
How does a leader lease prevent split-brain?
A leader lease is a time-bounded guarantee from a quorum that no new leader will be elected until the lease expires. If the leader is partitioned from followers but its lease is still valid, followers cannot legally elect a new leader. This prevents two nodes from simultaneously believing they are the leader. The heartbeat interval must be shorter than the lease duration to ensure the leader can always renew the lease while healthy.
Can reads always be served from followers?
Reads can be served from followers when the application tolerates stale data (eventual consistency). For read-your-writes consistency — where a client must see its own recent writes — reads must either go to the leader, or the router must wait until the target follower's apply_lsn has caught up to the LSN of the client's most recent write. The latter approach (session LSN tracking) allows follower reads while preserving read-your-writes guarantees.
How much data is lost during failover with async replication?
The data loss window equals the replication lag of the promoted follower at the moment of leader failure. If the leader had advanced its LSN by 500 since the last replication to the best follower, those 500 log entries are lost. Minimizing this window requires monitoring lag continuously, keeping followers close to the leader (low latency replication path), and using semi-sync for critical data where any loss is unacceptable.