Requirements and Constraints
A leader election service ensures that exactly one node in a cluster acts as leader at any time, coordinating work that must not be duplicated (e.g., shard assignment, cron job execution, master-replica promotion). Functional requirements: elect a leader within a bounded time after the previous leader fails, prevent split-brain (two simultaneous leaders), allow followers to discover the current leader, and fence stale leaders so they cannot perform writes with an outdated lease.
Constraints: the service must tolerate minority node failures (up to f failures in a 2f+1 node cluster), leadership changes must complete within a target of 5 seconds at p99, and the system must handle network partitions gracefully — nodes in the minority partition must not act as leader.
Core Data Model
Raft Persistent State (per node)
- current_term (int64) — monotonically increasing election term; persisted to stable storage before any RPC response
- voted_for (node_id or null) — candidate voted for in current term
- log[] — ordered sequence of log entries:
{ term, index, command }
Raft Volatile State (per node)
- commit_index — highest log index known to be committed
- last_applied — highest log index applied to state machine
- next_index[] (leader only) — for each follower, next log index to send
- match_index[] (leader only) — highest log index known to be replicated on each follower
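The state listed above can be sketched as a small Python structure (field names follow the lists; the `Entry` record and defaults are illustrative, not from a specific implementation):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Entry:
    term: int      # term in which the entry was created
    index: int     # position in the log
    command: str   # opaque state-machine command

@dataclass
class RaftState:
    # Persistent state: must reach stable storage before answering any RPC
    current_term: int = 0
    voted_for: Optional[str] = None          # node_id voted for in current_term
    log: list = field(default_factory=list)  # ordered Entry records

    # Volatile state
    commit_index: int = 0   # highest log index known to be committed
    last_applied: int = 0   # highest log index applied to the state machine

    # Leader-only volatile state, keyed by follower node_id
    next_index: dict = field(default_factory=dict)
    match_index: dict = field(default_factory=dict)
```

In a real node, `current_term`, `voted_for`, and `log` would be fsynced before any RPC response, per the persistence note above.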
Leader Registry (ZooKeeper variant)
When using ZooKeeper instead of Raft, nodes compete to create an ephemeral sequential znode at /election/candidate-. The node with the lowest sequence number is leader. Each non-leader watches the node immediately preceding it. On deletion (crash), the next node gets a watch notification and checks if it's now the lowest — this avoids the herd effect of all nodes watching the leader directly.
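The lowest-sequence rule and the predecessor-watch choice reduce to a pure function over the child names of /election (function and variable names here are ours, for illustration):

```python
def election_decision(children, my_znode):
    """Given the child names of /election (e.g. 'candidate-0000000003')
    and our own znode name, return (am_leader, znode_to_watch)."""
    # Sort by the numeric sequence suffix ZooKeeper appends
    ordered = sorted(children, key=lambda n: int(n.rsplit('-', 1)[1]))
    my_pos = ordered.index(my_znode)
    if my_pos == 0:
        return True, None               # lowest sequence number wins
    # Watch only the immediate predecessor, not the leader: avoids the
    # herd effect of every follower waking on one deletion
    return False, ordered[my_pos - 1]
```

With a real client such as kazoo, `children` would come from `get_children("/election")` and the watch would be set on the returned predecessor path; on a watch notification the node re-runs this function, since its predecessor may have crashed without being the leader.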
Key Algorithms and Logic
Raft Leader Election
- Followers start an election timer (randomized 150–300ms). If no heartbeat is received before timeout, the node increments its term, transitions to candidate, votes for itself, and sends RequestVote RPCs to all peers.
- A node grants a vote if: it has not voted in this term, and the candidate's log is at least as up-to-date as its own (by comparing last log term, then last log index).
- A candidate winning a majority of votes (⌊N/2⌋ + 1) transitions to leader and immediately sends empty AppendEntries (heartbeats) to assert authority and prevent new elections.
- If a split vote occurs (no majority), candidates time out and start a new election with an incremented term. Randomized timeouts make split votes statistically rare.
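The vote-granting rule above (one vote per term, candidate's log at least as up-to-date) can be sketched as a predicate; the dict-based `state` and its field names are hypothetical stand-ins for a node's persistent state:

```python
def grant_vote(state, req_term, candidate_id, cand_last_index, cand_last_term):
    """Decide whether to grant a RequestVote. `state` needs current_term,
    voted_for, last_log_term, and last_log_index."""
    if req_term < state["current_term"]:
        return False                        # stale candidate: reject
    if req_term > state["current_term"]:    # newer term observed:
        state["current_term"] = req_term    # adopt it and clear our vote
        state["voted_for"] = None
    if state["voted_for"] not in (None, candidate_id):
        return False                        # already voted for someone else
    # Up-to-date check: compare last log term first, then last log index
    up_to_date = (cand_last_term, cand_last_index) >= (
        state["last_log_term"], state["last_log_index"])
    if up_to_date:
        state["voted_for"] = candidate_id   # record the vote (persist in practice)
    return up_to_date
```

The tuple comparison mirrors the rule in the text: a higher last log term wins outright; equal terms fall back to comparing last log index.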
Heartbeat-Based Failover Detection
The leader sends AppendEntries RPCs (heartbeats) every 50ms. Followers reset their election timer on every valid heartbeat. If no heartbeat arrives before the randomized election timeout fires (150–300ms by default), the follower assumes the leader is dead and starts an election. Keeping the heartbeat interval a small fraction of the election timeout (here 50ms against a 150ms minimum) means several consecutive heartbeats must be lost before a spurious election starts, without generating excessive network traffic.
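The timer logic above can be sketched in a few lines, using the intervals from the text (a monotonic clock is assumed; an injectable `now` keeps it testable):

```python
import random
import time

HEARTBEAT_INTERVAL = 0.05          # leader sends AppendEntries every 50 ms
ELECTION_TIMEOUT = (0.15, 0.30)    # follower timeout randomized 150-300 ms

def new_deadline(now=None):
    """Pick a fresh, randomized election deadline.
    Called at startup and on every valid heartbeat."""
    now = time.monotonic() if now is None else now
    return now + random.uniform(*ELECTION_TIMEOUT)

def leader_suspected(deadline, now=None):
    """True once the election timer has expired with no heartbeat seen."""
    now = time.monotonic() if now is None else now
    return now >= deadline
```

The randomization is what makes split votes rare: two followers timing out at the same instant is unlikely when each picks its deadline independently from the 150–300ms window.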
Epoch Fencing
Every leader action is tagged with the current Raft term (epoch). Resource servers reject any request from a node claiming leadership with a term lower than the highest term they have seen. This prevents a temporarily partitioned ex-leader from committing writes after a new leader has been elected in a higher term. Clients embed the leader's term in all write requests, and the state machine validates term before applying commands.
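On the resource-server side, the fencing check reduces to a compare-and-remember on the highest term seen. A minimal sketch (the class and its fields are illustrative, not a specific system's API):

```python
class FencedResource:
    """A resource server that rejects writes stamped with a term
    lower than the highest term it has ever seen."""

    def __init__(self):
        self.highest_term = 0
        self.data = {}

    def write(self, term, key, value):
        if term < self.highest_term:
            return False              # zombie ex-leader: reject the write
        self.highest_term = term      # remember the newest epoch observed
        self.data[key] = value
        return True
```

Once any write from a leader in term N lands, every later write from a deposed leader in term N-1 or lower bounces, which is exactly the partitioned ex-leader scenario the text describes.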
API Design
- GET /leader — returns { node_id, address, term, lease_expires_at }; used by clients to discover the current leader.
- POST /leader/resign — graceful leadership transfer to the most up-to-date follower (Raft leadership transfer extension).
- GET /cluster/status — returns all nodes, their roles, terms, and log indices; used for monitoring.
- Internal RPC: RequestVote(term, candidateId, lastLogIndex, lastLogTerm) — returns { term, voteGranted }.
- Internal RPC: AppendEntries(term, leaderId, prevLogIndex, prevLogTerm, entries[], leaderCommit) — returns { term, success }.
Scalability Considerations
- Cluster size: 3 nodes tolerate 1 failure; 5 nodes tolerate 2 failures. Beyond 7 nodes, write latency increases due to quorum round-trips — prefer 5-node clusters for most deployments.
- Read scalability: followers can serve stale reads; for linearizable reads, the leader must confirm its leadership with a quorum heartbeat before responding (ReadIndex protocol).
- Geo-distribution: deploy an odd number of nodes across regions and concentrate a quorum of them in one region so that quorum writes avoid cross-region RTT. The trade-off is that the quorum region becomes a correlated failure domain: losing it leaves no majority, stalling elections and writes until it recovers.
- Log compaction: periodically snapshot state machine state and truncate the log to prevent unbounded log growth; send snapshots to lagging followers instead of replaying thousands of entries.
- Lease-based reads: a leader with a known clock bound can serve reads without quorum confirmation during its lease period, reducing read latency at the cost of relying on bounded clock drift.
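The lease-read safety condition above can be sketched as a clock check: the leader shrinks its lease by the assumed drift bound before trusting it. The drift constant and function names here are illustrative assumptions:

```python
import time

CLOCK_DRIFT_BOUND = 0.25   # assumed max clock drift (epsilon), in seconds

def can_serve_lease_read(lease_start, lease_duration, now=None):
    """A leader may answer a read locally only while its lease, shrunk by
    the drift bound, is still in force; otherwise it must fall back to a
    quorum-confirmed ReadIndex read."""
    now = time.monotonic() if now is None else now
    safe_until = lease_start + lease_duration - CLOCK_DRIFT_BOUND
    return now < safe_until
```

Subtracting the drift bound is the conservative move: if the leader's clock runs slow relative to the cluster, it stops serving lease reads before any other node could have granted a conflicting lease.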
Frequently Asked Questions

How does the Raft election algorithm select a new leader?
When a follower's election timeout fires without receiving a heartbeat, it increments its term, transitions to candidate, and sends RequestVote RPCs to all peers. A node grants its vote only once per term, and only to a candidate whose log is at least as up-to-date as its own. The candidate that receives votes from a majority becomes leader and immediately sends heartbeats to suppress further elections.

How does heartbeat-based failure detection work in a leader-election system?
The leader sends periodic heartbeat messages (empty AppendEntries in Raft) to every follower. Each follower resets its election timer on receipt. If a follower's timeout elapses without a heartbeat (the timeout is typically several multiples of the heartbeat interval), it infers leader failure and starts a new election. Tuning both intervals controls the trade-off between false positives and detection latency.

What is epoch fencing and why is it necessary?
An epoch (or term) is a monotonically increasing integer stamped on every message. When a new leader is elected, it increments the epoch. Any request bearing a stale epoch from a deposed leader is rejected by followers and storage backends. This fences out zombie leaders (nodes that believe they are still leader due to a network partition), preventing them from issuing conflicting writes.

How does Raft prevent split-brain?
Raft requires a strict majority (quorum) for both leader election and log entry commitment. In a partition, only the segment containing more than half the nodes can form a quorum, so at most one leader can exist in any term. The minority partition's leader, if one existed, cannot commit new entries and will step down once it rejoins and discovers a higher term.