Raft Consensus Algorithm Low-Level Design: Leader Election, Log Replication, and Safety Guarantees

Raft is a consensus algorithm designed for understandability. It powers etcd (the datastore behind Kubernetes), CockroachDB, TiKV, and many other systems that require strongly consistent distributed state. This post covers the complete low-level design: leader election, log replication, safety guarantees, log compaction, and membership changes.

Roles and Terms

Every node is in one of three roles:

  • Leader: handles all client writes; replicates log entries to followers; sends periodic heartbeats
  • Follower: passive; replicates log entries from the leader; resets election timeout on heartbeat receipt
  • Candidate: transitional state during election

Terms are monotonically increasing logical clocks. Every message carries the sender's current term. If a node receives a message with a higher term, it immediately reverts to follower and adopts that term. A node rejects any request from a lower-term sender, replying with its own (higher) term so the stale sender can catch up.

Leader Election

A follower becomes a candidate when its election timeout fires (no heartbeat received). Election timeouts are randomized (e.g., 150–300 ms) to prevent all followers timing out simultaneously and splitting votes.

On becoming a candidate:

  1. Increment current term
  2. Vote for self
  3. Send RequestVote RPC to all other nodes, including own last log index and term

A node grants a vote if: (a) it has not voted in this term yet, and (b) the candidate's log is at least as up-to-date as the voter's log. Log up-to-date comparison: higher last log term wins; if equal, higher last log index wins.
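The up-to-date comparison can be sketched as a lexicographic tuple compare; `log_is_up_to_date` is an illustrative helper, not part of any particular Raft library:

```python
def log_is_up_to_date(candidate_last_term: int, candidate_last_index: int,
                      my_last_term: int, my_last_index: int) -> bool:
    """True if the candidate's log is at least as up-to-date as ours.

    Raft compares (last term, last index) lexicographically:
    a higher last term wins; on equal terms, a longer log wins.
    """
    return (candidate_last_term, candidate_last_index) >= (my_last_term, my_last_index)
```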

The candidate wins if it receives votes from a majority (⌊N/2⌋ + 1 of N nodes). On winning, it immediately sends heartbeats to establish authority and suppress new elections.

Log Replication

All client writes go to the leader. The leader:

  1. Appends the entry to its local log with the current term
  2. Sends AppendEntries RPC to all followers in parallel
  3. Once a majority acknowledge the entry, marks it committed by advancing commit_index
  4. Applies the committed entry to the state machine
  5. Returns success to the client

Followers apply entries to their state machine once the leader's commit_index advances past those entries (communicated via the next AppendEntries or heartbeat).
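The commit rule in step 3 can be sketched as follows. The function and parameter names are illustrative; the extra current-term check (a leader only commits an entry by counting replicas if the entry was created in its own term) is the safety refinement from the Raft paper:

```python
def advance_commit_index(match_index: dict[str, int], own_last_index: int,
                         log_terms: dict[int, int], current_term: int,
                         commit_index: int) -> int:
    """Highest index N > commit_index stored on a majority of the cluster,
    where entry N was created in the current term. The leader itself
    counts as one replica; match_index maps peer -> highest replicated index.
    """
    cluster_size = len(match_index) + 1  # peers plus the leader
    for n in range(own_last_index, commit_index, -1):
        replicas = 1 + sum(1 for m in match_index.values() if m >= n)
        if replicas * 2 > cluster_size and log_terms.get(n) == current_term:
            return n
    return commit_index
```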

Consistency check: AppendEntries includes the previous log entry's index and term. If a follower's log does not match, it rejects the RPC. The leader retries with earlier entries until a matching point is found, then overwrites divergent follower entries.
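The leader-side bookkeeping for this retry loop can be sketched as below. `Entry`, `entries_to_send`, and `on_reject` are illustrative names, using the same 1-based log indices as the rest of this post; real implementations often back up a whole term at a time using a conflict hint from the follower rather than decrementing by one:

```python
from dataclasses import dataclass

@dataclass
class Entry:
    index: int  # 1-based Raft log index
    term: int

def entries_to_send(log: list[Entry], next_idx: int):
    """Entries from next_idx onward, plus the previous entry's
    (index, term) pair for the follower's consistency check."""
    prev_index = next_idx - 1
    prev_term = log[prev_index - 1].term if prev_index > 0 else 0
    return prev_index, prev_term, log[prev_index:]

def on_reject(next_index: dict[str, int], peer: str) -> None:
    """The follower rejected the consistency check: back up and retry."""
    next_index[peer] = max(1, next_index[peer] - 1)
```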

Safety Guarantees

Election restriction: a candidate cannot win an election unless its log contains all committed entries. Since committed entries exist on a majority of nodes, and the candidate must receive votes from a majority, the two sets intersect: at least one voter holds every committed entry, and the up-to-date check ensures the candidate's log is at least as complete as that voter's.

Log Matching Property: if two log entries in different nodes have the same index and term, they are identical, and all preceding entries are identical. This is guaranteed because the leader creates at most one entry per index per term, and followers reject conflicting entries.

Leader Completeness: if an entry is committed in term T, every leader elected in any later term will have that entry in its log. This follows directly from the election restriction above.

Log Compaction with Snapshots

The Raft log grows indefinitely. Periodic snapshots compact it:

  1. The state machine serializes its current state to a snapshot
  2. The snapshot records last_included_index and last_included_term
  3. Log entries before the snapshot point are discarded

When a follower is too far behind to catch up via AppendEntries (its log has been compacted), the leader sends an InstallSnapshot RPC with the full snapshot.
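The three compaction steps above can be sketched as follows. `Entry` and `compact` are illustrative names; a real implementation would persist the snapshot to stable storage before truncating the log:

```python
from dataclasses import dataclass

@dataclass
class Entry:
    index: int  # 1-based Raft log index
    term: int
    command: bytes

def compact(log: list[Entry], last_applied: int, state: bytes):
    """Snapshot the applied prefix of the log and discard it."""
    last_term = next(e.term for e in log if e.index == last_applied)
    snapshot = {
        "last_included_index": last_applied,
        "last_included_term": last_term,  # validates the next AppendEntries
        "state": state,                   # serialized state machine
    }
    return snapshot, [e for e in log if e.index > last_applied]
```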

Membership Changes

Changing the cluster membership (adding or removing nodes) is unsafe if done abruptly — two disjoint majorities might form. Raft uses joint consensus: for the transition period, both the old configuration C_old and new configuration C_new must each independently form a majority. Only after the joint consensus entry is committed does the cluster switch to C_new alone.
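The joint-consensus decision rule can be sketched as a pair of independent majority checks (`joint_majority` is a hypothetical helper):

```python
def joint_majority(acks: set[str], c_old: set[str], c_new: set[str]) -> bool:
    """During joint consensus, a vote or commit needs separate majorities
    in both the old and the new configuration."""
    def has_majority(cfg: set[str]) -> bool:
        return len(acks & cfg) * 2 > len(cfg)
    return has_majority(c_old) and has_majority(c_new)
```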

SQL Schema

-- Raft log (persisted to stable storage)
CREATE TABLE RaftLog (
    index     BIGINT PRIMARY KEY,
    term      BIGINT NOT NULL,
    command   BYTEA NOT NULL,
    committed BOOLEAN NOT NULL DEFAULT FALSE
);

-- Persistent Raft node state (survives crashes)
CREATE TABLE RaftState (
    node_id      TEXT PRIMARY KEY,
    current_term BIGINT NOT NULL DEFAULT 0,
    voted_for    TEXT,         -- node_id of last vote granted in current_term
    commit_index BIGINT NOT NULL DEFAULT 0,
    last_applied BIGINT NOT NULL DEFAULT 0
);

-- Peer registry and real-time role tracking
CREATE TABLE RaftPeer (
    node_id     TEXT PRIMARY KEY,
    address     TEXT NOT NULL,
    role        TEXT NOT NULL DEFAULT 'follower',
    last_seen_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    CONSTRAINT chk_role CHECK (role IN ('leader','follower','candidate'))
);

Python Implementation

import random
import threading
from dataclasses import dataclass
from typing import Optional


@dataclass
class LogEntry:
    index: int
    term: int
    command: bytes


@dataclass
class VoteRequest:
    term: int
    candidate_id: str
    last_log_index: int
    last_log_term: int


@dataclass
class AppendEntriesRequest:
    term: int
    leader_id: str
    prev_log_index: int
    prev_log_term: int
    entries: list[LogEntry]
    leader_commit: int


class RaftNode:
    ELECTION_TIMEOUT_MIN = 0.150  # seconds
    ELECTION_TIMEOUT_MAX = 0.300
    HEARTBEAT_INTERVAL  = 0.050

    def __init__(self, node_id: str, peers: list[str]):
        self.node_id = node_id
        self.peers = peers

        # Persistent state
        self.current_term = 0
        self.voted_for: Optional[str] = None
        self.log: list[LogEntry] = []

        # Volatile state
        self.commit_index = 0
        self.last_applied = 0
        self.role = "follower"
        self.leader_id: Optional[str] = None

        # Leader-only volatile state
        self.next_index: dict[str, int] = {}
        self.match_index: dict[str, int] = {}

        self._lock = threading.Lock()
        self._election_timer = self._reset_election_timer()

    def start_election(self) -> None:
        """Transition to candidate, increment term, request votes."""
        with self._lock:
            self.role = "candidate"
            self.current_term += 1
            self.voted_for = self.node_id
            votes = 1  # vote for self
            term = self.current_term  # capture under the lock
            last_index = len(self.log)
            last_term = self.log[-1].term if self.log else 0

        req = VoteRequest(
            term=term,
            candidate_id=self.node_id,
            last_log_index=last_index,
            last_log_term=last_term
        )

        for peer in self.peers:
            granted = self._send_vote_request(peer, req)
            if granted:
                votes += 1

        majority = (len(self.peers) + 1) // 2 + 1
        if votes >= majority:
            self._become_leader()

    def vote_for(self, candidate_id: str, term: int, candidate_last_index: int, candidate_last_term: int) -> bool:
        """Grant vote if candidate log is at least as up-to-date."""
        with self._lock:
            if term < self.current_term:
                return False
            if term > self.current_term:
                self.current_term = term
                self.voted_for = None
                self.role = "follower"
            if self.voted_for not in (None, candidate_id):
                return False
            # Up-to-date check
            my_last_term = self.log[-1].term if self.log else 0
            my_last_index = len(self.log)
            log_ok = (candidate_last_term > my_last_term or
                      (candidate_last_term == my_last_term and candidate_last_index >= my_last_index))
            if not log_ok:
                return False
            self.voted_for = candidate_id
            return True

    def append_entries(self, req: AppendEntriesRequest) -> bool:
        """Process AppendEntries RPC from leader."""
        with self._lock:
            if req.term < self.current_term:
                return False
            if req.term > self.current_term:
                self.current_term = req.term
                self.voted_for = None
            self.role = "follower"
            self.leader_id = req.leader_id

            # Consistency check: the entry before the new ones must match
            if req.prev_log_index > 0:
                if len(self.log) < req.prev_log_index:
                    return False
                if self.log[req.prev_log_index - 1].term != req.prev_log_term:
                    return False

            for entry in req.entries:
                if len(self.log) >= entry.index:
                    if self.log[entry.index - 1].term != entry.term:
                        # Conflict: truncate the divergent suffix
                        self.log = self.log[:entry.index - 1]
                    else:
                        continue
                self.log.append(entry)

            # Advance commit index
            if req.leader_commit > self.commit_index:
                self.commit_index = min(req.leader_commit, len(self.log))

            return True

    def apply_state_machine(self, entry: LogEntry) -> None:
        """Apply a committed log entry to the application state machine."""
        # Application-specific: parse and execute entry.command
        self.last_applied = entry.index

    def take_snapshot(self) -> dict:
        """Serialize current state machine state as a snapshot."""
        with self._lock:
            return {
                "last_included_index": self.last_applied,
                "last_included_term": self.log[self.last_applied - 1].term if self.log else 0,
                "state": {}  # Application state machine serialized here
            }

    def _become_leader(self) -> None:
        with self._lock:
            if self.role != "candidate":
                return  # stepped down while votes were in flight
            self.role = "leader"
            self.leader_id = self.node_id
            for peer in self.peers:
                self.next_index[peer] = len(self.log) + 1
                self.match_index[peer] = 0

    def _reset_election_timer(self) -> float:
        # Pick a fresh randomized timeout; an election loop (not shown) starts
        # an election when this long passes without a heartbeat.
        return random.uniform(self.ELECTION_TIMEOUT_MIN, self.ELECTION_TIMEOUT_MAX)

    def _send_vote_request(self, peer: str, req: VoteRequest) -> bool:
        # Send RPC to peer; return True if vote granted
        return False  # Placeholder

    def _send_heartbeat(self, peer: str) -> None:
        # Send empty AppendEntries as heartbeat
        pass  # Placeholder

Split Vote Prevention and Recovery

Randomized election timeouts are the primary mechanism for preventing split votes. Because each follower draws a fresh random timeout, it is unlikely that two candidates start elections at nearly the same moment. If a split vote does occur and no candidate reaches a majority, every candidate times out again with a new random delay, so one usually restarts first and collects a majority. In practice, split votes are rare and resolve automatically within one or two timeout cycles.
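A quick Monte Carlo sketch can estimate how often two nodes' timeouts land close enough to race. The 10 ms window standing in for a vote-request round trip, and the function itself, are illustrative assumptions:

```python
import random

def split_vote_risk(n_nodes: int = 5, window: float = 0.010,
                    lo: float = 0.150, hi: float = 0.300,
                    trials: int = 10_000, seed: int = 42) -> float:
    """Fraction of trials where the two earliest randomized election
    timeouts fire within `window` seconds of each other."""
    rng = random.Random(seed)
    close = 0
    for _ in range(trials):
        first, second = sorted(rng.uniform(lo, hi) for _ in range(n_nodes))[:2]
        if second - first < window:
            close += 1
    return close / trials
```

Even when two timeouts fire close together, a split vote only follows if neither candidate reaches a majority first, so the actual split-vote rate is lower than this estimate.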

FAQ

How does Raft prevent split votes during leader election?

Raft uses randomized election timeouts (e.g., 150–300 ms chosen uniformly at random per node). When followers time out at different times, the first to time out becomes a candidate and starts collecting votes before others time out. This probabilistically prevents two candidates from starting simultaneously. In the rare case of a split vote where no candidate gets a majority, all candidates time out again with new random delays, quickly resolving the split.

What is the Log Matching Property in Raft?

If two log entries in any two nodes have the same index and term, then those entries are identical, and all preceding entries are also identical. This is guaranteed because the leader creates at most one log entry per index per term, and the AppendEntries consistency check verifies that the preceding entry matches before accepting new entries. This property ensures logs across the cluster are consistent prefixes of each other.

When should a Raft leader take a snapshot for log compaction?

A snapshot should be triggered when the Raft log grows beyond a configured size threshold (e.g., 100 MB or 100,000 entries). The snapshot captures the state machine state at the current last_applied index, and all log entries up to that index are discarded. When a follower is so far behind that the leader has already discarded the entries needed to catch it up, the leader sends an InstallSnapshot RPC instead of AppendEntries.

How does Raft handle membership changes safely?

Raft uses joint consensus. During the transition from old configuration C_old to new configuration C_new, both configurations must independently form a majority for any decision. The joint consensus phase is itself committed as a log entry, and only after that entry is committed does the cluster finalize the switch to C_new alone. This prevents two disjoint majorities from forming, the split-brain scenario that makes direct configuration switches unsafe.


