Raft is the consensus algorithm designed to be understandable. It powers etcd (Kubernetes backbone), CockroachDB, TiKV, and many other systems that require strongly consistent distributed state. This post covers the complete low-level design: leader election, log replication, safety guarantees, log compaction, and membership changes.
Roles and Terms
Every node is in one of three roles:
- Leader: handles all client writes; replicates log entries to followers; sends periodic heartbeats
- Follower: passive; replicates log entries from the leader; resets election timeout on heartbeat receipt
- Candidate: transitional state during election
Terms are monotonically increasing logical clocks. Every message carries the sender's current term. If a node receives a message with a higher term, it immediately reverts to follower and updates its term. A node refuses any message from a lower-term sender.
Leader Election
A follower becomes a candidate when its election timeout fires (no heartbeat received). Election timeouts are randomized (e.g., 150–300 ms) to prevent all followers timing out simultaneously and splitting votes.
On becoming a candidate:
- Increment current term
- Vote for self
- Send RequestVote RPC to all other nodes, including own last log index and term
A node grants a vote if: (a) it has not voted in this term yet, and (b) the candidate's log is at least as up-to-date as the voter's log. Log up-to-date comparison: higher last log term wins; if equal, higher last log index wins.
The candidate wins if it receives votes from a majority (N/2 + 1). On winning, it immediately sends heartbeats to establish authority and prevent new elections.
Log Replication
All client writes go to the leader. The leader:
- Appends the entry to its local log with the current term
- Sends AppendEntries RPC to all followers in parallel
- Once a majority acknowledge the entry, marks it committed by advancing
commit_index - Applies the committed entry to the state machine
- Returns success to the client
Followers apply entries to their state machine once the leader's commit_index advances past those entries (communicated via the next AppendEntries or heartbeat).
Consistency check: AppendEntries includes the previous log entry's index and term. If a follower's log does not match, it rejects the RPC. The leader retries with earlier entries until a matching point is found, then overwrites divergent follower entries.
Safety Guarantees
Election restriction: a candidate cannot win election without having all committed entries in its log. Since committed entries exist on a majority of nodes, and the candidate must receive votes from a majority, at least one voter has the committed entries. The up-to-date check ensures the candidate has them too.
Log Matching Property: if two log entries in different nodes have the same index and term, they are identical, and all preceding entries are identical. This is guaranteed because the leader creates at most one entry per index per term, and followers reject conflicting entries.
Leader Completeness: if an entry is committed in term T, every future leader elected in term T+1 or later will have that entry in its log.
Log Compaction with Snapshots
The Raft log grows indefinitely. Periodic snapshots compact it:
- The state machine serializes its current state to a snapshot
- The snapshot records
last_included_indexandlast_included_term - Log entries before the snapshot point are discarded
When a follower is too far behind to catch up via AppendEntries (its log has been compacted), the leader sends an InstallSnapshot RPC with the full snapshot.
Membership Changes
Changing the cluster membership (adding or removing nodes) is unsafe if done abruptly — two disjoint majorities might form. Raft uses joint consensus: for the transition period, both the old configuration C_old and new configuration C_new must each independently form a majority. Only after the joint consensus entry is committed does the cluster switch to C_new alone.
SQL Schema
-- Raft log (persisted to stable storage)
CREATE TABLE RaftLog (
index BIGINT PRIMARY KEY,
term BIGINT NOT NULL,
command BYTEA NOT NULL,
committed BOOLEAN NOT NULL DEFAULT FALSE
);
-- Persistent Raft node state (survived crashes)
CREATE TABLE RaftState (
node_id TEXT PRIMARY KEY,
current_term BIGINT NOT NULL DEFAULT 0,
voted_for TEXT, -- node_id of last vote granted in current_term
commit_index BIGINT NOT NULL DEFAULT 0,
last_applied BIGINT NOT NULL DEFAULT 0
);
-- Peer registry and real-time role tracking
CREATE TABLE RaftPeer (
node_id TEXT PRIMARY KEY,
address TEXT NOT NULL,
role TEXT NOT NULL DEFAULT 'follower',
last_seen_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
CONSTRAINT chk_role CHECK (role IN ('leader','follower','candidate'))
);
Python Implementation
import random
import time
import threading
from dataclasses import dataclass, field
from typing import Optional
@dataclass
class LogEntry:
index: int
term: int
command: bytes
@dataclass
class VoteRequest:
term: int
candidate_id: str
last_log_index: int
last_log_term: int
@dataclass
class AppendEntriesRequest:
term: int
leader_id: str
prev_log_index: int
prev_log_term: int
entries: list[LogEntry]
leader_commit: int
class RaftNode:
ELECTION_TIMEOUT_MIN = 0.150 # seconds
ELECTION_TIMEOUT_MAX = 0.300
HEARTBEAT_INTERVAL = 0.050
def __init__(self, node_id: str, peers: list[str]):
self.node_id = node_id
self.peers = peers
# Persistent state
self.current_term = 0
self.voted_for: Optional[str] = None
self.log: list[LogEntry] = []
# Volatile state
self.commit_index = 0
self.last_applied = 0
self.role = "follower"
self.leader_id: Optional[str] = None
# Leader-only volatile state
self.next_index: dict[str, int] = {}
self.match_index: dict[str, int] = {}
self._lock = threading.Lock()
self._election_timer = self._reset_election_timer()
def start_election(self) -> None:
"""Transition to candidate, increment term, request votes."""
with self._lock:
self.role = "candidate"
self.current_term += 1
self.voted_for = self.node_id
votes = 1 # vote for self
last_index = len(self.log)
last_term = self.log[-1].term if self.log else 0
req = VoteRequest(
term=self.current_term,
candidate_id=self.node_id,
last_log_index=last_index,
last_log_term=last_term
)
for peer in self.peers:
granted = self._send_vote_request(peer, req)
if granted:
votes += 1
majority = (len(self.peers) + 1) // 2 + 1
if votes >= majority:
self._become_leader()
def vote_for(self, candidate_id: str, term: int, candidate_last_index: int, candidate_last_term: int) -> bool:
"""Grant vote if candidate log is at least as up-to-date."""
with self._lock:
if term self.current_term:
self.current_term = term
self.voted_for = None
self.role = "follower"
if self.voted_for not in (None, candidate_id):
return False
# Up-to-date check
my_last_term = self.log[-1].term if self.log else 0
my_last_index = len(self.log)
log_ok = (candidate_last_term > my_last_term or
(candidate_last_term == my_last_term and candidate_last_index >= my_last_index))
if not log_ok:
return False
self.voted_for = candidate_id
return True
def append_entries(self, req: AppendEntriesRequest) -> bool:
"""Process AppendEntries RPC from leader."""
with self._lock:
if req.term 0:
if len(self.log) = entry.index:
if self.log[entry.index - 1].term != entry.term:
self.log = self.log[:entry.index - 1]
else:
continue
self.log.append(entry)
# Advance commit index
if req.leader_commit > self.commit_index:
self.commit_index = min(req.leader_commit, len(self.log))
return True
def apply_state_machine(self, entry: LogEntry) -> None:
"""Apply a committed log entry to the application state machine."""
# Application-specific: parse and execute entry.command
self.last_applied = entry.index
def take_snapshot(self) -> dict:
"""Serialize current state machine state as a snapshot."""
with self._lock:
return {
"last_included_index": self.last_applied,
"last_included_term": self.log[self.last_applied - 1].term if self.log else 0,
"state": {} # Application state machine serialized here
}
def _become_leader(self) -> None:
with self._lock:
self.role = "leader"
self.leader_id = self.node_id
for peer in self.peers:
self.next_index[peer] = len(self.log) + 1
self.match_index[peer] = 0
def _reset_election_timer(self) -> float:
timeout = random.uniform(self.ELECTION_TIMEOUT_MIN, self.ELECTION_TIMEOUT_MAX)
return timeout
def _send_vote_request(self, peer: str, req: VoteRequest) -> bool:
# Send RPC to peer; return True if vote granted
return False # Placeholder
def _send_heartbeat(self, peer: str) -> None:
# Send empty AppendEntries as heartbeat
pass # Placeholder
Split Vote Prevention and Recovery
Randomized election timeouts are the primary mechanism for preventing split votes. If two candidates start elections simultaneously, one will time out and restart before the other, giving the faster one time to collect a majority. In practice, split votes are rare and resolve automatically within one or two timeout cycles.
See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering
See also: Meta Interview Guide 2026: Facebook, Instagram, WhatsApp Engineering