Introduction
Consensus algorithms allow a cluster of nodes to agree on a sequence of values despite node failures. They are the foundation for distributed databases (etcd, CockroachDB, TiKV), coordination services (ZooKeeper), and replicated state machines. Without consensus, a distributed system cannot safely tolerate failures while maintaining consistency.
Raft Overview
Raft was designed to be more understandable than Paxos. It decomposes consensus into three relatively independent sub-problems: leader election, log replication, and safety. Each server in a Raft cluster is in one of three states: Leader, Follower, or Candidate. Time is divided into terms — monotonically increasing integers — each beginning with an election. At most one leader exists per term.
Raft Leader Election
All nodes start as Followers. If a Follower receives no heartbeat within its election timeout (randomized, typically in the 150–300 ms range), it becomes a Candidate. The Candidate increments its current term, votes for itself, and sends RequestVote RPCs to all other nodes. A node grants its vote if the candidate's term is at least its own, it has not already voted for a different candidate in that term, and the candidate's log is at least as up-to-date as its own. A Candidate that receives votes from a majority (⌊N/2⌋ + 1 of N nodes) transitions to Leader and begins sending heartbeats to reset the election timers of all Followers.
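The vote-granting rule above can be sketched as a small handler. This is an illustrative sketch, not code from any real implementation; the class, field, and method names (Node, handle_request_vote, voted_for) are hypothetical.

```python
import random

class Node:
    def __init__(self):
        self.current_term = 0
        self.voted_for = None          # candidate voted for in current_term
        self.log = []                  # list of (term, command) entries
        # Randomized election timeout (ms) staggers candidate starts.
        self.election_timeout = random.uniform(150, 300)

    def handle_request_vote(self, cand_term, cand_id, cand_last_index, cand_last_term):
        # Reject candidates with a stale term outright.
        if cand_term < self.current_term:
            return False
        # A higher term starts a new term and clears our previous vote.
        if cand_term > self.current_term:
            self.current_term = cand_term
            self.voted_for = None
        # "At least as up-to-date": compare last entry's term, then index.
        my_last_term = self.log[-1][0] if self.log else 0
        my_last_index = len(self.log)
        up_to_date = (cand_last_term, cand_last_index) >= (my_last_term, my_last_index)
        if self.voted_for in (None, cand_id) and up_to_date:
            self.voted_for = cand_id
            return True
        return False
```

Note that clearing voted_for when a higher term arrives is what enforces "one vote per term": the node may vote again only because the term changed.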
Raft Log Replication
The Leader receives a client request, appends the entry to its own log, and sends AppendEntries RPCs to all Followers in parallel. Once a majority acknowledges the entry, the Leader commits it, applies it to its state machine, and responds to the client. Followers apply committed entries in order. The log matching property guarantees that if two logs contain an entry with the same index and term, then all preceding entries in both logs are identical — enforced by AppendEntries consistency checks.
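The commit rule can be illustrated with a small helper that derives the commit index from follower acknowledgments (the matchIndex values in the Raft paper's terminology). The function name and argument layout are hypothetical, and this sketch ignores the paper's additional restriction that only entries from the leader's current term are committed by counting.

```python
def commit_index(match_indexes, leader_last_index):
    # Tally the leader's own log alongside follower acknowledgments.
    acked = sorted(match_indexes + [leader_last_index], reverse=True)
    # Position len//2 in the descending list is the highest index that
    # a strict majority of servers has replicated.
    majority_pos = len(acked) // 2
    return acked[majority_pos]
```

For a 5-node cluster where the leader is at index 5 and followers acknowledge indexes 5, 5, 3, 2, three servers hold index 5, so the commit index advances to 5.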
Raft Safety
Raft provides several safety guarantees. Election safety: at most one leader can be elected per term. Log matching: AppendEntries includes a consistency check against the previous log entry; Followers reject requests that do not match. Leader completeness: a candidate cannot win an election unless its log contains all committed entries — RequestVote is rejected if the candidate’s log is less up-to-date. State machine safety: if any server applies a log entry at index i, no other server will ever apply a different entry at that same index.
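The log matching check can be sketched as the follower-side logic of AppendEntries. This is a simplified illustration with hypothetical names; indexes are 1-based as in the Raft paper, and entries are (term, command) pairs.

```python
def append_entries(log, prev_index, prev_term, entries):
    # Consistency check: our log must contain an entry at prev_index
    # whose term matches prev_term (prev_index 0 means "start of log").
    if prev_index > 0:
        if len(log) < prev_index or log[prev_index - 1][0] != prev_term:
            return log, False   # reject; leader will retry with earlier index
    # Truncate any conflicting suffix, then append the leader's entries.
    log = log[:prev_index] + entries
    return log, True
```

On rejection, the leader decrements its notion of the follower's next index and retries, walking back until the logs agree, which is how diverged follower logs are overwritten to match the leader's.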
Paxos Overview
Paxos, devised by Leslie Lamport (his paper "The Part-Time Parliament" circulated from 1989 and was published in 1998), was the first consensus algorithm with a rigorous correctness proof. Single-decree Paxos reaches agreement on a single value. It defines three roles: Proposer, Acceptor, and Learner. The protocol proceeds in two phases: Prepare and Accept. Multi-Paxos extends single-decree Paxos to a replicated log by electing a stable leader and skipping the Prepare phase for subsequent rounds, reducing message latency to one round-trip in the common case.
Paxos Phases
Phase 1 — Prepare: the Proposer selects a proposal number n and sends Prepare(n) to a majority of Acceptors. Each Acceptor responds with a Promise not to accept any proposal numbered less than n, along with the highest-numbered proposal it has already accepted (if any). Phase 2 — Accept: the Proposer sends Accept(n, v), where v is the value of the highest-numbered accepted proposal reported in Phase 1, or the Proposer's own value if none was reported. An Acceptor accepts the proposal unless it has already promised a higher-numbered proposal, i.e., responded to a Prepare(m) with m > n. A Learner learns the chosen value once a majority of Acceptors have accepted the same proposal number.
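The acceptor's side of both phases fits in a few lines of state. This is a minimal single-decree sketch with illustrative names (Acceptor, promised, accepted); a real implementation would also persist this state before replying.

```python
class Acceptor:
    def __init__(self):
        self.promised = 0          # highest proposal number promised so far
        self.accepted = None       # (number, value) of last accepted proposal

    def prepare(self, n):
        # Promise to ignore proposals numbered below n, and report any
        # previously accepted proposal so the proposer adopts its value.
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def accept(self, n, v):
        # Accept unless a strictly higher number has been promised.
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, v)
            return True
        return False
```

Returning the previously accepted proposal from prepare() is the crux of safety: it forces a later proposer to re-propose a possibly chosen value rather than its own.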
Practical Considerations
Raft is often preferred in new systems due to its clarity and the availability of well-tested implementations. Multi-Raft enables horizontal scalability by assigning a separate Raft group to each data partition. The pre-vote extension prevents disruptive elections caused by partitioned nodes that increment their term before rejoining. Leader leases allow low-latency reads without a full round-trip to followers. Joint Consensus handles cluster membership changes — adding or removing nodes — safely by requiring agreement from both the old and new configurations during the transition.
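The joint-consensus quorum rule can be shown with a small helper: during the transition, an entry (or an election) needs majorities in both the old and the new configuration. The function and argument names here are hypothetical.

```python
def joint_quorum(acks, old_config, new_config):
    # acks: set of node ids that acknowledged the entry (or granted a vote).
    def majority(config):
        votes = sum(1 for node in config if node in acks)
        return votes > len(config) // 2
    # During joint consensus, BOTH configurations must reach a majority.
    return majority(old_config) and majority(new_config)
```

This dual requirement is what prevents a window in which two disjoint majorities (one per configuration) could each commit conflicting entries.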
Frequently Asked Questions: Consensus Algorithms (Raft and Paxos)
How does Raft’s leader election randomized timeout mechanism work?
Each Raft follower picks a randomized election timeout, typically between 150 ms and 300 ms. When a follower receives no heartbeat from the leader within that window, it increments its term, transitions to the candidate state, and broadcasts RequestVote RPCs. Randomization staggers election starts across nodes, sharply reducing the chance that two candidates start elections at the same time and split the vote, and allowing one candidate to win a majority before others time out.
How does Raft log replication use majority acknowledgment to ensure durability?
The leader appends a new entry to its local log and sends AppendEntries RPCs to all followers in parallel. Once a majority (quorum) of nodes — including the leader — have written the entry to their logs and responded with success, the leader marks the entry as committed and applies it to the state machine. It then notifies followers of the new commit index on the next heartbeat so they can apply the entry locally. This majority requirement guarantees that any future elected leader will contain every committed entry.
How does Raft compare to Paxos in terms of understandability?
Raft was explicitly designed for understandability, decomposing consensus into three relatively independent sub-problems: leader election, log replication, and safety. It enforces a strong leader model where all writes flow through a single leader, simplifying reasoning about data flow. Paxos (and especially Multi-Paxos) is considered significantly harder to understand and implement correctly because the original paper leaves many practical details unspecified — such as leader election and log management — leading to wide implementation variation and subtle bugs in practice.
What is multi-Raft and how does it enable scalable partitioned systems?
Multi-Raft runs multiple independent Raft consensus groups (regions) across the same cluster of nodes, each group owning a contiguous key range or partition of data. Each node typically participates as a leader in some groups and a follower in others, distributing leader load evenly. This avoids the single-leader bottleneck of a single Raft group and allows horizontal scaling of both throughput and storage. Systems like TiKV and CockroachDB use multi-Raft to manage thousands of regions across large clusters.
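Routing a key to its owning Raft group typically reduces to a sorted-range lookup over region boundaries. The sketch below is illustrative of the TiKV/CockroachDB style of region map, not their actual code; the names and layout are hypothetical.

```python
import bisect

def find_group(range_starts, group_ids, key):
    # range_starts is a sorted list of each region's inclusive start key;
    # the group owning `key` is the one with the rightmost start <= key.
    i = bisect.bisect_right(range_starts, key) - 1
    return group_ids[i]
```

A real region map also tracks each group's current leader so the client (or a routing layer) can send the request straight to it.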
What is a leader lease and how does it enable low-latency linearizable reads?
A leader lease is a time-bounded guarantee that a Raft leader holds exclusive leadership for a fixed interval (e.g., the election timeout duration). After winning election and confirming its lease is valid, the leader can serve read requests directly from its local state without issuing a round-trip ReadIndex or log entry — eliminating a network round trip. The lease relies on bounded clock drift between nodes; if clocks skew beyond the lease duration, the guarantee breaks, so this optimization requires careful clock discipline or a hardware clock assumption.
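The clock-drift caveat can be made concrete: before serving a local read, the leader checks that its lease, shrunk by the worst-case drift bound, is still in force. This is a hypothetical sketch (class and method names are illustrative), with the clock injectable for testing.

```python
import time

class LeaderLease:
    def __init__(self, duration_s, max_drift_s, clock=time.monotonic):
        self.duration = duration_s      # lease length granted by the quorum
        self.max_drift = max_drift_s    # assumed worst-case clock skew
        self.clock = clock
        self.acquired_at = None

    def renew(self):
        # Called after a quorum of heartbeat responses confirms leadership.
        self.acquired_at = self.clock()

    def can_serve_local_read(self):
        # Serve reads locally only while the lease, shrunk by the drift
        # margin, is still valid; otherwise fall back to ReadIndex.
        if self.acquired_at is None:
            return False
        return self.clock() - self.acquired_at < self.duration - self.max_drift
```

Shrinking the window by max_drift is the safety margin: if real skew ever exceeds that bound, two nodes could both believe they hold a valid lease, which is exactly the failure mode the text warns about.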