Raft is a distributed consensus algorithm designed to be understandable — a deliberate contrast to Paxos, which is notoriously difficult to reason about and implement correctly. Raft achieves distributed consensus by electing a leader that manages all writes, replicating entries to followers, and committing once a majority has confirmed. Systems like etcd, CockroachDB, TiKV, and Consul use Raft as their consensus layer.
Three Subproblems
Raft decomposes consensus into three largely independent subproblems: leader election (at most one leader can be elected per term), log replication (the leader accepts entries from clients and replicates them to followers), and safety (committed entries survive leader changes). This decomposition is what makes Raft understandable: each subproblem has a clear, self-contained solution.
Leader Election
Time is divided into terms (monotonically increasing integers), and each term begins with a leader election. A follower that hasn't received a heartbeat from the leader within its election timeout (typically randomized in the 150-300 ms range) becomes a candidate: it increments its current term, votes for itself, and sends RequestVote RPCs to all other nodes. A node grants its vote if the candidate's term is at least its own current term, it hasn't already voted for a different candidate in this term, and the candidate's log is at least as up-to-date as its own (this prevents a node missing committed entries from becoming leader). The first candidate to collect votes from a majority becomes leader for that term. Split votes (several candidates dividing the vote so nobody reaches a majority) are resolved by the randomized timeouts: each node waits a random duration before starting the next election, so one candidate usually gets a head start.
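The vote-granting rules above can be sketched as a RequestVote handler. This is a minimal illustration, not a production implementation; the names (`NodeState`, `handle_request_vote`) are invented for this example:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class NodeState:
    current_term: int = 0
    voted_for: Optional[str] = None
    last_log_term: int = 0
    last_log_index: int = 0

def handle_request_vote(node: NodeState, candidate_id: str, term: int,
                        last_log_term: int, last_log_index: int) -> bool:
    """Return True if this node grants its vote to the candidate."""
    if term > node.current_term:
        # Newer term observed: adopt it and forget any vote cast in the old term.
        node.current_term = term
        node.voted_for = None
    if term < node.current_term:
        return False  # stale candidate
    # "Up-to-date" check: compare (last log term, last log index) lexicographically.
    log_ok = (last_log_term, last_log_index) >= (node.last_log_term, node.last_log_index)
    if node.voted_for in (None, candidate_id) and log_ok:
        node.voted_for = candidate_id
        return True
    return False
```

Note that a node votes for at most one candidate per term (first come, first served), which is what makes a majority of votes imply a unique winner.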
Log Replication
The leader is the sole source of log entries, and clients send all writes to it. The leader appends each entry to its own log and sends AppendEntries RPCs to all followers, who append the entry to their local logs and acknowledge. Once a majority (including the leader) has appended the entry, the leader marks it committed and applies it to its state machine. Committed entries are safe: they will not be lost even if the leader crashes. The leader includes its commit index in subsequent AppendEntries RPCs (and heartbeats), and followers apply entries up to that index to their own state machines.
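The leader's commit decision can be sketched as follows. The sketch assumes the leader tracks, per server, the highest index known to be replicated there (the paper calls this matchIndex), and it includes Raft's extra rule that the leader only counts replicas toward commitment for entries from its own term; the function name is invented for this example:

```python
def advance_commit_index(match_index: list[int], commit_index: int,
                         log_term_at: dict[int, int], current_term: int) -> int:
    """Return the leader's new commit index.

    match_index: for every server (leader included), the highest log index
    known to be replicated on that server. An index N becomes committed once
    a majority of servers store it AND the entry at N is from the leader's
    own term; earlier-term entries become committed indirectly.
    """
    for n in sorted(set(match_index), reverse=True):
        if n <= commit_index:
            break  # nothing new to commit
        replicated = sum(1 for m in match_index if m >= n)
        if 2 * replicated > len(match_index) and log_term_at.get(n) == current_term:
            return n
    return commit_index
```

The own-term restriction is Raft's guard against a subtle failure mode where a quorum on an old-term entry can still be overwritten by a later leader.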
Safety: Log Matching Property
Raft's safety guarantee: if a log entry is committed in a given term, it is present in the log of every future leader. This follows from the election rule: a candidate can only win if its log is at least as up-to-date as the logs of a majority of nodes, and any committed entry is, by definition, stored on a majority, so the two majorities must overlap. "Up-to-date" is determined by comparing (last log term, last log index): the higher last term wins; with equal terms, the higher index wins. Any new leader therefore holds every committed entry, so leader transitions cannot lose committed data.
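The "up-to-date" comparison is small enough to show in full (the function name is invented for this sketch):

```python
def log_up_to_date(cand_last_term: int, cand_last_index: int,
                   my_last_term: int, my_last_index: int) -> bool:
    """True if the candidate's log is at least as up-to-date as mine."""
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term  # higher last term wins
    return cand_last_index >= my_last_index   # same term: longer log wins
```

Note that a longer log does not beat a higher last term: term takes priority because a higher term implies the entry was written by a more recent leader.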
Leader Failover
When a leader fails, followers stop receiving heartbeats and start an election after their election timeout. The new leader is guaranteed by the up-to-date election rule to hold all previously committed entries. Followers, however, may hold uncommitted entries that the old leader replicated to only some of them; where these conflict with the new leader's log, the new leader overwrites them. This is correct: uncommitted entries were never acknowledged to clients or applied to the state machine, so overwriting them loses no acknowledged write.
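The overwrite happens inside the follower's AppendEntries handling. A minimal sketch, assuming the log is a list of (term, command) pairs and `prev_index`/`prev_term` describe the entry immediately preceding the new ones (the function name is invented for this example):

```python
def follower_append_entries(log: list, prev_index: int, prev_term: int,
                            entries: list) -> bool:
    """Apply an AppendEntries RPC to a follower's log.

    The log is a Python list of (term, command) pairs; Raft's 1-based log
    index i lives at list position i - 1. Returns False if the consistency
    check fails, in which case the leader retries with an earlier prev_index
    until the two logs agree.
    """
    # Consistency check: the follower must already hold the preceding entry.
    if prev_index > 0 and (prev_index > len(log) or log[prev_index - 1][0] != prev_term):
        return False
    i = prev_index  # 0-based position of the first new entry
    for term, cmd in entries:
        if i < len(log) and log[i][0] != term:
            del log[i:]  # conflicting (necessarily uncommitted) suffix is discarded
        if i >= len(log):
            log.append((term, cmd))
        # else: entry already present (duplicate RPC); keep it as-is
        i += 1
    return True
```

Because committed entries are on a majority and the new leader's log contains them, the truncated suffix can only ever contain uncommitted entries.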
Performance Characteristics
Raft is designed for strong consistency, not peak throughput. All writes go through a single leader, so write throughput is bounded by the leader's capacity. Read performance can be improved with: (1) lease-based leader reads, where the leader serves reads locally while holding a time-bounded lease, avoiding extra messages at the cost of depending on bounded clock drift; (2) ReadIndex, where the leader records its commit index and confirms it is still leader with one heartbeat round before serving the read; (3) follower reads, where a follower obtains a ReadIndex from the leader and serves the read once its state machine has applied up to it. Write latency is dominated by the majority-replication round trip: typically 1-10 ms within a datacenter, and the inter-site network RTT when the cluster spans datacenters.
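The ReadIndex flow can be sketched as three steps: pin the commit index, prove leadership with one heartbeat round, then wait for the state machine to catch up. This is a schematic, not a real API; `Leader`, `confirm_leadership`, and `apply_next` are placeholders invented for the example:

```python
from dataclasses import dataclass

@dataclass
class Leader:
    commit_index: int   # highest committed log index
    last_applied: int   # highest index applied to the state machine

def serve_read(leader: Leader, confirm_leadership, apply_next):
    """ReadIndex protocol sketch: returns the index the read is safe at,
    or None if leadership could not be confirmed."""
    read_index = leader.commit_index          # 1. pin the current commit index
    if not confirm_leadership():              # 2. heartbeat round acked by a majority?
        return None                           #    stepped down; client retries elsewhere
    while leader.last_applied < read_index:   # 3. catch the state machine up
        apply_next(leader)
    return read_index
```

The heartbeat round is what makes the read linearizable without writing to the log: it rules out the case where a newer leader has already committed entries this node hasn't seen.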