Consensus algorithms allow a cluster of nodes to agree on a single value even when some nodes fail or messages are delayed. Consensus is the foundation of distributed coordination: etcd, ZooKeeper, CockroachDB, TiKV, and Kafka all use consensus for leader election, configuration management, and fault-tolerant log replication. Paxos was the first practical consensus algorithm, described by Leslie Lamport in "The Part-Time Parliament" (written in 1989, published in 1998). Raft was designed specifically to be more understandable than Paxos and has become the dominant algorithm in modern systems. Both guarantee safety (never decide conflicting values) and liveness (eventually make progress when enough nodes are available).
Paxos: Multi-Phase Agreement
Basic Paxos (single-value consensus):
Phase 1 (Prepare): a proposer selects a unique proposal number N and sends Prepare(N) to a majority of acceptors. Each acceptor promises not to accept any proposal with a number less than N, and returns the highest-numbered proposal it has already accepted (if any).
Phase 2 (Accept): if the proposer receives promises from a majority, it sends Accept(N, value), where value is either the value of the highest-numbered already-accepted proposal (if any acceptor reported one) or the proposer's own value. An acceptor accepts the proposal unless it has already promised a higher-numbered proposal.
Learn: once a majority of acceptors accept a proposal, the value is decided.
Multi-Paxos (for a log of decisions) elects a stable leader who skips Phase 1 for subsequent proposals, reducing each log entry to a single round trip.
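The two phases above can be sketched as a small in-process simulation. This is a minimal illustration, not a real Paxos implementation: the `Acceptor` class, `propose` helper, and -1 sentinel for "nothing accepted yet" are all assumptions made for the sketch, and real deployments add networking, persistence, and duelling-proposer handling.

```python
# Sketch of Basic Paxos acceptor state and a single proposer round.
# Names (Acceptor, propose) are illustrative, not from any library.

class Acceptor:
    def __init__(self):
        self.promised_n = -1        # highest proposal number promised
        self.accepted_n = -1        # highest proposal number accepted (-1: none)
        self.accepted_value = None  # value of that accepted proposal

    def prepare(self, n):
        """Phase 1: promise to ignore proposals numbered below n."""
        if n > self.promised_n:
            self.promised_n = n
            # Report any previously accepted proposal so the proposer
            # must re-propose its value instead of its own.
            return True, self.accepted_n, self.accepted_value
        return False, None, None

    def accept(self, n, value):
        """Phase 2: accept unless a higher-numbered Prepare was promised."""
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_value = value
            return True
        return False

def propose(acceptors, n, value):
    """Run both phases against all acceptors; return the decided value or None."""
    majority = len(acceptors) // 2 + 1
    promises = [a.prepare(n) for a in acceptors]
    granted = [(an, av) for ok, an, av in promises if ok]
    if len(granted) < majority:
        return None
    # Use the value of the highest-numbered already-accepted proposal, if any.
    prior = max(granted, key=lambda p: p[0])
    chosen = prior[1] if prior[0] >= 0 else value
    acks = sum(a.accept(n, chosen) for a in acceptors)
    return chosen if acks >= majority else None

acceptors = [Acceptor() for _ in range(3)]
print(propose(acceptors, 1, "x"))  # "x" is decided
print(propose(acceptors, 2, "y"))  # still "x": the decided value is re-proposed
```

The second call shows the safety rule in action: a later proposer with a higher number cannot overwrite a decided value, because the Prepare responses force it to re-propose "x".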
// Raft leader election summary
// Each node starts as a Follower with a random election timeout (150-300ms)
// If no heartbeat received in election timeout: become Candidate
// - Increment term, vote for self, send RequestVote RPCs to all nodes
// - If majority vote received: become Leader
// - If another leader discovered (AppendEntries received): become Follower
// - If election timeout expires without majority: start new election
// Leader sends AppendEntries (heartbeats + log entries) every heartbeat interval (~50ms)
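The election transitions summarized above can be sketched as a tiny state machine. This is a single-threaded simulation only, with illustrative names (`Node`, `on_votes`); real implementations run timers and RPCs concurrently.

```python
# Minimal sketch of a Raft node's election transitions (illustrative, not a
# real Raft library).
import random

FOLLOWER, CANDIDATE, LEADER = "follower", "candidate", "leader"

class Node:
    def __init__(self, node_id, cluster_size):
        self.id = node_id
        self.cluster_size = cluster_size
        self.role = FOLLOWER
        self.term = 0
        self.voted_for = None
        # Randomized timeout (150-300 ms) makes simultaneous candidacies unlikely.
        self.election_timeout_ms = random.randint(150, 300)

    def on_election_timeout(self):
        """No heartbeat within the timeout: become candidate, start an election."""
        self.role = CANDIDATE
        self.term += 1
        self.voted_for = self.id  # vote for self, then send RequestVote RPCs

    def on_votes(self, votes_granted):
        """Count granted votes plus our own; a majority wins the term."""
        if self.role == CANDIDATE and votes_granted + 1 > self.cluster_size // 2:
            self.role = LEADER

    def on_append_entries(self, leader_term):
        """A heartbeat from a leader with a >= term demotes us to follower."""
        if leader_term >= self.term:
            self.term = leader_term
            self.role = FOLLOWER

n = Node("a", cluster_size=5)
n.on_election_timeout()      # candidate in term 1
n.on_votes(votes_granted=2)  # 2 peers + self = 3 of 5: majority
print(n.role, n.term)        # leader 1
```

If the timeout fires again before a majority is reached, `on_election_timeout` simply starts a fresh election in a higher term, matching the last transition in the summary above.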
// Raft log replication
// Leader receives client write: appends to local log
// Leader sends AppendEntries(term, prevLogIndex, prevLogTerm, entries[]) to followers
// Follower: if the entry at prevLogIndex has term prevLogTerm, delete any conflicting entries after it, append the new entries, return success; otherwise reject
// Leader: once majority acknowledge, mark entry as committed, apply to state machine
// Leader notifies followers of commit index in next AppendEntries
// Followers apply committed entries to their state machines
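The follower-side consistency check can be sketched as a pure function. Assumptions for the sketch: each log entry is a `(term, command)` tuple and indices are 0-based with -1 meaning "before the first entry" (the Raft paper uses 1-based indices).

```python
# Sketch of the follower's AppendEntries consistency check (illustrative).

def append_entries(log, prev_log_index, prev_log_term, entries):
    """Return (success, new_log). Reject unless the log contains an entry
    at prev_log_index whose term matches prev_log_term."""
    if prev_log_index >= 0:
        if prev_log_index >= len(log) or log[prev_log_index][0] != prev_log_term:
            return False, log  # leader retries with an earlier prevLogIndex
    # Truncate any conflicting suffix, then append the leader's entries.
    new_log = log[:prev_log_index + 1] + entries
    return True, new_log

follower_log = [(1, "set x=1"), (1, "set x=2"), (2, "set x=3")]
# Leader sends an entry following index 1 (term 1); the follower's
# conflicting term-2 suffix is overwritten.
ok, follower_log = append_entries(follower_log, 1, 1, [(3, "set x=9")])
print(ok, follower_log)  # True [(1, 'set x=1'), (1, 'set x=2'), (3, 'set x=9')]
```

The rejection path is what gives Raft its log matching property: the leader walks `prevLogIndex` backwards until the check passes, then overwrites everything after the match.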
Raft: Understandable Consensus
Raft decomposes consensus into three problems:
Leader election: exactly one leader per term (election timeouts with randomized timers prevent split votes). The leader has full authority; all writes go through it.
Log replication: the leader appends client requests to its log, replicates them to followers, and commits an entry once a majority acknowledges it. Followers reject entries that conflict with their log (the log matching property).
Safety: a vote is granted only if the candidate's log is at least as up-to-date as the voter's, so only nodes with an up-to-date log can win an election. This ensures committed entries are never lost during leader changes.
Raft is easier to understand than Paxos because it has a single leader per term and cleanly separates leader election from log replication.
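The "at least as up-to-date" rule has a precise definition worth knowing for interviews: compare the last entries' terms first, and fall back to log length only on a tie. A sketch (function name is illustrative):

```python
# Raft's vote-granting log comparison: last terms first, then indices on a tie.

def candidate_log_ok(candidate_last_term, candidate_last_index,
                     voter_last_term, voter_last_index):
    """True if the candidate's log is at least as up-to-date as the voter's."""
    if candidate_last_term != voter_last_term:
        return candidate_last_term > voter_last_term
    return candidate_last_index >= voter_last_index

print(candidate_log_ok(3, 5, 2, 9))  # True: higher last term wins despite a shorter log
print(candidate_log_ok(2, 4, 2, 6))  # False: same term, candidate's log is shorter
```

The first case is the surprising one: a shorter log can still win the vote if its last entry is from a later term, because entries from later terms supersede uncommitted tails from earlier terms.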
Practical Differences and When to Use Each
No practical system implements raw single-decree Paxos; real systems implement Multi-Paxos variants (e.g., Google's Chubby) or related protocols such as Zab (ZooKeeper) and Raft. Raft advantages: easier to understand, implement, and reason about; membership changes are well-specified; log replication is straightforward. Paxos advantages: more flexible (no strict leader requirement) and easier to optimize for WAN deployments (flexible quorums). In practice: choose Raft for new distributed systems requiring consensus (etcd, CockroachDB, and TiKV all use Raft). For interview discussions: understand Raft fully (leader election, log replication, the commitment rule) and know that Paxos is the theoretical foundation. Byzantine fault tolerance (nodes that lie, not just crash) requires BFT algorithms such as PBFT or Tendermint; Raft and Paxos tolerate only crash failures.
Key Interview Discussion Points
- Quorum size: a cluster of 2f+1 nodes can tolerate f failures (majority = f+1 nodes needed for consensus); 3-node cluster tolerates 1 failure; 5-node tolerates 2; adding nodes beyond 5 rarely justified (slower consensus, more network overhead)
- Split-brain prevention: Raft prevents split-brain by requiring a majority for any decision — a network partition can only have one side with a majority, so only one partition continues to make progress
- Leader lease: to serve reads without querying a quorum (avoiding a round trip), the leader can hold a lease shorter than the minimum election timeout; until the lease expires, no other node can have won an election, so the leader knows it is still leader and can serve reads locally (this relies on bounded clock drift across nodes)
- Log compaction and snapshots: the Raft log grows unboundedly; periodic snapshots of the state machine allow truncating old log entries; new nodes receive the snapshot + log tail instead of the full log history
- etcd in Kubernetes: etcd stores all Kubernetes cluster state (pods, services, config maps); etcd uses Raft for consensus; Kubernetes recommends 3 or 5 etcd nodes; etcd performance is critical — fsync latency directly impacts API server write latency
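The quorum arithmetic in the first bullet is worth being able to reproduce on a whiteboard. A sketch (function names are illustrative):

```python
# Quorum math for crash-fault-tolerant consensus: a cluster of 2f+1 nodes
# needs f+1 votes to decide and tolerates f failures.

def quorum(cluster_size):
    """Smallest majority of the cluster."""
    return cluster_size // 2 + 1

def failures_tolerated(cluster_size):
    """Failures survivable while a majority remains reachable."""
    return (cluster_size - 1) // 2

for n in (3, 4, 5, 7):
    print(f"{n} nodes: quorum {quorum(n)}, tolerates {failures_tolerated(n)}")
```

Note that even cluster sizes buy nothing: 4 nodes need a quorum of 3 and still tolerate only 1 failure, which is why deployments use 3, 5, or 7 nodes.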