System Design: Distributed Consensus — Raft, Paxos, Leader Election, Log Replication, Split Brain, Quorum

Distributed consensus is the foundation that makes distributed systems reliable. Every replicated database, every coordination service, and every distributed lock depends on consensus to ensure all nodes agree on the same state. This guide covers Raft and Paxos — the two most important consensus algorithms — with practical focus on how they work, when they fail, and how production systems like etcd, CockroachDB, and TiKV use them.

Why Consensus Is Hard

The fundamental problem: multiple nodes must agree on a value (or sequence of values) despite network partitions, message delays, message loss, and node crashes. The FLP impossibility result (1985) proved that no deterministic consensus algorithm can guarantee termination in an asynchronous system where even one node can crash. In practice, consensus algorithms work around this by using timeouts and randomization — they sacrifice guaranteed termination in the worst case for practical reliability in typical conditions. The CAP theorem constrains the design space further: during a network partition, you must choose between consistency (all nodes see the same data) and availability (all nodes can respond). Consensus algorithms choose consistency — a minority partition cannot process writes, ensuring no conflicting writes occur.

Raft: Understandable Consensus

Raft was designed by Diego Ongaro and John Ousterhout (published in 2014) specifically to be understandable, in contrast to Paxos, which is notoriously difficult to implement correctly. Raft decomposes consensus into three subproblems: leader election, log replication, and safety. A Raft cluster has N nodes (typically 3 or 5). At any time, one node is the leader and the rest are followers. The leader receives all client writes, appends them to its log, and replicates them to followers. A write is committed when a majority (quorum) of nodes have persisted it. For a 5-node cluster, the quorum is 3, so the system tolerates 2 node failures. For a 3-node cluster, the quorum is 2, so it tolerates 1 failure.
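The quorum arithmetic generalizes: for N nodes the quorum is floor(N/2) + 1, and the cluster tolerates N minus quorum failures. A minimal Python sketch (illustrative only, not from any Raft implementation):

```python
def quorum(n: int) -> int:
    """Majority size for an n-node cluster: floor(n/2) + 1."""
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    """Nodes that can fail while a majority remains reachable."""
    return n - quorum(n)

for n in (3, 4, 5, 6, 7):
    print(f"{n} nodes: quorum {quorum(n)}, tolerates {tolerated_failures(n)} failures")
```

Running this makes the even-size waste visible: 4 nodes need a quorum of 3 yet tolerate only 1 failure, the same as 3 nodes.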

Raft Leader Election

Each node starts as a follower. Followers expect periodic heartbeats from the leader. If a follower receives no heartbeat within the election timeout (randomized, typically between 150 and 300 ms, to prevent split votes), it becomes a candidate and starts an election: it increments its current term (a monotonically increasing logical clock), votes for itself, and sends RequestVote RPCs to all other nodes. A node grants its vote if: (1) the candidate's term is at least as high as the voter's current term and the voter has not already voted for a different candidate in that term, and (2) the candidate's log is at least as up-to-date as the voter's log (the candidate's last entry has a higher term, or the same term with an index at least as large). This ensures the elected leader has all committed entries. If a candidate receives votes from a majority, it becomes leader and immediately sends heartbeats to all followers to establish authority and prevent new elections. If two candidates split the vote (no majority), both time out and start a new election with incremented terms; the randomized timeout makes repeated splits unlikely.
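The vote-granting rules can be sketched as a RequestVote handler. This is an illustrative model, not code from any Raft implementation; the state fields and function names are invented for the example:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class NodeState:
    current_term: int
    voted_for: Optional[str]   # candidate this node voted for in current_term
    last_log_term: int
    last_log_index: int

def handle_request_vote(voter: NodeState, cand_id: str, cand_term: int,
                        cand_last_term: int, cand_last_index: int) -> bool:
    """Sketch of Raft's vote-granting rules (names are illustrative)."""
    if cand_term < voter.current_term:
        return False                    # stale candidate: reject
    if cand_term > voter.current_term:
        voter.current_term = cand_term  # step into the newer term
        voter.voted_for = None
    # At most one vote per term.
    if voter.voted_for not in (None, cand_id):
        return False
    # Candidate's log must be at least as up-to-date: compare
    # (last term, last index) lexicographically.
    up_to_date = (cand_last_term, cand_last_index) >= \
                 (voter.last_log_term, voter.last_log_index)
    if up_to_date:
        voter.voted_for = cand_id
    return up_to_date
```

The lexicographic comparison on (last term, last index) is exactly the "at least as up-to-date" rule: a higher last term wins outright, and equal terms are broken by index.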

Raft Log Replication

The leader receives client requests and appends each as a new entry in its log. Each entry contains: the term in which it was received, its index in the log, and the command (state machine operation). The leader sends AppendEntries RPCs to all followers containing the new entries and the leader's commit index. Followers append the entries to their logs and respond with success. When the leader receives success from a majority (including itself), the entry is committed and will never be lost. The leader advances its commit index and applies the committed entry to its state machine. Followers learn of committed entries via the leader's commit index in subsequent AppendEntries RPCs and apply them to their own state machines.

Log matching property: if two logs contain an entry with the same index and term, then all entries up to that index are identical. This is maintained by a consistency check in AppendEntries: the leader includes the term and index of the entry immediately preceding the new entries. If the follower does not have a matching entry, it rejects the RPC, and the leader backs up and retries with earlier entries until the logs converge.
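The follower-side consistency check can be sketched as follows. This is a simplified model with 1-based indices as in the Raft paper; real implementations truncate only from the first conflicting entry, and the function name and signature are invented for the example:

```python
def append_entries(log, prev_index, prev_term, entries, leader_commit, commit_index):
    """Follower-side AppendEntries sketch.

    log is a list of (term, command) tuples; prev_index/prev_term identify
    the entry immediately preceding the new ones. Returns (success, commit_index).
    """
    # Consistency check: we must hold a matching entry at prev_index.
    if prev_index > 0:
        if len(log) < prev_index or log[prev_index - 1][0] != prev_term:
            return False, commit_index   # reject; leader backs up and retries
    # Drop any conflicting suffix, then append the new entries.
    # (Real Raft truncates only from the first actual conflict.)
    del log[prev_index:]
    log.extend(entries)
    # Advance the commit index, bounded by what this follower actually holds.
    commit_index = max(commit_index, min(leader_commit, len(log)))
    return True, commit_index
```

A rejected call is how the logs converge: the leader decrements its notion of the follower's matching prefix and resends from an earlier point until the check passes.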

Paxos: The Original Consensus Algorithm

Paxos, invented by Leslie Lamport in 1989 (published 1998), solves single-value consensus. Three roles: proposers (suggest values), acceptors (vote on proposals), and learners (learn the decided value). Phase 1 (Prepare): a proposer selects a proposal number N (higher than any it has seen) and sends Prepare(N) to a majority of acceptors. Each acceptor responds with a promise not to accept any proposal numbered less than N, and includes the highest-numbered proposal it has already accepted (if any). Phase 2 (Accept): if the proposer receives promises from a majority, it sends Accept(N, V) where V is the value from the highest-numbered accepted proposal in the responses (if any), or the proposer's own value (if no acceptor had accepted anything). Acceptors accept the proposal unless they have already promised a higher-numbered proposal. When a majority accepts, the value is chosen. Multi-Paxos extends this to a sequence of values (a replicated log) by running Paxos instances in sequence and optimizing the common case with a stable leader.
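The two phases can be sketched as a single-decree acceptor plus the proposer's value-selection rule. This is an illustrative model; the class and function names are invented, and real implementations persist the acceptor state to disk before replying:

```python
class Acceptor:
    """Single-decree Paxos acceptor sketch (illustrative, not a library API)."""

    def __init__(self):
        self.promised = 0        # highest proposal number promised
        self.accepted_n = 0      # number of the accepted proposal, if any
        self.accepted_v = None   # value of the accepted proposal, if any

    def prepare(self, n):
        """Phase 1: promise to ignore proposals numbered below n.

        Returns (ok, highest accepted number, its value)."""
        if n > self.promised:
            self.promised = n
            return True, self.accepted_n, self.accepted_v
        return False, None, None

    def accept(self, n, v):
        """Phase 2: accept unless a higher-numbered prepare was promised."""
        if n >= self.promised:
            self.promised = n
            self.accepted_n, self.accepted_v = n, v
            return True
        return False

def choose_value(responses, own_value):
    """Proposer rule: adopt the value of the highest-numbered accepted
    proposal among the promises, else use the proposer's own value."""
    accepted = [(n, v) for ok, n, v in responses if ok and v is not None]
    return max(accepted, key=lambda t: t[0])[1] if accepted else own_value
```

The value-selection rule is what makes Paxos safe: once any value could have been chosen by a majority, every later proposer is forced to re-propose that same value.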

Split Brain Prevention

Split brain occurs when two nodes both believe they are the leader, potentially making conflicting decisions. Consensus algorithms prevent split brain through quorum overlap: any two majorities overlap in at least one node. In a 5-node cluster, any majority requires 3 nodes. Two groups of 3 share at least 1 node. This shared node ensures that at most one leader can be elected in any term — a node votes for at most one candidate per term. Network partition scenario: a 5-node cluster splits into a group of 3 and a group of 2. The group of 3 can elect a leader (has a majority) and continue processing writes. The group of 2 cannot elect a leader (no majority) and becomes read-only or unavailable. When the partition heals, the group of 2 catches up by replicating the log entries it missed. No conflicting writes occurred during the partition. This is why odd cluster sizes (3, 5, 7) are used — even sizes (4, 6) do not improve fault tolerance (a 4-node cluster tolerates 1 failure, same as 3).
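The quorum-overlap argument can be checked exhaustively for a small cluster. The sketch below (plain Python, illustrative) enumerates every majority of a 5-node cluster and confirms that any two majorities intersect, while a 2-node minority falls short of quorum:

```python
from itertools import combinations

def majorities(nodes):
    """All subsets of nodes with size >= quorum (floor(n/2) + 1)."""
    q = len(nodes) // 2 + 1
    return [set(c)
            for size in range(q, len(nodes) + 1)
            for c in combinations(sorted(nodes), size)]

nodes = {"n1", "n2", "n3", "n4", "n5"}
ms = majorities(nodes)

# Any two majorities share at least one node, so at most one candidate
# can collect a majority of votes in a given term.
assert all(a & b for a in ms for b in ms)

# A partitioned minority of 2 cannot reach quorum (3) and cannot elect.
assert len({"n4", "n5"}) < len(nodes) // 2 + 1
```

For 5 nodes there are 16 majorities (sizes 3, 4, and 5), and every pair of them overlaps because 3 + 3 > 5.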

Consensus in Production Systems

Production systems using Raft: etcd (Kubernetes coordination), CockroachDB (distributed SQL; runs Raft per range for data replication), TiKV (distributed key-value store), Consul (service discovery), and HashiCorp Vault (HA backend). Production systems using Paxos/Multi-Paxos: Google Spanner (Multi-Paxos per data shard, combined with TrueTime timestamps for external consistency), Google Chubby (lock service), and Apache ZooKeeper (uses ZAB, a protocol similar to Multi-Paxos). Performance characteristics: Raft requires a network round-trip to a majority for each write. In a same-datacenter deployment with sub-millisecond network latency, this adds roughly 1-2 ms per write. Cross-datacenter Raft (e.g., CockroachDB multi-region) adds the inter-datacenter latency (50-200 ms) to each write; this is the fundamental cost of strong consistency across regions. Read optimization: by default, reads also go through the leader for linearizability. Lease-based reads let the leader serve reads locally while its lease (renewed by heartbeats) is valid, avoiding the quorum round-trip at the cost of relying on bounded clock skew; serving reads from followers is cheaper still but trades away linearizability.
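The lease check a leader might perform before serving a local read can be sketched as follows. This is a hypothetical model, not etcd's actual implementation; the class name, field names, and the skew allowance are all invented for illustration:

```python
class LeaderLease:
    """Sketch of lease-based reads: the leader may answer reads locally
    while its lease (restarted each time a heartbeat reaches a majority)
    is still valid. Correctness relies on bounded clock skew."""

    def __init__(self, duration_s: float, max_skew_s: float):
        self.duration_s = duration_s   # nominal lease length in seconds
        self.max_skew_s = max_skew_s   # assumed worst-case clock skew
        self.renewed_at = 0.0          # local time of last quorum ack

    def on_heartbeat_quorum_ack(self, now: float) -> None:
        """A majority acknowledged the heartbeat: the lease restarts."""
        self.renewed_at = now

    def can_serve_local_read(self, now: float) -> bool:
        """Serve locally only inside a conservatively shrunken window."""
        return now < self.renewed_at + self.duration_s - self.max_skew_s
```

Shrinking the window by the worst-case skew is the standard precaution: if the leader's clock runs slow relative to a new leader's, an unshrunken lease could let two nodes serve "linearizable" reads at once.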

Frequently Asked Questions

What is the difference between Raft and Paxos consensus algorithms?

Paxos (1989) and Raft (2014) solve the same problem, getting distributed nodes to agree on a sequence of values, but differ in design philosophy. Paxos is described as a single-decree protocol (agreeing on one value) and extended to Multi-Paxos for a sequence of values; the extension is underspecified, which has led to many incompatible implementations. Paxos separates roles (proposer, acceptor, learner) and allows multiple concurrent proposers, making the protocol harder to reason about. Raft is designed from the start as a complete replicated-log protocol. It decomposes consensus into three clearly defined subproblems: leader election (one leader at a time), log replication (the leader replicates entries to followers), and safety (committed entries are never lost). Raft requires a stable leader; only the leader can propose entries, which simplifies the protocol. In practice, both provide the same safety guarantees. Raft is easier to understand and implement correctly, which is why etcd, CockroachDB, and TiKV chose it. Paxos variants are used in Google Spanner and Chubby, where the engineering teams have deep distributed systems expertise.

Why do consensus clusters use odd numbers of nodes like 3 or 5?

Consensus requires a majority (quorum) to make progress. For N nodes, the quorum is floor(N/2) + 1. A 3-node cluster has quorum 2 and tolerates 1 failure. A 4-node cluster has quorum 3 and also tolerates only 1 failure (if 2 nodes fail, only 2 remain, which is less than quorum 3). So 4 nodes provide the same fault tolerance as 3 but require one more server. Similarly, 5 nodes (quorum 3) tolerate 2 failures, and 6 nodes (quorum 4) also tolerate 2 failures. The pattern: growing to an even size never improves fault tolerance. Odd cluster sizes (3, 5, 7) are optimal because every node contributes to fault tolerance. Common deployments: 3 nodes for development and small production (tolerates 1 failure); 5 nodes for production systems requiring higher availability (tolerates 2 failures); 7 nodes rarely, since the latency cost of waiting for 4 nodes to acknowledge each write is significant. Beyond 7, the performance overhead of consensus outweighs the availability benefit.

How does Raft handle network partitions without causing split brain?

Raft prevents split brain through the term mechanism and the quorum requirement. Each leader election increments the term (a monotonically increasing logical clock). A node votes for at most one candidate per term, and a candidate needs votes from a majority to become leader. During a network partition, the cluster splits into two or more groups, and only the group containing a majority of nodes can elect a leader; the minority group cannot gather enough votes. Example: a 5-node cluster partitions into groups of 3 and 2. The group of 3 elects a new leader (quorum of 3, satisfied) and continues processing writes. The group of 2 cannot elect a leader and becomes unavailable for writes; reads may still be served from stale state on the minority side, depending on the read consistency configuration. When the partition heals, the minority nodes discover the majority leader (with a higher term), accept its leadership, and replicate any log entries they missed during the partition. If the old leader was in the minority group, it steps down upon seeing the higher term. The key guarantee: at no point do two leaders in the same term process writes, so no conflicting state is created.

What is linearizability and how does Raft provide it?

Linearizability is the strongest consistency guarantee for a distributed system: every operation appears to take effect atomically at some point between its invocation and its response. From the client's perspective, the system behaves as if there is a single copy of the data, even though it is replicated across multiple nodes. Raft provides linearizable writes because all writes go through the leader: the leader assigns a log index, replicates to a majority, and only then responds to the client. Once committed, the entry is durable and ordered. Linearizable reads are not automatic, however. A naive read from the leader might return stale data if the leader has been deposed by a network partition but does not yet know it. Solutions: (1) Read through the log: treat reads as log entries that must be committed (a round-trip to a majority); correct but slow. (2) ReadIndex: the leader confirms it is still the leader by sending a heartbeat to a majority before serving the read; no log entry is needed, but a network round-trip is still required. (3) Lease-based reads: the leader holds a lease renewed by heartbeats, and while the lease is valid it serves reads locally; faster, but reliant on bounded clock skew. etcd uses lease-based reads by default for performance.