System Design: Distributed Consensus — Raft, Paxos, Leader Election, Log Replication, Split Brain, Quorum

Distributed consensus is the foundation that makes distributed systems reliable. Every replicated database, every coordination service, and every distributed lock depends on consensus to ensure all nodes agree on the same state. This guide covers Raft and Paxos — the two most important consensus algorithms — with practical focus on how they work, when they fail, and how production systems like etcd, CockroachDB, and TiKV use them.

Why Consensus Is Hard

The fundamental problem: multiple nodes must agree on a value (or sequence of values) despite network partitions, message delays, message loss, and node crashes. The FLP impossibility result (1985) proved that no deterministic consensus algorithm can guarantee termination in an asynchronous system where even one node can crash. In practice, consensus algorithms work around this by using timeouts and randomization — they sacrifice guaranteed termination in the worst case for practical reliability in typical conditions. The CAP theorem constrains the design space further: during a network partition, you must choose between consistency (all nodes see the same data) and availability (all nodes can respond). Consensus algorithms choose consistency — a minority partition cannot process writes, ensuring no conflicting writes occur.

Raft: Understandable Consensus

Raft was designed by Diego Ongaro and John Ousterhout (published in 2014) specifically to be understandable, in contrast to Paxos, which is notoriously difficult to implement correctly. Raft decomposes consensus into three subproblems: leader election, log replication, and safety. A Raft cluster has N nodes (typically 3 or 5). At any time, one node is the leader and the rest are followers. The leader receives all client writes, appends them to its log, and replicates them to followers. A write is committed when a majority (quorum) of nodes have persisted it. For a 5-node cluster, the quorum is 3, so the system tolerates 2 node failures. For a 3-node cluster, the quorum is 2, so it tolerates 1 failure.
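The quorum arithmetic generalizes: for N nodes the quorum is floor(N/2) + 1, and the cluster tolerates N minus quorum failures. A minimal Python sketch (illustrative only, not from any Raft implementation):

```python
def quorum(n: int) -> int:
    """Majority size for an n-node cluster: floor(n/2) + 1."""
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    """Nodes that can fail while a majority remains reachable."""
    return n - quorum(n)

for n in (3, 4, 5, 6, 7):
    print(f"{n} nodes: quorum {quorum(n)}, tolerates {tolerated_failures(n)} failures")
```

Running this makes the even-size waste visible: 4 nodes need a quorum of 3 yet tolerate only 1 failure, the same as 3 nodes.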

Raft Leader Election

Each node starts as a follower. Followers expect periodic heartbeats from the leader. If a follower receives no heartbeat within the election timeout (randomized, typically between 150 and 300 ms, to prevent split votes), it becomes a candidate and starts an election: it increments its current term (a monotonically increasing logical clock), votes for itself, and sends RequestVote RPCs to all other nodes. A node grants its vote if: (1) the candidate's term is at least as high as the voter's current term and the voter has not already voted for a different candidate in that term, and (2) the candidate's log is at least as up-to-date as the voter's log (the candidate's last entry has a higher term, or the same term with an index at least as large). This ensures the elected leader has all committed entries. If a candidate receives votes from a majority, it becomes leader and immediately sends heartbeats to all followers to establish authority and prevent new elections. If two candidates split the vote (no majority), both time out and start a new election with incremented terms; the randomized timeout makes repeated splits unlikely.
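The vote-granting rules can be sketched as a RequestVote handler. This is an illustrative model, not code from any Raft implementation; the state fields and function names are invented for the example:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class NodeState:
    current_term: int
    voted_for: Optional[str]   # candidate this node voted for in current_term
    last_log_term: int
    last_log_index: int

def handle_request_vote(voter: NodeState, cand_id: str, cand_term: int,
                        cand_last_term: int, cand_last_index: int) -> bool:
    """Sketch of Raft's vote-granting rules (names are illustrative)."""
    if cand_term < voter.current_term:
        return False                    # stale candidate: reject
    if cand_term > voter.current_term:
        voter.current_term = cand_term  # step into the newer term
        voter.voted_for = None
    # At most one vote per term.
    if voter.voted_for not in (None, cand_id):
        return False
    # Candidate's log must be at least as up-to-date: compare
    # (last term, last index) lexicographically.
    up_to_date = (cand_last_term, cand_last_index) >= \
                 (voter.last_log_term, voter.last_log_index)
    if up_to_date:
        voter.voted_for = cand_id
    return up_to_date
```

The lexicographic comparison on (last term, last index) is exactly the "at least as up-to-date" rule: a higher last term wins outright, and equal terms are broken by index.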

Raft Log Replication

The leader receives client requests and appends each as a new entry in its log. Each entry contains: the term in which it was received, its index in the log, and the command (state machine operation). The leader sends AppendEntries RPCs to all followers containing the new entries and the leader's commit index. Followers append the entries to their logs and respond with success. When the leader receives success from a majority (including itself), the entry is committed and will never be lost. The leader advances its commit index and applies the committed entry to its state machine. Followers learn of committed entries via the leader's commit index in subsequent AppendEntries RPCs and apply them to their own state machines.

Log matching property: if two logs contain an entry with the same index and term, then all entries up to that index are identical. This is maintained by a consistency check in AppendEntries: the leader includes the term and index of the entry immediately preceding the new entries. If the follower does not have a matching entry, it rejects the RPC, and the leader backs up and retries with earlier entries until the logs converge.
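The follower-side consistency check can be sketched as follows. This is a simplified model with 1-based indices as in the Raft paper; real implementations truncate only from the first conflicting entry, and the function name and signature are invented for the example:

```python
def append_entries(log, prev_index, prev_term, entries, leader_commit, commit_index):
    """Follower-side AppendEntries sketch.

    log is a list of (term, command) tuples; prev_index/prev_term identify
    the entry immediately preceding the new ones. Returns (success, commit_index).
    """
    # Consistency check: we must hold a matching entry at prev_index.
    if prev_index > 0:
        if len(log) < prev_index or log[prev_index - 1][0] != prev_term:
            return False, commit_index   # reject; leader backs up and retries
    # Drop any conflicting suffix, then append the new entries.
    # (Real Raft truncates only from the first actual conflict.)
    del log[prev_index:]
    log.extend(entries)
    # Advance the commit index, bounded by what this follower actually holds.
    commit_index = max(commit_index, min(leader_commit, len(log)))
    return True, commit_index
```

A rejected call is how the logs converge: the leader decrements its notion of the follower's matching prefix and resends from an earlier point until the check passes.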

Paxos: The Original Consensus Algorithm

Paxos, invented by Leslie Lamport in 1989 (published 1998), solves single-value consensus. Three roles: proposers (suggest values), acceptors (vote on proposals), and learners (learn the decided value). Phase 1 (Prepare): a proposer selects a proposal number N (higher than any it has seen) and sends Prepare(N) to a majority of acceptors. Each acceptor responds with a promise not to accept any proposal numbered less than N, and includes the highest-numbered proposal it has already accepted (if any). Phase 2 (Accept): if the proposer receives promises from a majority, it sends Accept(N, V) where V is the value from the highest-numbered accepted proposal in the responses (if any), or the proposer's own value (if no acceptor had accepted anything). Acceptors accept the proposal unless they have already promised a higher-numbered proposal. When a majority accepts, the value is chosen. Multi-Paxos extends this to a sequence of values (a replicated log) by running Paxos instances in sequence and optimizing the common case with a stable leader.
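The two phases can be sketched as a single-decree acceptor plus the proposer's value-selection rule. This is an illustrative model; the class and function names are invented, and real implementations persist the acceptor state to disk before replying:

```python
class Acceptor:
    """Single-decree Paxos acceptor sketch (illustrative, not a library API)."""

    def __init__(self):
        self.promised = 0        # highest proposal number promised
        self.accepted_n = 0      # number of the accepted proposal, if any
        self.accepted_v = None   # value of the accepted proposal, if any

    def prepare(self, n):
        """Phase 1: promise to ignore proposals numbered below n.

        Returns (ok, highest accepted number, its value)."""
        if n > self.promised:
            self.promised = n
            return True, self.accepted_n, self.accepted_v
        return False, None, None

    def accept(self, n, v):
        """Phase 2: accept unless a higher-numbered prepare was promised."""
        if n >= self.promised:
            self.promised = n
            self.accepted_n, self.accepted_v = n, v
            return True
        return False

def choose_value(responses, own_value):
    """Proposer rule: adopt the value of the highest-numbered accepted
    proposal among the promises, else use the proposer's own value."""
    accepted = [(n, v) for ok, n, v in responses if ok and v is not None]
    return max(accepted, key=lambda t: t[0])[1] if accepted else own_value
```

The value-selection rule is what makes Paxos safe: once any value could have been chosen by a majority, every later proposer is forced to re-propose that same value.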

Split Brain Prevention

Split brain occurs when two nodes both believe they are the leader, potentially making conflicting decisions. Consensus algorithms prevent split brain through quorum overlap: any two majorities overlap in at least one node. In a 5-node cluster, any majority requires 3 nodes. Two groups of 3 share at least 1 node. This shared node ensures that at most one leader can be elected in any term — a node votes for at most one candidate per term. Network partition scenario: a 5-node cluster splits into a group of 3 and a group of 2. The group of 3 can elect a leader (has a majority) and continue processing writes. The group of 2 cannot elect a leader (no majority) and becomes read-only or unavailable. When the partition heals, the group of 2 catches up by replicating the log entries it missed. No conflicting writes occurred during the partition. This is why odd cluster sizes (3, 5, 7) are used — even sizes (4, 6) do not improve fault tolerance (a 4-node cluster tolerates 1 failure, same as 3).
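The quorum-overlap argument can be checked exhaustively for a small cluster. The sketch below (plain Python, illustrative) enumerates every majority of a 5-node cluster and confirms that any two majorities intersect, while a 2-node minority falls short of quorum:

```python
from itertools import combinations

def majorities(nodes):
    """All subsets of nodes with size >= quorum (floor(n/2) + 1)."""
    q = len(nodes) // 2 + 1
    return [set(c)
            for size in range(q, len(nodes) + 1)
            for c in combinations(sorted(nodes), size)]

nodes = {"n1", "n2", "n3", "n4", "n5"}
ms = majorities(nodes)

# Any two majorities share at least one node, so at most one candidate
# can collect a majority of votes in a given term.
assert all(a & b for a in ms for b in ms)

# A partitioned minority of 2 cannot reach quorum (3) and cannot elect.
assert len({"n4", "n5"}) < len(nodes) // 2 + 1
```

For 5 nodes there are 16 majorities (sizes 3, 4, and 5), and every pair of them overlaps because 3 + 3 > 5.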

Consensus in Production Systems

Production systems using Raft: etcd (Kubernetes coordination), CockroachDB (distributed SQL; runs Raft per range for data replication), TiKV (distributed key-value store), Consul (service discovery), and HashiCorp Vault (HA backend). Production systems using Paxos/Multi-Paxos: Google Spanner (Multi-Paxos per data shard, combined with TrueTime timestamps for external consistency), Google Chubby (lock service), and Apache ZooKeeper (uses ZAB, a protocol similar to Multi-Paxos). Performance characteristics: Raft requires a network round-trip to a majority for each write. In a same-datacenter deployment with sub-millisecond network latency, this adds roughly 1-2 ms per write. Cross-datacenter Raft (e.g., CockroachDB multi-region) adds the inter-datacenter latency (50-200 ms) to each write; this is the fundamental cost of strong consistency across regions. Read optimization: by default, reads also go through the leader for linearizability. Lease-based reads let the leader serve reads locally while its lease (renewed by heartbeats) is valid, avoiding the quorum round-trip at the cost of relying on bounded clock skew; serving reads from followers is cheaper still but trades away linearizability.
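The lease check a leader might perform before serving a local read can be sketched as follows. This is a hypothetical model, not etcd's actual implementation; the class name, field names, and the skew allowance are all invented for illustration:

```python
class LeaderLease:
    """Sketch of lease-based reads: the leader may answer reads locally
    while its lease (restarted each time a heartbeat reaches a majority)
    is still valid. Correctness relies on bounded clock skew."""

    def __init__(self, duration_s: float, max_skew_s: float):
        self.duration_s = duration_s   # nominal lease length in seconds
        self.max_skew_s = max_skew_s   # assumed worst-case clock skew
        self.renewed_at = 0.0          # local time of last quorum ack

    def on_heartbeat_quorum_ack(self, now: float) -> None:
        """A majority acknowledged the heartbeat: the lease restarts."""
        self.renewed_at = now

    def can_serve_local_read(self, now: float) -> bool:
        """Serve locally only inside a conservatively shrunken window."""
        return now < self.renewed_at + self.duration_s - self.max_skew_s
```

Shrinking the window by the worst-case skew is the standard precaution: if the leader's clock runs slow relative to a new leader's, an unshrunken lease could let two nodes serve "linearizable" reads at once.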

Frequently Asked Questions

What is the difference between Raft and Paxos consensus algorithms?

Paxos (1989) and Raft (2014) solve the same problem, getting distributed nodes to agree on a sequence of values, but differ in design philosophy. Paxos is described as a single-decree protocol (agreeing on one value) and extended to Multi-Paxos for a sequence of values; the extension is underspecified, which has led to many incompatible implementations. Paxos separates roles (proposer, acceptor, learner) and allows multiple concurrent proposers, making the protocol harder to reason about. Raft is designed from the start as a complete replicated-log protocol. It decomposes consensus into three clearly defined subproblems: leader election (one leader at a time), log replication (the leader replicates entries to followers), and safety (committed entries are never lost). Raft requires a stable leader; only the leader can propose entries, which simplifies the protocol. In practice, both provide the same safety guarantees. Raft is easier to understand and implement correctly, which is why etcd, CockroachDB, and TiKV chose it. Paxos variants are used in Google Spanner and Chubby, where the engineering teams have deep distributed systems expertise.

Why do consensus clusters use odd numbers of nodes like 3 or 5?

Consensus requires a majority (quorum) to make progress. For N nodes, the quorum is floor(N/2) + 1. A 3-node cluster has quorum 2 and tolerates 1 failure. A 4-node cluster has quorum 3 and also tolerates only 1 failure (if 2 nodes fail, only 2 remain, which is less than quorum 3). So 4 nodes provide the same fault tolerance as 3 but require one more server. Similarly, 5 nodes (quorum 3) tolerate 2 failures, and 6 nodes (quorum 4) also tolerate 2 failures. The pattern: growing to an even size never improves fault tolerance. Odd cluster sizes (3, 5, 7) are optimal because every node contributes to fault tolerance. Common deployments: 3 nodes for development and small production (tolerates 1 failure); 5 nodes for production systems requiring higher availability (tolerates 2 failures); 7 nodes rarely, since the latency cost of waiting for 4 nodes to acknowledge each write is significant. Beyond 7, the performance overhead of consensus outweighs the availability benefit.

How does Raft handle network partitions without causing split brain?

Raft prevents split brain through the term mechanism and the quorum requirement. Each leader election increments the term (a monotonically increasing logical clock). A node votes for at most one candidate per term, and a candidate needs votes from a majority to become leader. During a network partition, the cluster splits into two or more groups, and only the group containing a majority of nodes can elect a leader; the minority group cannot gather enough votes. Example: a 5-node cluster partitions into groups of 3 and 2. The group of 3 elects a new leader (quorum of 3, satisfied) and continues processing writes. The group of 2 cannot elect a leader and becomes unavailable for writes; reads may still be served from stale state on the minority side, depending on the read consistency configuration. When the partition heals, the minority nodes discover the majority leader (with a higher term), accept its leadership, and replicate any log entries they missed during the partition. If the old leader was in the minority group, it steps down upon seeing the higher term. The key guarantee: at no point do two leaders in the same term process writes, so no conflicting state is created.

What is linearizability and how does Raft provide it?

Linearizability is the strongest consistency guarantee for a distributed system: every operation appears to take effect atomically at some point between its invocation and its response. From the client's perspective, the system behaves as if there is a single copy of the data, even though it is replicated across multiple nodes. Raft provides linearizable writes because all writes go through the leader: the leader assigns a log index, replicates to a majority, and only then responds to the client. Once committed, the entry is durable and ordered. Linearizable reads are not automatic, however. A naive read from the leader might return stale data if the leader has been deposed by a network partition but does not yet know it. Solutions: (1) Read through the log: treat reads as log entries that must be committed (a round-trip to a majority); correct but slow. (2) ReadIndex: the leader confirms it is still the leader by sending a heartbeat to a majority before serving the read; no log entry is needed, but a network round-trip is still required. (3) Lease-based reads: the leader holds a lease renewed by heartbeats, and while the lease is valid it serves reads locally; faster, but reliant on bounded clock skew. etcd uses lease-based reads by default for performance.