Low Level Design: Paxos Consensus Protocol

The Consensus Problem

Distributed systems need multiple nodes to agree on a single value — which server is the leader, whether a transaction committed, what the next log entry is. This is the consensus problem. The challenge: nodes can crash, messages can be delayed or reordered, but the system must still reach agreement and never contradict itself. Paxos, invented by Leslie Lamport, is the foundational algorithm that solves consensus in the presence of crash failures (but not Byzantine/malicious failures).

Single-Decree Paxos: Roles

Single-decree Paxos agrees on exactly one value. Three roles exist, and a single process can play multiple roles simultaneously:

Proposer: Initiates consensus by proposing a value. Drives the protocol forward through two phases.

Acceptor: Votes on proposals. Stores the highest proposal number it has promised to respect and the last value it accepted. Must persist this state durably — if an acceptor crashes and loses state, it may violate its promises.

Learner: Learns the final decided value once a quorum of acceptors has accepted it. Often the same nodes as acceptors in practice.
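The durable acceptor state described above can be sketched as a small record. This is a minimal illustration; the class and field names are assumptions, not from any particular implementation:

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class AcceptorState:
    """State an acceptor must write to stable storage before replying.

    promised_n: highest proposal number this acceptor has promised to respect.
    accepted_n / accepted_value: the last proposal it accepted, if any.
    """
    promised_n: int = -1          # -1 means no promise made yet
    accepted_n: Optional[int] = None
    accepted_value: Optional[Any] = None
```

Losing this state on a crash is what breaks the protocol: a restarted acceptor with a fresh `promised_n` could accept a proposal it had already promised to reject.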

Phase 1: Prepare and Promise

The proposer selects a proposal number n that is higher than any it has used before, then broadcasts Prepare(n) to a quorum of acceptors (majority).

Each acceptor receiving Prepare(n) checks: is n greater than any proposal number it has already promised? If yes, it responds with Promise(n, accepted_value, accepted_n) — where accepted_value is the value from the highest-numbered proposal the acceptor has previously accepted (if any). Crucially, the promise means the acceptor will not accept any future proposal with number less than n. If n is not higher than what the acceptor already promised, it ignores or rejects the prepare.
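The acceptor's side of phase 1 can be sketched as follows (a hedged sketch; the method name, return tuples, and in-memory state are illustrative assumptions, and real acceptors must persist state before replying):

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Acceptor:
    promised_n: int = -1
    accepted_n: Optional[int] = None
    accepted_value: Optional[Any] = None

    def on_prepare(self, n: int):
        """Handle Prepare(n): promise only if n beats every prior promise."""
        if n > self.promised_n:
            self.promised_n = n  # must be made durable before replying
            # The promise carries the last accepted proposal, if any,
            # so the proposer can honor the safety rule in phase 2.
            return ("promise", n, self.accepted_n, self.accepted_value)
        return ("reject", self.promised_n, None, None)
```

For example, after promising n=5, a later Prepare(3) is rejected while a Prepare(7) would be promised.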

Phase 2: Accept and Learn

Once the proposer collects promises from a quorum, it proceeds to phase 2. It selects the value to propose: if any promise included an accepted value, the proposer must use the value associated with the highest accepted_n among all promises. This is the critical safety rule — it ensures a previously accepted value is not overwritten. If no acceptor reported a prior accepted value, the proposer is free to use its own intended value.

The proposer broadcasts Accept(n, v) to the same quorum. Each acceptor accepts the proposal if n is still >= its current promised number, updating its stored accepted value. Once a quorum of acceptors has accepted the same proposal number, the value v is committed — learners are notified.
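Both sides of phase 2 can be sketched in a few lines. The function names and the dictionary-based acceptor state are illustrative assumptions; the logic follows the safety rule and the acceptance check described above:

```python
from typing import Any, List, Optional, Tuple

# A promise is (accepted_n, accepted_value); accepted_n is None when the
# acceptor had accepted nothing before making its promise.
Promise = Tuple[Optional[int], Any]

def choose_value(promises: List[Promise], my_value: Any) -> Any:
    """Safety rule: adopt the value of the highest-numbered accepted
    proposal reported by the quorum; only if none exists, use our own."""
    accepted = [p for p in promises if p[0] is not None]
    if accepted:
        return max(accepted, key=lambda p: p[0])[1]
    return my_value

def on_accept(state: dict, n: int, v: Any) -> bool:
    """Acceptor side: accept iff n is still >= the promised number."""
    if n >= state["promised_n"]:
        state["promised_n"] = n
        state["accepted_n"] = n
        state["accepted_value"] = v  # persist before acknowledging
        return True
    return False
```

If the promises report accepted proposals (3, "A") and (5, "B"), `choose_value` must return "B" regardless of what the proposer originally wanted to propose.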

Why Quorums Guarantee Safety

Any two majorities of a 2F+1 node cluster overlap in at least one node. This overlap is the safety engine of Paxos. Suppose value V was accepted by a quorum Q1 in a prior round. Any new proposer must collect promises from some quorum Q2. Since Q1 and Q2 overlap, at least one acceptor in Q2 has already accepted V. That acceptor’s promise response will carry V with its proposal number, forcing the new proposer to adopt V rather than propose a different value. This prevents two different values from ever both being committed.
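The overlap claim is easy to verify mechanically. The sketch below enumerates every majority quorum of a 5-node (F=2) cluster and checks that each pair intersects:

```python
from itertools import combinations

def majorities(n: int):
    """All majority quorums of an n-node cluster."""
    quorum_size = n // 2 + 1
    return [set(c) for c in combinations(range(n), quorum_size)]

# Every pair of majorities in a 5-node cluster shares at least one node.
quorums = majorities(5)
assert all(q1 & q2 for q1 in quorums for q2 in quorums)
```

With 5 nodes there are 10 possible 3-node majorities, and any two of them share at least one acceptor, so no two rounds can commit conflicting values behind each other's backs.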

Liveness and Dueling Proposers

Paxos guarantees safety (no two values are ever decided) but not liveness in the presence of dueling proposers. Imagine proposer A sends Prepare(5). Before A sends Accept(5), proposer B sends Prepare(6) — invalidating A’s promises. A retries with Prepare(7). B retries with Prepare(8). This cycle can continue indefinitely with neither value ever being decided. The classic solution: designate a single leader proposer. Only the leader runs proposals. If the leader fails, a new one is elected and takes over — after waiting long enough to ensure no in-flight proposals from the old leader can still affect acceptors.
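For the retry loop above to work at all, each proposer must be able to pick a number that is both higher than any it has seen and globally unique. One common scheme (an assumption here, not part of the text) encodes the number as `round * num_nodes + node_id`:

```python
def next_proposal_number(last_seen_n: int, node_id: int, num_nodes: int) -> int:
    """Return a proposal number unique to this node and strictly higher
    than last_seen_n, by encoding n = round * num_nodes + node_id."""
    round_ = last_seen_n // num_nodes + 1
    return round_ * num_nodes + node_id

# Two dueling proposers (ids 0 and 1) in a 5-node cluster never collide:
a = next_proposal_number(0, 0, 5)    # proposer 0 picks 5
b = next_proposal_number(a, 1, 5)    # proposer 1 outbids with 11
a2 = next_proposal_number(b, 0, 5)   # proposer 0 outbids with 15
```

Uniqueness prevents two proposers from ever using the same number, but as the section notes, it does nothing to stop them outbidding each other forever; that requires a designated leader or randomized backoff.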

Multi-Paxos

Single-decree Paxos decides one value. Real systems need to agree on a sequence of values — a replicated log. Multi-Paxos extends single-decree Paxos by running separate Paxos instances for each log slot. The optimization: if a stable leader is in place, phase 1 (Prepare/Promise) only needs to run once to establish the leader’s authority across all future slots. Subsequent log entries can skip directly to phase 2 (Accept), reducing the protocol to a single round trip per entry. This makes Multi-Paxos practical for high-throughput replication.
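The steady-state fast path can be sketched as a leader that holds a proposal number established by a one-time phase 1 and assigns log slots with phase-2 messages only. The class and its in-memory log are illustrative assumptions; a real leader would wait for a quorum of Accept acknowledgments per slot:

```python
from typing import Any, Dict

class MultiPaxosLeader:
    """Sketch: after one phase 1 establishes this leader's proposal
    number across all slots, each new entry needs only an Accept."""

    def __init__(self, established_n: int):
        self.n = established_n         # from the one-time Prepare/Promise
        self.log: Dict[int, Any] = {}  # slot -> chosen value
        self.next_slot = 0

    def append(self, value: Any) -> int:
        """Assign the next free slot and (conceptually) broadcast
        Accept(n, value) for it -- one round trip, no Prepare."""
        slot = self.next_slot
        self.next_slot += 1
        self.log[slot] = value  # stands in for a quorum of Accept acks
        return slot
```

Each `append` is a single round trip in the sketch, which is exactly the optimization that makes Multi-Paxos viable for a high-throughput replicated log.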

Leader Leases

Even in Multi-Paxos, reads that need linearizability must go through the leader to ensure they see the latest committed value. Leader leases improve read latency: the leader acquires a lease for a bounded time window T. During that window, it is guaranteed to be the only leader (no other node can have completed a leader election within T). This allows the leader to serve reads locally from its in-memory state without a quorum round trip. Safety requires that clock skew across nodes is bounded and much smaller than T — the leader must not serve stale reads if a new leader was elected during a clock skew gap.
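The clock-skew requirement can be made concrete with a sketch: the leader treats its lease as expiring early by the assumed skew bound, so it never serves a local read during the window where another node might already have been elected. All names and parameters here are illustrative assumptions:

```python
from typing import Optional

class LeaderLease:
    """Sketch of a lease check. max_clock_skew_s is the assumed bound on
    clock skew between nodes; local reads are allowed only while the
    lease, shrunk by that bound, is still in force."""

    def __init__(self, lease_duration_s: float, max_clock_skew_s: float):
        self.duration = lease_duration_s
        self.skew = max_clock_skew_s
        self.acquired_at: Optional[float] = None

    def acquire(self, now: float) -> None:
        self.acquired_at = now

    def can_serve_local_read(self, now: float) -> bool:
        if self.acquired_at is None:
            return False
        # Conservatively expire the lease early by the skew bound, so a
        # new leader elected elsewhere cannot coexist with local reads.
        return now < self.acquired_at + self.duration - self.skew
```

With a 10-second lease and a 1-second skew bound, a lease acquired at t=100 stops authorizing local reads at t=109, one second before any other node could consider it expired.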

Raft: Multi-Paxos Made Explicit

Raft is best understood as Multi-Paxos with stronger invariants and a clearer specification. Key differences: Raft requires that a leader always have the most up-to-date log before winning an election (log completeness property), whereas Paxos allows a leader to be elected and then fill in missing entries. Raft makes log replication, leader election, and membership changes into explicitly specified sub-protocols. The result is easier to implement correctly and reason about. etcd, CockroachDB, and TiKV all use Raft. The underlying consensus guarantees are equivalent to Multi-Paxos.

Real-World Deployments: Chubby and Spanner

Google Chubby is a distributed lock service built on Paxos. It provides coarse-grained distributed locks and small file storage used by Bigtable and GFS for leader election and metadata. Chubby's five-node Paxos group tolerates two node failures. Its design paper influenced ZooKeeper, which serves the same role in the open-source ecosystem using ZAB, a Paxos-like atomic broadcast protocol.

Google Spanner uses Paxos for replication within each shard (split). Each shard is replicated via a Paxos group, typically across three to five datacenters depending on the replication configuration. Spanner adds TrueTime — GPS and atomic clock-based timestamps with bounded uncertainty — to provide external consistency (a stronger form of serializability across shards and datacenters) without a global lock manager. Spanner's use of Paxos at planetary scale demonstrated that consensus-based replication is practical for globally distributed OLTP systems.
