System Design Interview: Raft Consensus Algorithm Deep Dive

Why Consensus Matters

Distributed systems need to agree on values — which server is the primary, what the current config is, whether a transaction committed. Without consensus, two servers can independently believe they are the leader (split-brain), leading to data corruption. Raft is the consensus algorithm behind etcd (Kubernetes config store), CockroachDB, TiKV, Consul, and many other production systems. It was designed to be more understandable than Paxos while providing equivalent safety guarantees.

Raft Overview: Three Subproblems

Raft decomposes consensus into three relatively independent subproblems:

  1. Leader election: one server is elected leader at any given time; if it fails, a new leader is elected
  2. Log replication: the leader accepts log entries from clients and replicates them to followers; the leader ensures all committed entries are durable on a majority
  3. Safety: if any server has applied a log entry at a given index, no other server can apply a different entry for the same index

Server States and Terms

Each Raft server is in one of three states: Follower, Candidate, or Leader. Raft divides time into terms — monotonically increasing integers. Each term begins with an election. A term has at most one leader (or no leader if the election fails). Terms act as a logical clock: any message with a stale term is rejected.

  • All servers start as followers
  • A follower becomes a candidate if it does not hear from a leader within the election timeout (150-300ms, randomized)
  • A candidate that receives votes from a majority becomes leader
  • A leader sends periodic heartbeats (AppendEntries RPCs with no log entries) to maintain authority
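The follower-to-candidate transition above can be sketched as a randomized timer. This is a minimal model, not an implementation: the state names and the 150-300 ms range come from the text, while the class shape, field names, and `tick`/`on_heartbeat` methods are illustrative.

```python
import random

ELECTION_TIMEOUT_MS = (150, 300)  # randomized range from the text

def new_election_timeout() -> float:
    """Pick a fresh randomized election timeout in milliseconds."""
    return random.uniform(*ELECTION_TIMEOUT_MS)

class Server:
    FOLLOWER, CANDIDATE, LEADER = "follower", "candidate", "leader"

    def __init__(self):
        self.state = Server.FOLLOWER   # all servers start as followers
        self.current_term = 0
        self.ms_since_heartbeat = 0.0
        self.timeout = new_election_timeout()

    def tick(self, elapsed_ms: float):
        """Advance the timer; become a candidate if no heartbeat arrived in time."""
        self.ms_since_heartbeat += elapsed_ms
        if self.state == Server.FOLLOWER and self.ms_since_heartbeat >= self.timeout:
            self.state = Server.CANDIDATE
            self.current_term += 1  # first step of starting an election

    def on_heartbeat(self):
        """An empty AppendEntries from the leader resets the election timer."""
        self.ms_since_heartbeat = 0.0
        self.timeout = new_election_timeout()
```

Each heartbeat also re-randomizes the timeout, so no two servers stay synchronized across election rounds.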

Leader Election

When a follower’s election timeout fires:

  1. Increment current term
  2. Transition to candidate
  3. Vote for self
  4. Send RequestVote RPCs to all other servers in parallel

A server grants its vote if: (a) it has not already voted for a different candidate in this term, and (b) the candidate’s log is at least as up-to-date as the voter’s log: the candidate’s last log term is higher than the voter’s, or the last terms are equal and the candidate’s last log index is at least the voter’s. The “at least as up-to-date” check is the key safety property — it ensures only servers holding every committed entry can become leader.
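The vote-granting rule can be written as a small predicate. This is a sketch under stated assumptions: the function and parameter names are illustrative, and `voted_for is None` stands for "has not voted this term" (a repeat vote for the same candidate is also granted, matching the Raft paper).

```python
def log_up_to_date(cand_last_term, cand_last_index,
                   my_last_term, my_last_index) -> bool:
    """Is the candidate's log at least as up-to-date as the voter's?"""
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term
    return cand_last_index >= my_last_index

def grant_vote(voted_for, candidate_id,
               cand_last_term, cand_last_index,
               my_last_term, my_last_index) -> bool:
    # (a) not yet voted for another candidate this term, and
    # (b) candidate's log is at least as up-to-date as ours
    return (voted_for in (None, candidate_id) and
            log_up_to_date(cand_last_term, cand_last_index,
                           my_last_term, my_last_index))
```

Note the comparison is by (term, index), never by raw log length: a longer log with a stale last term loses.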

The candidate wins if it receives votes from a majority (⌊N/2⌋ + 1 for N servers). Randomized election timeouts prevent split votes: if all servers started elections simultaneously, they would all tie. With randomized timeouts (150-300ms), one server typically starts and wins before others time out.

Log Replication

Once elected, the leader handles all client writes:

  1. Append the entry to its local log with the current term number
  2. Send AppendEntries RPCs to all followers in parallel
  3. Once a majority have responded successfully, the entry is committed
  4. Apply the entry to the state machine and respond to the client
  5. Notify followers of the commit index via the next AppendEntries (they apply up to commit index)
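The commit decision in step 3 reduces to a majority count over replication progress. The sketch below assumes a leader tracking, per follower, the highest log index known to be replicated there (Raft's matchIndex); sorting those values together with the leader's own last index, the median is the highest index stored on a majority. Real Raft adds one caveat not modeled here: the leader only directly commits entries from its current term.

```python
def commit_index(leader_last_index: int, follower_match: list) -> int:
    """Highest log index replicated on a majority of servers.
    The leader's own log counts as one replica, so with N servers the
    (floor(N/2)+1)-th highest match index is stored on a majority."""
    indexes = sorted(follower_match + [leader_last_index], reverse=True)
    return indexes[len(indexes) // 2]
```

For a 5-server cluster with the leader at index 10 and followers at 10, 9, 3, 2, three replicas hold index 9, so 9 is committed even though two followers lag far behind.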

AppendEntries includes a consistency check: the RPC carries the previous log entry’s index and term. A follower rejects the RPC if its log does not match. The leader decrements nextIndex for that follower and retries — eventually walking back to find the point of divergence. This ensures follower logs always converge to the leader’s log.
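The consistency check can be sketched from the follower's side. This is a simplified model (real implementations batch entries, persist to disk, and return conflict hints to speed up the walk-back); the log is a list of `(term, command)` tuples with a sentinel at index 0 so indices match Raft's 1-based convention.

```python
def append_entries(log, prev_index, prev_term, entries):
    """Follower-side handling of an AppendEntries RPC.
    log[0] is a sentinel (term 0) so log positions equal Raft indices.
    Returns (success, new_log)."""
    # Consistency check: do we have an entry at prev_index with prev_term?
    if prev_index >= len(log) or log[prev_index][0] != prev_term:
        return False, log  # leader will decrement nextIndex and retry
    # Matched: drop any conflicting suffix, then append the leader's entries
    return True, log[:prev_index + 1] + list(entries)
```

Because a rejected RPC leaves the log untouched and an accepted one overwrites everything after the match point, repeated retries converge the follower's log to the leader's exactly as the paragraph describes.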

Safety: Why No Two Leaders Can Commit for the Same Index

Two leaders could theoretically exist during a network partition. Suppose Leader A (term 3) is partitioned and Leader B (term 4) is elected. Leader A cannot commit any new entries — it cannot reach a majority. Leader B can commit. When the partition heals, Leader A discovers B’s higher term, steps down, and its uncommitted entries are overwritten by B’s log. Only committed entries (acknowledged by a majority) are guaranteed to survive. The leader election guarantee ensures Leader B has all committed entries (it won a majority vote, and at least one voter had all committed entries), so no committed entry is ever lost.

Log Compaction: Snapshots

An ever-growing log is impractical. Raft uses snapshots: once the log grows large, the state machine takes a snapshot of its current state, and all log entries up to the snapshot point are discarded. The snapshot includes: last included index, last included term, and the state machine state. If a follower is too far behind (its nextIndex is before the snapshot), the leader sends InstallSnapshot RPC instead of individual AppendEntries.
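The leader's choice between the two RPCs reduces to a single comparison. A minimal sketch, with illustrative names: `next_index` is the next log index the leader would send to this follower, and `snapshot_last_index` is the last index covered by the leader's most recent snapshot.

```python
def rpc_for_follower(next_index: int, snapshot_last_index: int) -> str:
    """If the follower needs entries the leader has already discarded
    via compaction, only a snapshot can catch it up."""
    if next_index <= snapshot_last_index:
        return "InstallSnapshot"
    return "AppendEntries"
```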

Cluster Membership Changes

Adding or removing servers from a running Raft cluster is tricky — naively switching from old to new configuration can create two disjoint majorities. Raft uses joint consensus: first transition to a joint config (C_old + C_new), then to C_new. Entries must be committed by a majority of both old and new configurations during the joint phase, preventing split leadership.
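The joint-phase commit rule can be expressed as a check against both configurations. A sketch assuming configurations are simple sets of server ids and `acks` is the set of servers that have stored the entry:

```python
def joint_committed(acks: set, c_old: set, c_new: set) -> bool:
    """During joint consensus an entry commits only when a majority
    of C_old AND a majority of C_new have stored it."""
    def majority_of(cfg: set) -> bool:
        return len(acks & cfg) > len(cfg) // 2
    return majority_of(c_old) and majority_of(c_new)
```

Requiring both majorities is what prevents the failure mode the paragraph describes: no set of servers can form a majority of the old configuration alone and a disjoint set a majority of the new one at the same time.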

Production Usage

  • etcd: runs as a 3- or 5-node cluster and stores all Kubernetes cluster state (pods, services, configmaps). Uses Raft with linearizable reads (a read-index protocol avoids stale reads from followers). Typical write latency: 1-2 ms within a datacenter.
  • CockroachDB: each range (a contiguous key-value shard; 64 MB by default in older versions, 512 MB in recent ones) has its own Raft group with 3 replicas. A write must commit through Raft before returning. Multi-range transactions use a parallel commit protocol layered on top of Raft.
  • Consul: uses Raft for its service catalog and KV store; typical election timeout 150-300 ms; leader failover completes in under 1 second.

Raft vs Paxos

Multi-Paxos provides safety guarantees equivalent to Raft's, but the published algorithm has several gaps: Paxos proper defines only single-decree consensus (agreement on one value), not how to build a replicated log; leader election and log-gap handling are underspecified; and different implementations fill these gaps differently, making Paxos harder to understand and implement correctly. Raft was designed with understandability as a first-class goal; the paper includes a user study showing students grasp Raft significantly more easily than Paxos. Both provide the same safety guarantees.

Key Interview Points

  • Leader is elected by majority vote; election timeout is randomized to prevent split votes
  • A server only votes for a candidate whose log is at least as up-to-date — ensures leader has all committed entries
  • Entries commit when a majority acknowledge receipt; commit is safe even if the leader crashes immediately after
  • Two leaders can coexist briefly during partitions, but only the higher-term leader can commit new entries; lower-term entries are overwritten when partition heals
  • Snapshots compact the log; InstallSnapshot handles very-lagged followers
  • Used in: etcd, CockroachDB, TiKV, Consul, YugabyteDB

Frequently Asked Questions

How does Raft prevent split-brain (two leaders) in a distributed cluster?

Raft prevents split-brain through two mechanisms. First, leader election requires a majority vote (more than half the cluster). In a 5-node cluster, a candidate needs 3 votes. If the network partitions into two groups (2 nodes and 3 nodes), only the group of 3 can form a majority and elect a leader; the group of 2 cannot get 3 votes. Because any two majorities overlap, at most one leader can be elected per term (a stale lower-term leader may linger briefly, but it cannot commit anything). Second, each Raft term is a monotonically increasing integer, and every RPC carries the sender's term. If a leader receives an RPC with a higher term, it immediately steps down and reverts to follower status. If Leader A (term 3) is isolated and Leader B (term 4) is elected, then when the partition heals, Leader A sees B's term-4 messages and immediately steps down. The higher-term leader always wins. Leader A's uncommitted entries (those without majority acknowledgment) are safely overwritten by Leader B's log; they were never committed, so no consistency violation occurs.

What happens to in-flight writes when the Raft leader fails?

When the Raft leader fails, any log entries that were not yet replicated to a majority are lost — they were never committed and never acknowledged as successful to the client. The client receives no response (timeout) and must retry. This is why clients must implement idempotent retries with idempotency keys: if the same write is submitted again to the new leader, the new leader processes it, and the idempotency key ensures the write is not applied twice. Log entries that WERE replicated to a majority before the leader failed are committed and durable — they will appear in any new leader's log (because a server can only win the election if its log is at least as up-to-date as a majority of voters, and at least one voter in the majority has the committed entry). So the election guarantee ensures committed entries are never lost, even across leader failures.
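The idempotent-retry pattern described above can be sketched as a dedup table keyed by a client-supplied id. This is illustrative only: the class shape and names are assumptions, and production systems (including the design in the Raft paper's client-interaction discussion) typically use a client id plus sequence number rather than a free-form key.

```python
class StateMachine:
    """Toy key-value state machine that deduplicates by idempotency key."""

    def __init__(self):
        self.data = {}     # the actual KV state
        self.applied = {}  # idempotency key -> result of first application

    def apply(self, idem_key, op):
        """Apply (k, v) once; a replay of the same key returns the cached
        result instead of applying the write a second time."""
        if idem_key in self.applied:
            return self.applied[idem_key]
        k, v = op
        self.data[k] = v
        self.applied[idem_key] = v
        return v
```

Because every replica applies the same log in the same order, the dedup table stays consistent across the cluster, so a retry landing on the new leader is recognized even though the original leader is gone.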

Why does Raft use randomized election timeouts?

If all followers had the same election timeout, they would all start an election simultaneously when the leader fails. Each would vote for itself, and no candidate would receive a majority — resulting in a split vote. The cluster would retry repeatedly with no progress. Randomized timeouts (typically 150-300ms, each server independently random) ensure that with high probability, one server's timeout fires before others. That server becomes a candidate, sends RequestVote RPCs, and receives votes from the other servers (which have not yet timed out and therefore have not voted in this term). The first server to time out wins the election before others even start. In practice, Raft clusters elect a new leader in one to two election timeout periods (150-300ms) after a leader failure. The randomization range matters: too narrow and ties are frequent; too wide and failover takes longer. The range should be significantly larger than the network round-trip time but small enough that failover is acceptable.

