Low Level Design: Consensus Protocol Service

Consensus Protocol Service: Raft

Raft is a consensus algorithm designed to be understandable. It decomposes consensus into relatively independent subproblems: leader election, log replication, and safety.

Raft Roles

Every server is in one of three states:

  • Follower — passive; responds to requests from leaders and candidates.
  • Candidate — used to elect a new leader.
  • Leader — handles all client requests; replicates log entries to followers.

Leader Election

Followers start an election when they do not receive a heartbeat (an empty AppendEntries RPC from the leader) within the election timeout (randomized 150–300 ms). On timeout:

  1. Follower increments current term and transitions to Candidate.
  2. Votes for itself and sends RequestVote RPC to all peers.
  3. If it receives votes from a majority, it becomes Leader.
  4. If another leader is discovered (higher term in reply), reverts to Follower.
  5. If the election times out with no winner (a split vote), the candidate starts a new election with a fresh randomized timeout.

Term number: a monotonically increasing logical clock. Each server votes at most once per term (first-come-first-served). A server rejects a RequestVote if its own log is more up-to-date than the candidate's, comparing the last entries' terms first and log lengths second.
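The vote-granting rules above can be sketched as follows. This is a minimal illustration, not a real implementation: the dict-based server state, field names, and `handle_request_vote` are all assumptions for the sketch, and the log is a list of `(term, command)` pairs with 1-based indices.

```python
def handle_request_vote(state, req):
    """Sketch of the RequestVote receiver rules.

    state: {"current_term": int, "voted_for": str|None, "log": [(term, cmd)]}
    req:   {"term", "candidate_id", "last_log_term", "last_log_index"}
    """
    if req["term"] < state["current_term"]:        # stale candidate
        return {"term": state["current_term"], "vote_granted": False}
    if req["term"] > state["current_term"]:        # newer term: adopt it, clear vote
        state["current_term"] = req["term"]
        state["voted_for"] = None
    last_term = state["log"][-1][0] if state["log"] else 0
    last_index = len(state["log"])
    # Candidate's log must be at least as up-to-date: compare last terms,
    # then log lengths (tuple comparison does exactly this).
    up_to_date = (req["last_log_term"], req["last_log_index"]) >= (last_term, last_index)
    if state["voted_for"] in (None, req["candidate_id"]) and up_to_date:
        state["voted_for"] = req["candidate_id"]   # at most one vote per term
        return {"term": state["current_term"], "vote_granted": True}
    return {"term": state["current_term"], "vote_granted": False}
```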

Log Replication

The leader handles all writes:

  1. Client sends command to Leader.
  2. Leader appends entry to its local log: (term, index, command).
  3. Leader sends AppendEntries RPC to all followers in parallel.
  4. Once a majority has acknowledged, the entry is committed.
  5. Leader applies command to its state machine and responds to client.
  6. Leader notifies followers of the commit index in the next heartbeat; followers apply committed entries.
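The follower side of steps 3 and 6 can be sketched as below. This is a simplified illustration (the state shapes and `handle_append_entries` name are assumptions): it performs the prev-index/prev-term consistency check and then unconditionally replaces the suffix, whereas a real implementation only truncates on an actual conflict.

```python
def handle_append_entries(state, req):
    """Sketch of the AppendEntries receiver.

    state: {"current_term": int, "log": [(term, cmd)], "commit_index": int}
    req:   {"term", "prev_log_index", "prev_log_term", "entries", "leader_commit"}
    """
    if req["term"] < state["current_term"]:        # reject stale leader
        return False
    state["current_term"] = req["term"]
    log = state["log"]
    # Consistency check: the entry at prev_log_index must match prev_log_term.
    if req["prev_log_index"] > 0:
        if len(log) < req["prev_log_index"]:
            return False                           # log too short: leader backs up
        if log[req["prev_log_index"] - 1][0] != req["prev_log_term"]:
            return False                           # term mismatch: conflict
    # Simplification: drop everything after prev_log_index and append.
    state["log"] = log[: req["prev_log_index"]] + req["entries"]
    # Advance commit index as instructed by the leader (step 6).
    if req["leader_commit"] > state["commit_index"]:
        state["commit_index"] = min(req["leader_commit"], len(state["log"]))
    return True
```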

Log Entry Structure

LogEntry {
    term:    int       // term when entry was received by leader
    index:   int       // position in log
    command: bytes     // client command / state machine input
}

Commitment Rule and Safety

An entry is committed once it is stored on a majority of servers. Safety guarantee: a leader never overwrites or deletes entries in its own log (the Leader Append-Only property). A leader only directly commits entries from its current term; entries from previous terms are committed indirectly when a later entry from the current term is committed. This restriction preserves Raft's Leader Completeness property: every committed entry is present in the log of every future leader.
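The leader-side commit rule can be sketched as a scan for the highest index replicated on a majority, with the current-term restriction applied. The function name `try_commit` and the `match_index` map (follower id to highest replicated index) are assumptions for this sketch.

```python
def try_commit(match_index, commit_index, log, current_term, cluster_size):
    """Return the new commit index: the highest N > commit_index such that
    a majority stores entry N and log[N] was created in current_term.

    match_index: {follower_id: highest index known replicated on that follower}
    log:         list of (term, command), 1-based indices
    """
    for n in range(len(log), commit_index, -1):
        # The leader itself counts as one replica of every entry it holds.
        replicas = 1 + sum(1 for m in match_index.values() if m >= n)
        if replicas * 2 > cluster_size and log[n - 1][0] == current_term:
            return n
    return commit_index
```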

Membership Changes

Adding or removing servers uses joint consensus: during the transition, both the old and new quorums must agree on decisions. This prevents split-brain when moving from one cluster configuration to another.
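The joint-consensus quorum rule above amounts to requiring two independent majorities. A minimal sketch (the `joint_quorum` helper and set-based membership are illustrative assumptions):

```python
def joint_quorum(acks, old_members, new_members):
    """During joint consensus (C_old,new), a decision needs a majority of
    BOTH the old and the new configuration, counted independently."""
    def majority(members):
        votes = sum(1 for m in members if m in acks)
        return votes * 2 > len(members)
    return majority(old_members) and majority(new_members)
```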

Snapshotting

When the log grows too large, each server independently takes a snapshot of the current state machine state and truncates its log up to the snapshot index.

Snapshot {
    last_included_index: int
    last_included_term:  int
    state_machine_state: bytes
}

InstallSnapshot RPC: the leader sends a snapshot to followers that have fallen too far behind (i.e., whose next needed entry has already been compacted away). The follower discards any log entries covered by the snapshot and loads the snapshot's state machine state.
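Snapshot creation and log truncation can be sketched as below. This is an illustrative simplification (the `compact_log` name is an assumption): it assumes the in-memory log still starts at index 1, i.e. no earlier compaction has happened.

```python
def compact_log(log, snapshot_index, snapshot_term, state_bytes):
    """Build a snapshot up to snapshot_index and drop the covered log prefix.

    log: list of (term, command) with 1-based indices, starting at index 1.
    Returns (snapshot, remaining_log).
    """
    snapshot = {
        "last_included_index": snapshot_index,
        "last_included_term": snapshot_term,
        "state_machine_state": state_bytes,
    }
    # Keep only the entries after the snapshot point.
    return snapshot, log[snapshot_index:]
```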

Linearizability and Duplicate Detection

Each command must execute exactly once even if the client retries after a timeout. Implement client sessions: assign each client a unique ID and a sequence number per request. The leader tracks the last applied sequence number per client and deduplicates replayed commands.
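The session-based deduplication described above can be sketched as a small table keyed by client ID. The `SessionTable` class and its `apply` signature are assumptions for the sketch; real systems also cache the result so a deduplicated retry can still be answered.

```python
class SessionTable:
    """Per-client dedup: remember the last applied sequence number and its
    result; a replayed command returns the cached result without re-executing."""

    def __init__(self):
        self.sessions = {}  # client_id -> (last_seq, last_result)

    def apply(self, client_id, seq, command, state_machine):
        last = self.sessions.get(client_id)
        if last is not None and seq <= last[0]:
            return last[1]                  # duplicate: do not re-execute
        result = state_machine(command)     # execute exactly once
        self.sessions[client_id] = (seq, result)
        return result
```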

Key Design Decisions

  • Randomized election timeouts reduce the chance of split votes.
  • Heartbeat interval should be significantly less than the election timeout (e.g., 50 ms heartbeat, 150–300 ms election timeout).
  • Persistent state (currentTerm, votedFor, log) must be written to stable storage before responding to RPCs to survive crashes.
  • Pipelining: leader can send multiple AppendEntries RPCs without waiting for ACKs to improve throughput.
  • Read-only optimizations: leader can serve reads without a log round-trip by confirming it is still leader via a heartbeat quorum check.

