Consensus Log Service Low-Level Design: Raft-Based Append, Leader Lease, and Log Compaction

What a Consensus Log Provides

A consensus log is a fault-tolerant, ordered sequence of records that all nodes in a cluster agree on. It is the foundation of replicated state machines: any system that can replay a log to reconstruct state can use a consensus log to stay synchronized across failures. Raft was designed explicitly for understandability and is the basis for etcd, CockroachDB, and TiKV.

Raft Roles and Terms

Every node is in one of three roles at any time:

  • Leader: Accepts writes, replicates to followers, drives commits
  • Follower: Receives log entries and heartbeats from leader
  • Candidate: Transitional role held during an election, while soliciting votes

Time is divided into terms — monotonically increasing integers that act as a logical clock. Each term has at most one leader. Terms advance on election. Any message from a higher term causes a node to revert to follower and update its term.

Persistent state that each node must preserve across crashes: currentTerm, votedFor, log[].
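The persistent fields and the higher-term rule can be sketched as follows; the `Node` class and its field names are illustrative, not from any particular implementation:

```python
class Node:
    """Minimal sketch of Raft per-node state and the higher-term rule."""

    def __init__(self):
        # Persistent state: must be flushed to disk before responding to RPCs.
        self.current_term = 0    # latest term this node has seen
        self.voted_for = None    # candidate voted for in current_term, if any
        self.log = []            # list of (term, command) entries
        # Volatile state.
        self.role = "follower"

    def observe_term(self, msg_term):
        """Any message carrying a higher term demotes this node to follower."""
        if msg_term > self.current_term:
            self.current_term = msg_term
            self.voted_for = None   # votes are per-term, so the vote resets
            self.role = "follower"
```

Note that a lower-term message never changes state; it is simply rejected by the receiver.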

Leader Election

When a follower's election timeout fires without receiving a heartbeat (randomized between 150–300ms to avoid split votes):

  1. Increment currentTerm, transition to candidate, vote for self
  2. Send RequestVote(term, candidateId, lastLogIndex, lastLogTerm) to all nodes
  3. A node grants its vote if: it hasn't voted this term AND the candidate's log is at least as up-to-date (higher last log term, or same last log term and a log at least as long)
  4. Candidate wins if it receives votes from a majority

The up-to-date check ensures the new leader has all committed entries — Raft's safety guarantee.
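The vote-granting rule in steps 3–4 can be stated directly in code; a sketch with illustrative function names:

```python
def log_up_to_date(cand_last_term, cand_last_index, my_last_term, my_last_index):
    """True if the candidate's log is at least as up-to-date as ours:
    a higher last term wins outright; equal terms compare log length."""
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term
    return cand_last_index >= my_last_index


def should_grant_vote(voted_for, candidate_id,
                      cand_last_term, cand_last_index,
                      my_last_term, my_last_index):
    """Grant at most one vote per term, and only to a sufficiently fresh log."""
    not_yet_voted = voted_for in (None, candidate_id)
    return not_yet_voted and log_up_to_date(
        cand_last_term, cand_last_index, my_last_term, my_last_index)
```

The `voted_for in (None, candidate_id)` check also lets a node re-grant its vote to the same candidate, which keeps retransmitted RequestVote RPCs idempotent.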

Log Replication

The leader receives a client write, appends it to its local log, then replicates:

  1. Send AppendEntries(term, leaderId, prevLogIndex, prevLogTerm, entries[], leaderCommit) to all followers in parallel
  2. Follower accepts if its log matches at prevLogIndex/prevLogTerm, appends the new entries
  3. Leader waits for a majority ACK (including itself)
  4. Leader advances commitIndex to the new entry's index (Raft only advances commitIndex directly for entries from the leader's current term; older entries commit transitively), notifies followers of the new commitIndex in the next heartbeat
  5. Leader and all followers apply committed entries to their state machines in order

Followers that are behind receive a batch of missing entries. The leader tracks each follower's nextIndex and decrements it on conflict until it finds the matching point, then replays forward.
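The commit rule above reduces to a majority count over acknowledged indices. A sketch, assuming `match_index` holds one value per node with the leader's own log length included (names are illustrative):

```python
def advance_commit_index(match_index, commit_index, log_terms, current_term):
    """Return the highest index replicated on a majority whose entry was
    created in the current term. Raft only commits current-term entries
    directly; earlier entries then commit transitively."""
    n = len(match_index)  # one entry per node, leader included
    # Scan from the log tail down to the current commit point.
    for idx in range(len(log_terms), commit_index, -1):
        acked = sum(1 for m in match_index if m >= idx)
        if acked * 2 > n and log_terms[idx - 1] == current_term:
            return idx
    return commit_index
```

The current-term restriction is what prevents the overwrite-of-committed-entries anomaly described in the Raft paper's safety argument.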

Log Entry Schema

LogEntry {
  index:   uint64   // 1-based position in log
  term:    uint64   // term when entry was created
  command: string   // state machine command type
  data:    bytes    // serialized command payload
}

Linearizable Reads

A naive read from the leader could return stale data if the leader is partitioned and a new leader has been elected. Two approaches for linearizable reads:

  • Read index: Leader sends a heartbeat round, waits for majority ACK confirming it is still leader, then serves the read at its current commitIndex. Adds one round-trip latency.
  • Leader lease: After winning an election, the leader knows it is the only leader for the minimum election timeout duration (followers won't start a new election until their timeout fires). The leader tracks its lease window and serves reads without a round-trip during the lease. Requires clock drift across nodes to stay within a known bound.
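A lease gate can be sketched as below. The timing constants and class names are assumptions for illustration; real systems subtract a clock-drift bound from the window and renew the lease on every quorum heartbeat:

```python
import time

class LeaderLease:
    """Sketch of a lease-based read gate (illustrative constants and names)."""

    def __init__(self, min_election_timeout=0.15, max_clock_drift=0.01):
        # Shrink the window by the drift bound so a fast follower clock
        # cannot start an election while we still believe the lease holds.
        self.window = min_election_timeout - max_clock_drift
        self.lease_start = None

    def on_quorum_ack(self, round_sent_at):
        """Renew the lease from the time the heartbeat round was *sent*,
        so slow ACKs cannot stretch the window past followers' timeouts."""
        self.lease_start = round_sent_at

    def can_serve_local_read(self, now=None):
        if now is None:
            now = time.monotonic()
        return self.lease_start is not None and now < self.lease_start + self.window
```

Measuring the lease from send time rather than ACK time is the conservative choice: a follower's election timer may have started ticking the moment the heartbeat left the leader.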

Log Compaction with Snapshots

Without compaction, the log grows without bound. Periodically, the state machine takes a snapshot:

  1. State machine serializes its complete state at snapshotIndex
  2. Snapshot is written durably with lastIncludedIndex and lastIncludedTerm
  3. Log entries at index <= snapshotIndex are discarded

Slow followers that have fallen too far behind receive the snapshot via InstallSnapshot RPC instead of a log replay. The follower replaces its state with the snapshot and resumes log replication from snapshotIndex + 1.
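The compaction steps reduce to: record the snapshot metadata, then drop the covered prefix. A minimal in-memory sketch (a real implementation writes the snapshot durably before truncating, and keeps lastIncludedTerm for the AppendEntries consistency check):

```python
def compact_log(log, snapshot_index, state_bytes):
    """`log` maps 1-based index -> (term, payload). Returns the snapshot
    record and the surviving log tail."""
    snapshot = {
        "last_included_index": snapshot_index,
        "last_included_term": log[snapshot_index][0],
        "state": state_bytes,
    }
    # Entries at index <= snapshot_index are now redundant with the snapshot.
    tail = {i: entry for i, entry in log.items() if i > snapshot_index}
    return snapshot, tail
```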

Cluster Membership Changes

Adding or removing nodes safely requires care — a naive simultaneous switch can create two independent majorities. Raft uses a one-at-a-time approach:

  • Add new member as a non-voting learner: it receives log replication but does not count toward quorum. Wait until it catches up to within one election timeout of the leader.
  • Promote to voter via a separate configuration change committed through the log. Now it counts toward majority.

Removing a node is the reverse: demote to learner first, then remove from config.
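The learner/voter distinction only affects quorum arithmetic. A sketch of the two-step add (config shape and names are illustrative):

```python
def quorum(voters):
    """Majority size over voting members only; learners never count."""
    return len(voters) // 2 + 1

def add_learner(config, node):
    """Learner receives log replication but has no quorum impact."""
    config["learners"].add(node)

def promote_to_voter(config, node):
    """Applied when the config-change entry commits through the log."""
    config["learners"].discard(node)
    config["voters"].add(node)
```

Because each step changes the voter set by at most one node, any old-config majority and new-config majority must overlap, which is what rules out two independent quorums.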

Split-Brain Prevention

The majority requirement is the core safety mechanism. In a 5-node cluster, any write requires 3 ACKs. If a network partition creates a group of 2 and a group of 3, only the group of 3 can form a majority and elect a leader. The group of 2 is leaderless and rejects writes. No data is committed on both sides of the partition simultaneously.
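The partition arithmetic is simple to state: only a strict majority can elect, and therefore commit. A tiny sketch:

```python
def can_elect_leader(partition_size, cluster_size):
    """A partition can elect (and thus commit) only with a strict majority."""
    return 2 * partition_size > cluster_size
```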

