Requirements and Constraints
A consensus service provides a strongly consistent, fault-tolerant replicated state machine that multiple clients can use to coordinate distributed state. It is the foundation for distributed locks, leader election, configuration management, and service discovery. Functional requirements: linearizable reads and writes, durability of committed entries across node failures, cluster membership changes without downtime, and log compaction to bound storage growth.
Constraints: the service guarantees progress as long as a majority of nodes are reachable (crash-fault tolerant, not Byzantine). Each write costs at least one network round trip to a quorum plus a durable fsync, typically targeting under 10 ms p99 in a co-located datacenter. Membership changes must not introduce a window in which two disjoint majorities can exist simultaneously.
Core Data Model
Log Entry
- index (int64) — position in the log; 1-indexed
- term (int64) — leader term when entry was created
- entry_type (enum) — COMMAND, CONFIGURATION, NO_OP
- data (bytes) — opaque payload applied to state machine
Snapshot
- last_included_index (int64) — log index of last entry captured
- last_included_term (int64) — term of that entry
- state_machine_state (bytes) — complete serialized state machine
- cluster_config — node membership at snapshot time
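The two record types above can be sketched as plain dataclasses. This is a minimal illustration of the fields listed, not a wire format; the enum values and the tuple type for `cluster_config` are assumptions.

```python
from dataclasses import dataclass
from enum import Enum

class EntryType(Enum):
    COMMAND = 1
    CONFIGURATION = 2
    NO_OP = 3

@dataclass(frozen=True)
class LogEntry:
    index: int            # 1-indexed position in the log
    term: int             # leader term when the entry was created
    entry_type: EntryType
    data: bytes           # opaque payload applied to the state machine

@dataclass(frozen=True)
class Snapshot:
    last_included_index: int   # log index of the last entry captured
    last_included_term: int    # term of that entry
    state_machine_state: bytes # complete serialized state machine
    cluster_config: tuple      # node membership at snapshot time
```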
Cluster Configuration Entry
Membership changes are themselves log entries. A configuration entry contains the new member set. Raft uses a joint consensus approach: a transitional configuration C-old,new is committed first (requiring majorities of both old and new clusters), followed by C-new. This prevents any window where two independent majorities could form.
Key Algorithms and Logic
Log Replication
Once elected, the leader handles client writes by appending to its local log and sending AppendEntries RPCs to all followers in parallel. A log entry is committed once a majority of nodes (including the leader) have persisted it. The leader then advances commit_index, applies the entry to its state machine, and responds to the client. Followers apply committed entries independently by tracking commit_index from the leader's heartbeats.
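The commit step above can be sketched as follows: the leader scans replicated indexes and advances `commit_index` to the highest index stored on a majority, with the standard Raft restriction that only entries from the current term may be committed by counting replicas. The dict-based log representation is an illustration, not a real storage layout.

```python
def advance_commit_index(match_index, current_term, log, commit_index):
    """Advance commit_index to the highest index replicated on a majority.

    match_index: {node_id: highest log index known replicated on that node},
    including the leader itself.
    log: {index: term} for retained entries.
    Only entries from current_term may be committed by counting replicas.
    """
    for n in sorted(match_index.values(), reverse=True):
        if n <= commit_index:
            break  # nothing higher can be committed
        replicas = sum(1 for m in match_index.values() if m >= n)
        if replicas * 2 > len(match_index) and log.get(n) == current_term:
            return n
    return commit_index
```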
Log Consistency Check
Each AppendEntries RPC includes prevLogIndex and prevLogTerm. A follower rejects the RPC if its log does not contain an entry at prevLogIndex with matching term. On rejection, the leader decrements next_index for that follower and retries. An optimization (nextIndex binary search or conflict term hint) reduces the number of round trips needed to find the follower's divergence point.
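A follower-side sketch of this check, including the conflict-term hint mentioned above (the hint's field names are assumptions; the dict-based log assumes contiguous indexes from 1):

```python
def append_entries_check(follower_log, prev_log_index, prev_log_term):
    """Follower-side consistency check for AppendEntries.

    follower_log: {index: term}. Returns (success, conflict_hint); the hint
    lets the leader skip back past an entire conflicting term instead of
    decrementing next_index one entry at a time.
    """
    if prev_log_index == 0:
        return True, None  # empty prefix always matches
    term = follower_log.get(prev_log_index)
    if term == prev_log_term:
        return True, None
    if term is None:
        # follower's log is too short; hint the leader with its length
        return False, {"conflict_term": None,
                       "first_index": len(follower_log) + 1}
    # find the first index of the conflicting term so the leader can skip it
    first = prev_log_index
    while first > 1 and follower_log.get(first - 1) == term:
        first -= 1
    return False, {"conflict_term": term, "first_index": first}
```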
Log Compaction via Snapshotting
- When a node's log exceeds a size threshold (e.g., 64 MB or 10K entries), it takes a snapshot of the state machine and records the snapshot's last_included_index and last_included_term.
- Log entries up to and including last_included_index are discarded.
- If a follower is so far behind that the leader has already discarded the entries it needs, the leader sends an InstallSnapshot RPC instead of AppendEntries. The follower replaces its state machine with the snapshot and resumes normal replication.
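The threshold-triggered snapshot-and-truncate step can be sketched as below. The in-memory dict log and the snapshot field names are illustrative assumptions; a real implementation snapshots asynchronously and persists to disk.

```python
def maybe_compact(log, state_machine_bytes, config, last_applied,
                  max_entries=10_000):
    """Snapshot and truncate once the log exceeds max_entries.

    log: {index: (term, data)}. Returns (snapshot, truncated_log);
    snapshot is None when no compaction was needed.
    """
    if len(log) <= max_entries:
        return None, log
    last_term = log[last_applied][0]
    snapshot = {
        "last_included_index": last_applied,
        "last_included_term": last_term,
        "state": state_machine_bytes,
        "config": config,
    }
    # discard entries up to and including last_included_index
    truncated = {i: e for i, e in log.items() if i > last_applied}
    return snapshot, truncated
```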
Membership Changes
To add or remove nodes: the leader appends a joint configuration entry C-old,new. During joint consensus, all decisions require majorities of both old and new membership. Once C-old,new is committed, the leader appends C-new. Once C-new is committed, the cluster operates under new membership only. Nodes not in C-new shut themselves down gracefully. This two-phase approach ensures no split-brain during the transition.
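The defining rule of the joint phase — a decision counts only with a majority of both member sets, tallied independently — can be sketched as:

```python
def joint_quorum(acks, old_members, new_members):
    """During joint consensus (C-old,new), a commit or election decision
    needs a majority of BOTH the old and new member sets, counted
    independently. acks is the set of node ids that have acknowledged."""
    def majority(members):
        votes = sum(1 for m in members if m in acks)
        return votes * 2 > len(members)
    return majority(old_members) and majority(new_members)
```

Note that a set of acks large enough for one configuration may still fail the other, which is exactly what prevents two independent majorities from forming mid-transition.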
API Design
Client-Facing API
- POST /propose — body: { command: bytes }. Blocks until committed; returns { log_index, term, result }. Returns 503 if not leader, with a redirect to the leader's address.
- GET /read?linearizable=true — linearizable read using ReadIndex; linearizable=false allows stale follower reads.
- POST /membership/add — body: { node_id, address }. Triggers a joint consensus change.
- POST /membership/remove — body: { node_id }. Removes the node via joint consensus.
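The ReadIndex mechanism behind the linearizable read deserves a brief sketch: the leader records its commit index when the read arrives, confirms it is still leader with one heartbeat round, and serves the read once its state machine has applied up to that index. The class below is a minimal illustration of that flow, not wired to real RPCs.

```python
class ReadIndexRead:
    """Leader-side ReadIndex flow (sketch):
    1. Record commit_index at read arrival (the read index).
    2. Confirm leadership via a majority of heartbeat acks.
    3. Serve locally once last_applied >= read_index.
    """
    def __init__(self, commit_index):
        self.read_index = commit_index
        self.confirmed = False

    def heartbeat_acks(self, acks, cluster_size):
        # a majority of heartbeat responses confirms we are still leader
        self.confirmed = acks * 2 > cluster_size
        return self.confirmed

    def can_serve(self, last_applied):
        return self.confirmed and last_applied >= self.read_index
```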
Internal RPC
- AppendEntries — log replication and heartbeat; described above.
- RequestVote(term, candidateId, lastLogIndex, lastLogTerm) — election RPC; a node grants its vote only if the candidate's log is at least as up-to-date as its own.
- InstallSnapshot(term, leaderId, lastIncludedIndex, lastIncludedTerm, offset, data, done) — chunked snapshot transfer to lagging followers.
Scalability Considerations
- Write batching: batch multiple client proposals into a single AppendEntries RPC per heartbeat interval (pipeline batching) to amortize per-RPC overhead and increase throughput significantly.
- Parallel disk writes: persist log entries to disk asynchronously (group commit) and acknowledge the quorum write only after fsync completes on a majority, balancing durability with throughput.
- Multi-Raft: partition state across multiple independent Raft groups (shards), each with its own leader. This eliminates the single-leader write bottleneck and allows horizontal scaling of throughput.
- Follower reads: serve read-heavy workloads from followers using bounded staleness — clients specify an acceptable lag, and followers serve reads if their applied index is within that bound.
- Snapshot chunking: large snapshots are transferred in chunks (e.g., 1MB each) with flow control to avoid saturating the network during recovery of a lagging node.
- Pre-vote optimization: before starting a real election, a candidate sends PreVote RPCs to check if it can win. This prevents unnecessary term increments from isolated nodes that repeatedly time out and disrupt the cluster on reconnect.
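The write-batching idea in the first bullet can be sketched as a byte-capped drain of the proposal queue; queued commands are folded into one AppendEntries payload so the per-RPC and fsync costs are paid once per batch. The list-based queue and the byte cap are illustrative assumptions.

```python
def drain_batch(pending, max_batch_bytes):
    """Write batching: drain queued client proposals (front of the list
    first) into a single AppendEntries payload, capped at max_batch_bytes,
    to amortize per-RPC and fsync overhead."""
    batch, size = [], 0
    while pending and size + len(pending[0]) <= max_batch_bytes:
        cmd = pending.pop(0)
        batch.append(cmd)
        size += len(cmd)
    return batch
```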
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "How does Raft log replication work?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "The leader appends a new entry to its local log and sends AppendEntries RPCs in parallel to all followers. Each follower appends the entry if its log is consistent with the leader's prevLogIndex and prevLogTerm fields, then acknowledges. Once the leader receives acknowledgements from a majority it marks the entry committed and applies it to the state machine, then notifies followers in the next heartbeat so they can advance their commit index."
      }
    },
    {
      "@type": "Question",
      "name": "What is the Raft leader commit rule and why is it required?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "A leader may only commit a log entry from its current term once it is stored on a majority of servers. It must not commit entries from previous terms by counting replicas alone; doing so can cause committed entries to be overwritten by a future leader with a longer log. The rule ensures that any node elected leader always has all committed entries."
      }
    },
    {
      "@type": "Question",
      "name": "How do log compaction and snapshotting prevent unbounded log growth?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Each server periodically serializes its current state-machine state into a snapshot, recording the last included log index and term. All log entries up to that index can then be discarded. When a lagging follower is too far behind for the leader's retained log, the leader sends the snapshot via InstallSnapshot RPC so the follower can fast-forward without replaying thousands of old entries."
      }
    },
    {
      "@type": "Question",
      "name": "How does joint consensus handle cluster membership changes safely?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Raft uses a two-phase joint consensus to avoid a window where two independent majorities could elect separate leaders. First, a joint configuration entry (containing both old and new member sets) is committed; during this phase decisions require a majority from both sets. Once committed, the new configuration entry is replicated and committed using only the new majority, completing the transition safely."
      }
    }
  ]
}