Low Level Design: Message Queue Internals

Log-Based Storage

Modern message queues (Kafka, Pulsar, Redpanda) use an append-only log as the storage primitive. Producers write messages sequentially to the end of the log; each message is assigned an offset — either a byte position or a monotonically increasing integer index. Sequential writes exploit disk hardware: a modern SSD sustains 500MB/s+ sequential write vs. 100MB/s+ random write; spinning disks show an even larger gap. The log is never mutated: old messages are retained for a configurable period (hours, days) or until the log reaches a size limit. Consumers track their own read position (offset) rather than the broker tracking per-consumer state. The log is split into segment files (e.g., 1GB each) so that cleanup of old messages only requires deleting the oldest segment files.
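The segmented log described above can be sketched in a few lines. This is an illustrative in-memory model (the class name and tiny segment size are invented for the example), not a real broker's storage engine:

```python
class SegmentedLog:
    """Minimal sketch of a segmented, append-only log."""

    def __init__(self, segment_size=3):
        self.segment_size = segment_size   # max records per segment (tiny, for illustration)
        self.segments = [[]]               # each inner list stands in for a segment file
        self.base_offsets = [0]            # first offset stored in each segment
        self.next_offset = 0               # monotonically increasing offset counter

    def append(self, message):
        if len(self.segments[-1]) == self.segment_size:
            # Roll a new segment; the old one is immutable from now on.
            self.segments.append([])
            self.base_offsets.append(self.next_offset)
        self.segments[-1].append(message)
        offset = self.next_offset
        self.next_offset += 1
        return offset

    def read(self, offset):
        # Find the segment whose offset range covers the requested offset.
        for base, seg in zip(self.base_offsets, self.segments):
            if base <= offset < base + len(seg):
                return seg[offset - base]
        raise KeyError(offset)  # deleted by retention, or not yet written

    def delete_oldest_segment(self):
        # Retention only ever drops whole segments; nothing is mutated in place.
        if len(self.segments) > 1:
            self.segments.pop(0)
            self.base_offsets.pop(0)
```

Note how retention is just `pop(0)` on the segment list: because the log is never mutated, cleanup never needs to rewrite data that consumers might still be reading.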

Producer Protocol

A producer connects to the cluster and fetches metadata: which broker is the leader for each partition. Writes always go to the partition leader. Producers send messages in batches: accumulate messages in a buffer for up to linger.ms milliseconds or until the batch reaches batch.size bytes, then send a single network request containing the entire batch, optionally compressed with gzip, Snappy, or LZ4. Acknowledgment modes:

  • acks=0: fire-and-forget; no durability guarantee.
  • acks=1: leader writes to its local log and acknowledges; data lost if leader crashes before replication.
  • acks=all: leader waits for all in-sync replicas to acknowledge; strongest durability.
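The linger/size-based batching described above can be sketched as follows. This is a simplified single-threaded model (a real client flushes on a background timer; `send_fn` is a stand-in for the network request):

```python
import time

class BatchingProducer:
    """Sketch of linger.ms / batch.size accumulation."""

    def __init__(self, send_fn, linger_ms=10, batch_size=1024):
        self.send_fn = send_fn            # called once per batch, not per message
        self.linger_s = linger_ms / 1000.0
        self.batch_size = batch_size      # flush threshold in bytes
        self.buffer = []
        self.buffer_bytes = 0
        self.first_append = None          # when the current batch started filling

    def produce(self, message: bytes):
        if self.first_append is None:
            self.first_append = time.monotonic()
        self.buffer.append(message)
        self.buffer_bytes += len(message)
        # Flush when the batch is full, or when it has lingered long enough.
        if (self.buffer_bytes >= self.batch_size
                or time.monotonic() - self.first_append >= self.linger_s):
            self.flush()

    def flush(self):
        if self.buffer:
            self.send_fn(list(self.buffer))  # one request carries the whole batch
            self.buffer.clear()
            self.buffer_bytes = 0
            self.first_append = None
```

The trade-off is visible in the two parameters: a larger `linger_ms` improves throughput (bigger batches, better compression ratios) at the cost of per-message latency.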

Idempotent producers (enabled via enable.idempotence=true) attach a producer ID and per-partition sequence number to each batch. The broker rejects any batch whose sequence number it has already persisted for that producer and partition, so retries cannot introduce duplicates within a partition.
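The broker-side deduplication can be sketched as a map from (producer ID, partition) to the highest sequence number already persisted. A hypothetical minimal model:

```python
class Broker:
    """Sketch of idempotent-producer dedup on the broker side."""

    def __init__(self):
        self.log = []
        self.last_seq = {}  # (producer_id, partition) -> highest sequence appended

    def append_batch(self, producer_id, partition, sequence, messages):
        key = (producer_id, partition)
        if self.last_seq.get(key, -1) >= sequence:
            # Retry of a batch that was already persisted: acknowledge
            # success without appending again.
            return "duplicate"
        self.last_seq[key] = sequence
        self.log.extend(messages)
        return "appended"
```

A retry scenario: the producer sends sequence 0, the ack is lost in the network, and the producer resends the same batch; the broker sees sequence 0 a second time and acknowledges without duplicating the data.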

Consumer Group Protocol

A consumer group is a set of consumer instances that collectively consume a topic. The invariant is that each partition is assigned to exactly one consumer in the group at any time — enabling parallel processing without duplicate consumption. A designated broker acts as the group coordinator. When a consumer joins or leaves (or crashes, detected via missed heartbeats), the coordinator triggers a rebalance. Eager rebalance: all consumers revoke all partitions, the coordinator reassigns the full partition set, consumers resume. Simple but causes a full processing pause. Cooperative (incremental) rebalance: only partitions that need to move are revoked and reassigned; other consumers keep processing. Kafka’s CooperativeStickyAssignor minimizes partition movement across rebalances.
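The "sticky" idea behind cooperative rebalancing can be sketched as an assignment function that keeps prior ownership wherever the balance allows, so only the overflow moves. This is a simplified stand-in, not Kafka's actual CooperativeStickyAssignor algorithm:

```python
import math

def sticky_assign(partitions, consumers, previous=None):
    """Sketch: preserve prior ownership where it fits, then hand
    leftover partitions to the least-loaded consumers."""
    previous = previous or {}                           # partition -> prior owner
    cap = math.ceil(len(partitions) / len(consumers))   # balanced upper bound
    assignment = {c: [] for c in consumers}
    leftover = []
    for p in partitions:
        owner = previous.get(p)
        if owner in assignment and len(assignment[owner]) < cap:
            assignment[owner].append(p)   # sticky: partition stays put
        else:
            leftover.append(p)            # owner gone, or owner already at capacity
    for p in leftover:
        target = min(consumers, key=lambda c: len(assignment[c]))
        assignment[target].append(p)
    return assignment
```

With 6 partitions and a third consumer joining a two-consumer group, only two partitions move; under an eager rebalance, all six would be revoked and reassigned.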

Offset Management

Consumers commit offsets to record their progress: a committed offset of N means "everything before N has been processed; resume fetching at N," so the convention is to commit last-processed + 1. Offsets are stored in an internal compacted topic (__consumer_offsets). On consumer restart or rebalance, consumers fetch the committed offset for their assigned partitions and resume from there. Auto-commit (enable.auto.commit=true) commits the offsets returned by the most recent poll every auto.commit.interval.ms; this risks duplicate processing if the consumer crashes after processing but before the next commit, and message loss if it crashes after a commit but before the fetched records are fully processed. Manual commit lets the application explicitly call commitSync() or commitAsync() after confirming messages are fully processed, giving precise control over delivery semantics.
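The commit/resume convention can be sketched with a minimal offset store (an in-memory dict standing in for __consumer_offsets):

```python
class OffsetStore:
    """Sketch of offset commit semantics: the committed offset is the
    NEXT offset to fetch, i.e. last-processed + 1."""

    def __init__(self):
        self.committed = {}  # (group, partition) -> next offset to read

    def commit(self, group, partition, last_processed):
        self.committed[(group, partition)] = last_processed + 1

    def position(self, group, partition, default=0):
        # Where a restarted consumer resumes; `default` models
        # auto.offset.reset for a partition with no committed offset.
        return self.committed.get((group, partition), default)
```

The at-least-once behavior falls out directly: if a consumer processes offsets 5 through 7 but crashes before committing, the store still says "resume at 5," so those messages are delivered again after restart.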

At-Least-Once vs. Exactly-Once

At-least-once delivery is the default: commit offsets after processing, so any crash before commit causes re-delivery of the same messages on restart. Consumers must be idempotent or use deduplication. Exactly-once semantics (EOS) in Kafka requires transactions. A transactional producer opens a transaction, writes to multiple partitions, then commits the transaction atomically. For the consume-transform-produce pattern: the consumer reads a batch, the producer writes results to an output topic and commits the input offsets — all in a single transaction. Downstream consumers reading the output topic set isolation.level=read_committed to see only committed data. EOS adds latency (~5-10ms extra per transaction commit) and is typically reserved for financial or inventory applications where duplicates cause real harm.
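The atomicity at the heart of consume-transform-produce can be sketched as a transaction object that buffers both the output records and the input offset commit, making them visible together or not at all. This is a toy model of the semantics (lists and dicts standing in for topics and __consumer_offsets), not the real two-phase transaction protocol:

```python
class Transaction:
    """Sketch: output records and the input-offset commit become
    visible atomically, or not at all."""

    def __init__(self, output_topic, offset_store):
        self.output_topic = output_topic    # list standing in for the output topic
        self.offset_store = offset_store    # dict: partition -> next offset to read
        self.pending = []
        self.pending_offsets = {}

    def send(self, record):
        self.pending.append(record)         # buffered, not yet visible

    def send_offsets(self, partition, last_processed):
        self.pending_offsets[partition] = last_processed + 1

    def commit(self):
        # Both effects land together; a read_committed consumer never
        # sees the output records without the matching offset commit.
        self.output_topic.extend(self.pending)
        self.offset_store.update(self.pending_offsets)
        self.pending.clear()
        self.pending_offsets.clear()

    def abort(self):
        # Neither effect lands; the input will be re-read and re-processed.
        self.pending.clear()
        self.pending_offsets.clear()
```

The key property: a crash between processing and commit behaves like `abort()`, so the transform is replayed and the output topic never contains results for unconsumed (or doubly consumed) input.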

Back-Pressure

When consumers are slower than producers, the lag (difference between the latest produced offset and the consumer’s committed offset) grows. Unbounded lag means consumers fall arbitrarily behind, serving increasingly stale data and, once lag exceeds the retention window, losing messages that are deleted before they are ever read. Back-pressure mechanisms:

  • Fetch limits: max.poll.records caps records per poll call; fetch.max.bytes caps network transfer. Reducing these slows the consumer’s ingestion rate to match its processing speed.
  • Consumer pause: call consumer.pause(partitions) to stop fetching from specific partitions without leaving the group. Resume with consumer.resume() when the processing queue drains below a threshold.
  • Producer quotas: brokers enforce per-producer byte-rate quotas. Producers exceeding their quota are throttled with a delay response; this propagates back-pressure to the source.
  • Scaling consumers: add more consumers to the group (up to the number of partitions) to increase parallel processing throughput.
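The pause/resume mechanism above is commonly driven by high/low watermarks on the consumer's in-process queue. A hypothetical sketch (the class and thresholds are invented; a real client would call `consumer.pause(partitions)` and `consumer.resume(partitions)` where the comments indicate):

```python
class WatermarkConsumer:
    """Sketch of watermark-driven back-pressure: stop fetching above a
    high-water mark, resume below a low-water mark (hysteresis avoids
    rapid pause/resume flapping)."""

    def __init__(self, high=100, low=20):
        self.high, self.low = high, low
        self.paused = False
        self.queue = []          # records fetched but not yet processed

    def on_records(self, records):
        if not self.paused:
            self.queue.extend(records)
        if len(self.queue) >= self.high:
            self.paused = True   # real client: consumer.pause(partitions)

    def process(self, n):
        del self.queue[:n]       # downstream work drains the queue
        if self.paused and len(self.queue) <= self.low:
            self.paused = False  # real client: consumer.resume(partitions)
```

Pausing (rather than simply not calling poll) matters because the consumer must keep polling to send heartbeats; otherwise the coordinator assumes it died and triggers a rebalance.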

Replication

Each partition has one leader and N-1 follower replicas distributed across brokers. Producers write exclusively to the leader; followers fetch from the leader in parallel. The In-Sync Replica (ISR) set contains replicas that have caught up within replica.lag.time.max.ms. If a follower falls behind (e.g., due to GC pause or network partition), it is removed from ISR. With acks=all, the leader only acknowledges after all ISR members confirm the write — so min.insync.replicas sets the minimum ISR size required for the write to succeed (typically 2 for a replication factor of 3). On leader failure, the controller elects a new leader from the ISR, guaranteeing no data loss for acks=all writes. Unclean leader election (electing an out-of-ISR replica) trades availability for potential data loss.

Compacted Topics

Compacted topics use an alternative retention policy: instead of time- or size-based deletion, the broker retains only the latest message per key. Older messages with the same key are garbage-collected by the log cleaner, a background thread that rewrites segment files, dropping any record superseded by a newer value for the same key. A tombstone message (a record with a non-null key and a null value) signals deletion: after compaction, even the tombstone is eventually removed. Compacted topics are the foundation of Kafka Streams’ changelog topics: each state store’s mutations are logged to a compacted topic keyed by state key. A new instance of the application can reconstruct full state by replaying the compacted topic from the beginning — reading only the latest value per key — in O(distinct keys) time rather than O(total events). This pattern is also used for CDC (Change Data Capture) sinks, configuration stores, and event sourcing projections.
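The compaction pass itself reduces to "keep the newest record per key." A minimal sketch (simplified in one respect: it drops tombstones immediately, whereas a real cleaner retains them for a grace period so lagging consumers can observe the deletion):

```python
def compact(records):
    """Sketch of log compaction: keep only the newest record per key.
    `records` is a list of (key, value) pairs in offset order;
    a (key, None) pair is a tombstone that removes the key."""
    latest = {}
    for offset, (key, value) in enumerate(records):
        latest[key] = (offset, value)        # later records overwrite earlier ones
    # Drop tombstones (simplified: real cleaners keep them for a grace period),
    # and return survivors in their original offset order.
    kept = sorted((off, k, v) for k, (off, v) in latest.items() if v is not None)
    return [(k, v) for _, k, v in kept]
```

Replaying the compacted result rebuilds exactly the state a key-value store would hold after applying every original event, which is what makes it suitable for changelog and CDC topics.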
