Message Broker Low-Level Design: Topic Partitioning, Consumer Groups, Offset Management, and Durability

A message broker decouples producers from consumers by providing a durable, ordered, partitioned log. Producers append messages to topics; consumers read from topics at their own pace, tracked by offset. The key design decisions — partitioning, replication, offset management, and storage — determine throughput, durability, and consumer scalability.

Topic and Partition Model

A topic is the logical message stream. Each topic is split into N partitions. Each partition is an independent, ordered, append-only log. Partitioning enables:

  • Parallelism: multiple consumers in a group each own a subset of partitions, processing in parallel.
  • Ordering guarantee: messages within a partition are strictly ordered by offset; across partitions, ordering is not guaranteed.
  • Horizontal scaling: more partitions → more throughput (more parallel writes and reads).

Each message in a partition has a monotonically increasing offset. Offsets are the cursor used by consumers to track their position.

Message Format

Each message record in a partition contains:

  • Offset: 64-bit monotonically increasing integer within the partition.
  • Timestamp: producer-assigned or broker-assigned.
  • Key: optional bytes; used for partition routing (same key → same partition) and log compaction.
  • Value: payload bytes (Avro, Protobuf, JSON — schema is external to the broker).
  • Headers: key-value metadata pairs (e.g., trace_id, content_type, correlation_id).

Messages are batched into record batches for efficient network transfer and disk writes. A batch is compressed as a unit (Snappy, LZ4, or ZSTD) before storage.

Producer: Partition Assignment and Acks

Partition routing:

  • Key-based: partition = hash(key) % num_partitions. Guarantees all messages for a key go to the same partition (ordering).
  • Round-robin: used when key is null; distributes load evenly.
  • Custom partitioner: application-defined logic (e.g., geo-based routing).

Acknowledgment modes (acks):

  • acks=0: fire-and-forget; producer does not wait for broker ACK. Maximum throughput, zero durability guarantee.
  • acks=1: leader ACK; producer waits until the partition leader has written the message. Fast, but data can be lost if the leader fails before replication.
  • acks=all: all ISR replicas must ACK. Maximum durability; slightly higher latency due to replication round-trip.

Consumer Group and Offset Management

A consumer group allows multiple consumers to jointly consume a topic. Each partition is assigned to exactly one consumer in the group at any time. If a group has fewer consumers than partitions, some consumers handle multiple partitions. If more consumers than partitions, some consumers are idle.

Offset commit semantics:

  • At-least-once: consumer processes messages, then commits offset. If consumer crashes between process and commit, messages are re-read on restart. Requires idempotent downstream processing.
  • At-most-once: consumer commits offset before processing. If crash occurs, messages are skipped. Avoids duplicates but risks message loss.
  • Exactly-once: transactional API — downstream write and offset commit are atomic via Kafka's transaction coordinator.

Log-Structured Storage

Each partition is stored as a series of segments on disk. The active segment is the one currently being appended to; older segments are immutable.

Each segment consists of:

  • .log file: the raw message data (record batches, sequential append).
  • .index file: sparse offset-to-file-position index. Enables binary search for a given offset without scanning the entire log file.
  • .timeindex file: sparse timestamp-to-offset index for time-based lookup (e.g., “give me all messages after 14:00”).

Retention: segments are deleted when they exceed either the time-based retention limit (e.g., 7 days) or the size-based limit (e.g., 50 GB per partition). Log compaction retains only the latest message per key, reducing storage for changelog topics.

Replication: Leader and ISR

Each partition has one leader and R-1 followers. All reads and writes go through the leader. Followers fetch messages from the leader and write them to their own logs. A follower in the ISR is within replica.lag.time.max.ms of the leader.

Leader election: managed by the broker controller (Zookeeper-based or KRaft). On leader failure, the controller elects the first ISR member as new leader. No data loss because ISR replicas are fully caught up.

SQL DDL: Metadata Tables

CREATE TABLE Topic (
    id                BIGSERIAL PRIMARY KEY,
    name              VARCHAR(255)  NOT NULL UNIQUE,
    partition_count   INTEGER       NOT NULL DEFAULT 3,
    replication_factor INTEGER      NOT NULL DEFAULT 3,
    retention_ms      BIGINT        NOT NULL DEFAULT 604800000  -- 7 days
);

CREATE TABLE ConsumerGroup (
    id          BIGSERIAL PRIMARY KEY,
    group_id    VARCHAR(255)  NOT NULL UNIQUE,
    protocol    VARCHAR(32)   NOT NULL DEFAULT 'range'
);

-- Current offset per consumer group per partition
CREATE TABLE PartitionAssignment (
    group_id    VARCHAR(255)  NOT NULL,
    topic       VARCHAR(255)  NOT NULL,
    partition   INTEGER       NOT NULL,
    consumer_id VARCHAR(255),
    "offset"    BIGINT        NOT NULL DEFAULT 0,
    PRIMARY KEY (group_id, topic, partition)
);

Python: Core Operations

from confluent_kafka import Producer, Consumer, KafkaError
from confluent_kafka.admin import AdminClient, NewTopic
import json

BOOTSTRAP_SERVERS = 'localhost:9092'

admin = AdminClient({'bootstrap.servers': BOOTSTRAP_SERVERS})

def produce(topic: str, key: str, value: dict, acks: str = 'all') -> None:
    """Produce a single message with the given key and JSON value."""
    producer = Producer({
        'bootstrap.servers': BOOTSTRAP_SERVERS,
        'acks': acks,
        'enable.idempotence': True,   # exactly-once producer
        'compression.type': 'lz4',
        'linger.ms': 5,               # batch up to 5ms for throughput
    })

    def delivery_report(err, msg):
        if err:
            print(f"Delivery failed: {err}")
        else:
            print(f"Delivered to {msg.topic()} [{msg.partition()}] @ offset {msg.offset()}")

    producer.produce(
        topic,
        key=key.encode('utf-8'),
        value=json.dumps(value).encode('utf-8'),
        callback=delivery_report
    )
    producer.flush()

def consume(group_id: str, topics: list[str], max_messages: int = 10) -> list[dict]:
    """Consume messages from topics using consumer group semantics."""
    consumer = Consumer({
        'bootstrap.servers': BOOTSTRAP_SERVERS,
        'group.id': group_id,
        'auto.offset.reset': 'earliest',
        'enable.auto.commit': False,   # manual commit for at-least-once control
    })
    consumer.subscribe(topics)
    messages = []
    try:
        while len(messages)  None:
    """Manually commit a specific offset for a partition."""
    from confluent_kafka import TopicPartition
    consumer = Consumer({'bootstrap.servers': BOOTSTRAP_SERVERS, 'group.id': group_id})
    consumer.commit(offsets=[TopicPartition(topic, partition, offset + 1)], asynchronous=False)
    consumer.close()

Design Considerations Summary

  • Partition key design: use business entity IDs (user_id, order_id) for ordering; avoid low-cardinality keys that cause hot partitions.
  • ISR and acks: acks=all + min.insync.replicas=2 gives strong durability with one replica allowed to lag.
  • At-least-once vs exactly-once: at-least-once is simpler and sufficient when consumers are idempotent; EOS adds complexity but eliminates duplicates for financial or counting workloads.
  • Consumer rebalance: use cooperative incremental rebalance and static group membership to reduce rebalance disruption.
  • Retention: size-based retention is more predictable than time-based for high-throughput topics; log compaction for changelog/event-sourcing topics.

See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering

See also: Anthropic Interview Guide 2026: Process, Questions, and AI Safety

See also: Databricks Interview Guide 2026: Spark Internals, Delta Lake, and Lakehouse Architecture

See also: Uber Interview Guide 2026: Dispatch Systems, Geospatial Algorithms, and Marketplace Engineering

See also: Shopify Interview Guide

See also: Netflix Interview Guide 2026: Streaming Architecture, Recommendation Systems, and Engineering Excellence

Scroll to Top