A message broker decouples producers from consumers by providing a durable, ordered, partitioned log. Producers append messages to topics; consumers read from topics at their own pace, tracked by offset. The key design decisions — partitioning, replication, offset management, and storage — determine throughput, durability, and consumer scalability.
Topic and Partition Model
A topic is the logical message stream. Each topic is split into N partitions. Each partition is an independent, ordered, append-only log. Partitioning enables:
- Parallelism: multiple consumers in a group each own a subset of partitions, processing in parallel.
- Ordering guarantee: messages within a partition are strictly ordered by offset; across partitions, ordering is not guaranteed.
- Horizontal scaling: more partitions → more throughput (more parallel writes and reads).
Each message in a partition has a monotonically increasing offset. Offsets are the cursor used by consumers to track their position.
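The partition and offset model above can be sketched as a minimal in-memory structure (illustration only; a real broker persists segments to disk and replicates them):

```python
class Partition:
    """An independent, ordered, append-only log. Index in the list == offset."""

    def __init__(self):
        self.log = []

    def append(self, message) -> int:
        """Append a message and return its offset (monotonically increasing)."""
        self.log.append(message)
        return len(self.log) - 1

    def read(self, offset: int, max_messages: int = 100):
        """Read up to max_messages starting at the given offset (the consumer's cursor)."""
        return self.log[offset:offset + max_messages]

class Topic:
    """A logical stream split into N independent partitions."""

    def __init__(self, num_partitions: int = 3):
        self.partitions = [Partition() for _ in range(num_partitions)]
```

Note that ordering exists only within a single Partition instance; nothing coordinates offsets across partitions.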
Message Format
Each message record in a partition contains:
- Offset: 64-bit monotonically increasing integer within the partition.
- Timestamp: producer-assigned or broker-assigned.
- Key: optional bytes; used for partition routing (same key → same partition) and log compaction.
- Value: payload bytes (Avro, Protobuf, JSON — schema is external to the broker).
- Headers: key-value metadata pairs (e.g., trace_id, content_type, correlation_id).
Messages are batched into record batches for efficient network transfer and disk writes. A batch is compressed as a unit (Snappy, LZ4, or ZSTD) before storage.
Producer: Partition Assignment and Acks
Partition routing:
- Key-based: partition = hash(key) % num_partitions. Guarantees all messages for a key go to the same partition (ordering).
- Round-robin: used when key is null; distributes load evenly.
- Custom partitioner: application-defined logic (e.g., geo-based routing).
Acknowledgment modes (acks):
- acks=0: fire-and-forget; producer does not wait for broker ACK. Maximum throughput, zero durability guarantee.
- acks=1: leader ACK; producer waits until the partition leader has written the message. Fast, but data can be lost if the leader fails before replication.
- acks=all: all ISR replicas must ACK. Maximum durability; slightly higher latency due to replication round-trip.
Consumer Group and Offset Management
A consumer group allows multiple consumers to jointly consume a topic. Each partition is assigned to exactly one consumer in the group at any time. If a group has fewer consumers than partitions, some consumers handle multiple partitions. If more consumers than partitions, some consumers are idle.
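The assignment rule above can be sketched for the 'range' strategy on a single topic (simplified; the real assignor works per topic across all subscriptions):

```python
def range_assign(consumers: list[str], num_partitions: int) -> dict[str, list[int]]:
    """Sort consumers, give each a contiguous block of partitions; the first
    (num_partitions % len(consumers)) consumers get one extra partition.
    Consumers beyond num_partitions end up with an empty assignment (idle)."""
    members = sorted(consumers)
    per, extra = num_partitions // len(members), num_partitions % len(members)
    assignment, start = {}, 0
    for i, consumer in enumerate(members):
        count = per + (1 if i < extra else 0)
        assignment[consumer] = list(range(start, start + count))
        start += count
    return assignment
```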
Offset commit semantics:
- At-least-once: consumer processes messages, then commits offset. If consumer crashes between process and commit, messages are re-read on restart. Requires idempotent downstream processing.
- At-most-once: consumer commits offset before processing. If crash occurs, messages are skipped. Avoids duplicates but risks message loss.
- Exactly-once: transactional API — downstream write and offset commit are atomic via Kafka's transaction coordinator.
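The difference between the first two semantics is purely the order of processing and committing, which a small sketch makes concrete (poll, process, and commit are hypothetical callables standing in for the consumer API):

```python
def consume_at_least_once(poll, process, commit):
    """Process, then commit: a crash between the two re-delivers the message
    on restart (duplicates possible, no loss)."""
    msg = poll()
    process(msg)
    commit(msg)

def consume_at_most_once(poll, process, commit):
    """Commit, then process: a crash between the two skips the message
    on restart (loss possible, no duplicates)."""
    msg = poll()
    commit(msg)
    process(msg)
```

Simulating a crash during processing shows why at-least-once requires idempotent consumers: the uncommitted message comes back, while at-most-once silently drops it.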
Log-Structured Storage
Each partition is stored as a series of segments on disk. The active segment is the one currently being appended to; older segments are immutable.
Each segment consists of:
- .log file: the raw message data (record batches, sequential append).
- .index file: sparse offset-to-file-position index. Enables binary search for a given offset without scanning the entire log file.
- .timeindex file: sparse timestamp-to-offset index for time-based lookup (e.g., “give me all messages after 14:00”).
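The sparse-index lookup can be sketched as a binary search over (offset, file_position) entries (illustrative only; entry spacing and on-disk layout differ in a real broker):

```python
import bisect

def locate(sparse_index: list[tuple[int, int]], target_offset: int) -> int:
    """Find the file position of the last index entry at or before target_offset.
    sparse_index holds (offset, file_position) pairs sampled every few KB;
    the broker then scans the .log file forward from the returned position."""
    offsets = [offset for offset, _ in sparse_index]
    i = bisect.bisect_right(offsets, target_offset) - 1
    return sparse_index[max(i, 0)][1]
```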
Retention: segments are deleted when they exceed either the time-based retention limit (e.g., 7 days) or the size-based limit (e.g., 50 GB per partition). Log compaction retains only the latest message per key, reducing storage for changelog topics.
Replication: Leader and ISR
Each partition has one leader and R-1 followers. All reads and writes go through the leader. Followers fetch messages from the leader and write them to their own logs. A follower in the ISR is within replica.lag.time.max.ms of the leader.
Leader election: managed by the broker controller (ZooKeeper-based or KRaft). On leader failure, the controller elects a new leader from the ISR. No committed data is lost, because every ISR member holds all messages that were acknowledged under acks=all.
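ISR membership can be sketched as a function of follower fetch progress (simplified; a real broker tracks when each follower was last fully caught up to the leader's log end offset):

```python
def current_isr(followers: dict[str, float], now_ms: float,
                max_lag_ms: float = 10_000) -> set[str]:
    """Return the replica ids still in-sync: those whose last caught-up
    timestamp is within replica.lag.time.max.ms (default 10 s) of now.
    followers maps replica id -> last caught-up timestamp in ms."""
    return {replica_id for replica_id, last_caught_up in followers.items()
            if now_ms - last_caught_up <= max_lag_ms}
```

A follower that falls outside the window drops out of the set, shrinking ISR and, with min.insync.replicas configured, potentially blocking acks=all writes.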
SQL DDL: Metadata Tables
CREATE TABLE Topic (
id BIGSERIAL PRIMARY KEY,
name VARCHAR(255) NOT NULL UNIQUE,
partition_count INTEGER NOT NULL DEFAULT 3,
replication_factor INTEGER NOT NULL DEFAULT 3,
retention_ms BIGINT NOT NULL DEFAULT 604800000 -- 7 days
);
CREATE TABLE ConsumerGroup (
id BIGSERIAL PRIMARY KEY,
group_id VARCHAR(255) NOT NULL UNIQUE,
protocol VARCHAR(32) NOT NULL DEFAULT 'range'
);
-- Current offset per consumer group per partition
CREATE TABLE PartitionAssignment (
group_id VARCHAR(255) NOT NULL,
topic VARCHAR(255) NOT NULL,
partition INTEGER NOT NULL,
consumer_id VARCHAR(255),
"offset" BIGINT NOT NULL DEFAULT 0,
PRIMARY KEY (group_id, topic, partition)
);
Python: Core Operations
from confluent_kafka import Producer, Consumer, KafkaError
from confluent_kafka.admin import AdminClient, NewTopic
import json
BOOTSTRAP_SERVERS = 'localhost:9092'
admin = AdminClient({'bootstrap.servers': BOOTSTRAP_SERVERS})
def produce(topic: str, key: str, value: dict, acks: str = 'all') -> None:
    """Produce a single message with the given key and JSON value."""
    producer = Producer({
        'bootstrap.servers': BOOTSTRAP_SERVERS,
        'acks': acks,
        'enable.idempotence': True,   # broker deduplicates retried sends
        'compression.type': 'lz4',
        'linger.ms': 5,               # batch for up to 5 ms for throughput
    })

    def delivery_report(err, msg):
        if err:
            print(f"Delivery failed: {err}")
        else:
            print(f"Delivered to {msg.topic()} [{msg.partition()}] @ offset {msg.offset()}")

    producer.produce(
        topic,
        key=key.encode('utf-8'),
        value=json.dumps(value).encode('utf-8'),
        callback=delivery_report,
    )
    producer.flush()

def consume(group_id: str, topics: list[str], max_messages: int = 10) -> list[dict]:
    """Consume messages from topics using consumer group semantics."""
    consumer = Consumer({
        'bootstrap.servers': BOOTSTRAP_SERVERS,
        'group.id': group_id,
        'auto.offset.reset': 'earliest',
        'enable.auto.commit': False,  # manual commit for at-least-once control
    })
    consumer.subscribe(topics)
    messages = []
    try:
        while len(messages) < max_messages:
            msg = consumer.poll(timeout=1.0)
            if msg is None:
                continue
            if msg.error():
                if msg.error().code() == KafkaError._PARTITION_EOF:
                    continue
                raise RuntimeError(msg.error())
            messages.append(json.loads(msg.value().decode('utf-8')))
            consumer.commit(message=msg, asynchronous=False)  # commit after processing
    finally:
        consumer.close()
    return messages

def commit_offset(group_id: str, topic: str, partition: int, offset: int) -> None:
    """Manually commit a specific offset for a partition."""
    from confluent_kafka import TopicPartition
    consumer = Consumer({'bootstrap.servers': BOOTSTRAP_SERVERS, 'group.id': group_id})
    # Commit offset + 1: the committed offset is the next message to consume.
    consumer.commit(offsets=[TopicPartition(topic, partition, offset + 1)], asynchronous=False)
    consumer.close()
Design Considerations Summary
- Partition key design: use business entity IDs (user_id, order_id) for ordering; avoid low-cardinality keys that cause hot partitions.
- ISR and acks: acks=all + min.insync.replicas=2 gives strong durability with one replica allowed to lag.
- At-least-once vs exactly-once: at-least-once is simpler and sufficient when consumers are idempotent; EOS adds complexity but eliminates duplicates for financial or counting workloads.
- Consumer rebalance: use cooperative incremental rebalance and static group membership to reduce rebalance disruption.
- Retention: size-based retention is more predictable than time-based for high-throughput topics; log compaction for changelog/event-sourcing topics.