A message broker decouples producers from consumers by providing a durable, ordered, partitioned log. Producers append messages to topics; consumers read from topics at their own pace, tracked by offset. The key design decisions — partitioning, replication, offset management, and storage — determine throughput, durability, and consumer scalability.
Topic and Partition Model
A topic is the logical message stream. Each topic is split into N partitions. Each partition is an independent, ordered, append-only log. Partitioning enables:
- Parallelism: multiple consumers in a group each own a subset of partitions, processing in parallel.
- Ordering guarantee: messages within a partition are strictly ordered by offset; across partitions, ordering is not guaranteed.
- Horizontal scaling: more partitions → more throughput (more parallel writes and reads).
Each message in a partition has a monotonically increasing offset. Offsets are the cursor used by consumers to track their position.
Message Format
Each message record in a partition contains:
- Offset: 64-bit monotonically increasing integer within the partition.
- Timestamp: producer-assigned or broker-assigned.
- Key: optional bytes; used for partition routing (same key → same partition) and log compaction.
- Value: payload bytes (Avro, Protobuf, JSON — schema is external to the broker).
- Headers: key-value metadata pairs (e.g., trace_id, content_type, correlation_id).
Messages are batched into record batches for efficient network transfer and disk writes. A batch is compressed as a unit (Snappy, LZ4, or ZSTD) before storage.
Producer: Partition Assignment and Acks
Partition routing:
- Key-based:
partition = hash(key) % num_partitions. Guarantees all messages for a key go to the same partition (ordering). - Round-robin: used when key is null; distributes load evenly.
- Custom partitioner: application-defined logic (e.g., geo-based routing).
Acknowledgment modes (acks):
- acks=0: fire-and-forget; producer does not wait for broker ACK. Maximum throughput, zero durability guarantee.
- acks=1: leader ACK; producer waits until the partition leader has written the message. Fast, but data can be lost if the leader fails before replication.
- acks=all: all ISR replicas must ACK. Maximum durability; slightly higher latency due to replication round-trip.
Consumer Group and Offset Management
A consumer group allows multiple consumers to jointly consume a topic. Each partition is assigned to exactly one consumer in the group at any time. If a group has fewer consumers than partitions, some consumers handle multiple partitions. If more consumers than partitions, some consumers are idle.
Offset commit semantics:
- At-least-once: consumer processes messages, then commits offset. If consumer crashes between process and commit, messages are re-read on restart. Requires idempotent downstream processing.
- At-most-once: consumer commits offset before processing. If crash occurs, messages are skipped. Avoids duplicates but risks message loss.
- Exactly-once: transactional API — downstream write and offset commit are atomic via Kafka's transaction coordinator.
Log-Structured Storage
Each partition is stored as a series of segments on disk. The active segment is the one currently being appended to; older segments are immutable.
Each segment consists of:
- .log file: the raw message data (record batches, sequential append).
- .index file: sparse offset-to-file-position index. Enables binary search for a given offset without scanning the entire log file.
- .timeindex file: sparse timestamp-to-offset index for time-based lookup (e.g., “give me all messages after 14:00”).
Retention: segments are deleted when they exceed either the time-based retention limit (e.g., 7 days) or the size-based limit (e.g., 50 GB per partition). Log compaction retains only the latest message per key, reducing storage for changelog topics.
Replication: Leader and ISR
Each partition has one leader and R-1 followers. All reads and writes go through the leader. Followers fetch messages from the leader and write them to their own logs. A follower in the ISR is within replica.lag.time.max.ms of the leader.
Leader election: managed by the broker controller (Zookeeper-based or KRaft). On leader failure, the controller elects the first ISR member as new leader. No data loss because ISR replicas are fully caught up.
SQL DDL: Metadata Tables
CREATE TABLE Topic (
id BIGSERIAL PRIMARY KEY,
name VARCHAR(255) NOT NULL UNIQUE,
partition_count INTEGER NOT NULL DEFAULT 3,
replication_factor INTEGER NOT NULL DEFAULT 3,
retention_ms BIGINT NOT NULL DEFAULT 604800000 -- 7 days
);
CREATE TABLE ConsumerGroup (
id BIGSERIAL PRIMARY KEY,
group_id VARCHAR(255) NOT NULL UNIQUE,
protocol VARCHAR(32) NOT NULL DEFAULT 'range'
);
-- Current offset per consumer group per partition
CREATE TABLE PartitionAssignment (
group_id VARCHAR(255) NOT NULL,
topic VARCHAR(255) NOT NULL,
partition INTEGER NOT NULL,
consumer_id VARCHAR(255),
"offset" BIGINT NOT NULL DEFAULT 0,
PRIMARY KEY (group_id, topic, partition)
);
Python: Core Operations
from confluent_kafka import Producer, Consumer, KafkaError
from confluent_kafka.admin import AdminClient, NewTopic
import json
BOOTSTRAP_SERVERS = 'localhost:9092'
admin = AdminClient({'bootstrap.servers': BOOTSTRAP_SERVERS})
def produce(topic: str, key: str, value: dict, acks: str = 'all') -> None:
"""Produce a single message with the given key and JSON value."""
producer = Producer({
'bootstrap.servers': BOOTSTRAP_SERVERS,
'acks': acks,
'enable.idempotence': True, # exactly-once producer
'compression.type': 'lz4',
'linger.ms': 5, # batch up to 5ms for throughput
})
def delivery_report(err, msg):
if err:
print(f"Delivery failed: {err}")
else:
print(f"Delivered to {msg.topic()} [{msg.partition()}] @ offset {msg.offset()}")
producer.produce(
topic,
key=key.encode('utf-8'),
value=json.dumps(value).encode('utf-8'),
callback=delivery_report
)
producer.flush()
def consume(group_id: str, topics: list[str], max_messages: int = 10) -> list[dict]:
"""Consume messages from topics using consumer group semantics."""
consumer = Consumer({
'bootstrap.servers': BOOTSTRAP_SERVERS,
'group.id': group_id,
'auto.offset.reset': 'earliest',
'enable.auto.commit': False, # manual commit for at-least-once control
})
consumer.subscribe(topics)
messages = []
try:
while len(messages) None:
"""Manually commit a specific offset for a partition."""
from confluent_kafka import TopicPartition
consumer = Consumer({'bootstrap.servers': BOOTSTRAP_SERVERS, 'group.id': group_id})
consumer.commit(offsets=[TopicPartition(topic, partition, offset + 1)], asynchronous=False)
consumer.close()
Design Considerations Summary
- Partition key design: use business entity IDs (user_id, order_id) for ordering; avoid low-cardinality keys that cause hot partitions.
- ISR and acks: acks=all + min.insync.replicas=2 gives strong durability with one replica allowed to lag.
- At-least-once vs exactly-once: at-least-once is simpler and sufficient when consumers are idempotent; EOS adds complexity but eliminates duplicates for financial or counting workloads.
- Consumer rebalance: use cooperative incremental rebalance and static group membership to reduce rebalance disruption.
- Retention: size-based retention is more predictable than time-based for high-throughput topics; log compaction for changelog/event-sourcing topics.
See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering
See also: Anthropic Interview Guide 2026: Process, Questions, and AI Safety
See also: Databricks Interview Guide 2026: Spark Internals, Delta Lake, and Lakehouse Architecture
See also: Uber Interview Guide 2026: Dispatch Systems, Geospatial Algorithms, and Marketplace Engineering
See also: Shopify Interview Guide