Question 1

How should you choose the partition key for a Kafka topic?

Accepted Answer

The partition key determines which partition a message is routed to: the producer hashes the key and takes the result modulo the partition count. As long as the partition count does not change, messages with the same key always land in the same partition, which guarantees ordering within that key. Choose a key that: (1) groups messages that must be processed in order (e.g., user_id for user events, order_id for an order state machine); (2) distributes load evenly across partitions. Avoid low-cardinality keys such as status codes, which route most traffic to a few partitions. If ordering is not required, send messages without a key (the producer then spreads them across partitions, round-robin or in sticky batches depending on the client version) or use a random UUID for maximum parallelism. Watch for partition hot-spots when popular keys (e.g., a viral user) receive disproportionately high message rates.
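The routing rule above can be sketched in a few lines of Python. This is a minimal illustration, not Kafka's implementation: the function name partition_for is hypothetical, and CRC32 stands in for the hash (Kafka's default partitioner uses murmur2).

```python
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition: hash(key) mod partition count."""
    # CRC32 is a stand-in hash for illustration only; Kafka uses murmur2.
    return zlib.crc32(key) % num_partitions

NUM_PARTITIONS = 6
events = [
    ("user-1", "login"),
    ("user-2", "login"),
    ("user-1", "add_to_cart"),
    ("user-1", "checkout"),
]

# Route each event by user_id; every partition keeps its arrival order.
partitions = {p: [] for p in range(NUM_PARTITIONS)}
for user_id, action in events:
    partitions[partition_for(user_id.encode(), NUM_PARTITIONS)].append((user_id, action))
```

All events for user-1 end up in one partition in the order they were sent, which is the per-key ordering guarantee. Changing NUM_PARTITIONS changes the mapping, so keys can move to other partitions after a resize.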

Question 2

What is ISR (In-Sync Replicas) and how does it affect durability?

Accepted Answer

ISR is the set of partition replicas, including the leader, that have caught up with the leader within a configurable lag threshold (replica.lag.time.max.ms; the default is 30 seconds since Kafka 2.5 and was 10 seconds before that). Only a replica in the ISR is eligible to be elected leader on failure, unless unclean leader election is enabled. Producer acks=all waits for all ISR replicas to acknowledge the write before returning success. An acknowledged write therefore survives a leader failure as long as at least one other ISR replica holds it, which is what min.insync.replicas >= 2 enforces. If a replica falls behind (e.g., due to network congestion or a slow disk), it is removed from the ISR. If acks=all is configured and the ISR shrinks below min.insync.replicas (for example, down to only the leader when min.insync.replicas=2), the broker rejects writes with NotEnoughReplicasException, a deliberate trade-off of availability for durability.
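The interaction between acks=all, the ISR, and min.insync.replicas can be modelled with a toy partition object. The sketch below is an assumption-laden simplification: the Partition class, its fields, and drop_follower are hypothetical names, and NotEnoughReplicas only stands in for the broker's NotEnoughReplicasException.

```python
from __future__ import annotations
from dataclasses import dataclass, field

class NotEnoughReplicas(Exception):
    """Stand-in for the broker's NotEnoughReplicasException."""

@dataclass
class Partition:
    leader: str
    isr: set[str]                      # in-sync replicas, leader included
    min_insync_replicas: int = 2
    log: list[bytes] = field(default_factory=list)

    def append(self, record: bytes, acks: str = "all") -> int:
        # With acks=all the write is refused when the ISR is too small.
        if acks == "all" and len(self.isr) < self.min_insync_replicas:
            raise NotEnoughReplicas(
                f"ISR size {len(self.isr)} < min.insync.replicas {self.min_insync_replicas}"
            )
        self.log.append(record)
        return len(self.log) - 1       # offset of the appended record

    def drop_follower(self, replica: str) -> None:
        # Models a follower lagging past the threshold and leaving the ISR.
        if replica != self.leader:
            self.isr.discard(replica)

p = Partition(leader="b1", isr={"b1", "b2", "b3"})
first_offset = p.append(b"order-created")   # accepted: ISR has 3 members
p.drop_follower("b2")
p.drop_follower("b3")
try:
    p.append(b"order-paid")                  # ISR is now just the leader
    rejected = False
except NotEnoughReplicas:
    rejected = True
```

In this model a producer using acks=1 would still be accepted after the ISR shrinks, which is exactly the durability gap that acks=all with min.insync.replicas closes.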

Question 3

What is the difference between at-least-once and exactly-once delivery semantics?

Accepted Answer

At-least-once delivery means every message is delivered to the consumer at least once, but may be delivered more than once if the consumer crashes after processing a message but before committing its offset. On restart, the consumer re-reads and re-processes messages from the last committed offset. Correctness therefore requires idempotent consumers (processing the same message twice has the same effect as processing it once). Exactly-once semantics (EOS) extend this with idempotent, transactional producers (writes to multiple partitions commit or abort as a single atomic transaction) and a transactional consume-process-produce loop (the consumer's offset commit is part of the same transaction as the output writes, and downstream consumers read with isolation.level=read_committed). Kafka's EOS uses a producer ID with an epoch and per-partition sequence numbers to deduplicate retried produces, and a two-phase commit driven by a transaction coordinator to make multi-partition writes atomic. The guarantee covers Kafka-to-Kafka pipelines; side effects in external systems still need idempotent handling.
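The sequence-number deduplication described above can be sketched as broker-side bookkeeping for a single partition. This is a simplified model, not Kafka's code: PartitionLog and the returned status strings are hypothetical, and real brokers track batches rather than single records.

```python
class PartitionLog:
    """Toy log for one partition that deduplicates retried produces."""

    def __init__(self) -> None:
        self.records: list[str] = []
        self.last_seq: dict[int, int] = {}   # producer_id -> last appended sequence
        self.epochs: dict[int, int] = {}     # producer_id -> highest epoch seen

    def append(self, producer_id: int, epoch: int, seq: int, value: str) -> str:
        current_epoch = self.epochs.get(producer_id, 0)
        if epoch < current_epoch:
            return "fenced"                  # an older producer instance is rejected
        if epoch > current_epoch:
            self.epochs[producer_id] = epoch
            self.last_seq.pop(producer_id, None)   # sequences restart per epoch
        last = self.last_seq.get(producer_id, -1)
        if seq <= last:
            return "duplicate"               # a retry of something already stored
        if seq != last + 1:
            return "out_of_order"            # a gap: an earlier send is missing
        self.records.append(value)
        self.last_seq[producer_id] = seq
        return "appended"

log = PartitionLog()
results = [
    log.append(7, 0, 0, "a"),   # first send
    log.append(7, 0, 0, "a"),   # network retry of the same record
    log.append(7, 0, 1, "b"),
]
```

A retried produce carries the same sequence number as the original, so the log stores it once even though the producer sent it twice.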

Question 4

What triggers a consumer group rebalance and what is its impact?

Accepted Answer

A consumer group rebalance is triggered when: a new consumer joins the group, an existing consumer leaves (gracefully, or because its heartbeats stop for longer than session.timeout.ms), a consumer fails to call poll() within max.poll.interval.ms, or the number of partitions in a subscribed topic changes. During a rebalance under the eager protocol, all consumers in the group stop processing, revoke all their partitions, and wait for the group coordinator to issue a new partition assignment. This stop-the-world pause can last seconds to tens of seconds and causes processing lag. Mitigations: (1) cooperative incremental rebalancing (e.g., CooperativeStickyAssignor): only partitions being moved are revoked; unaffected partitions continue processing; (2) static group membership: consumers configured with a group.instance.id can restart within the session timeout without triggering a rebalance; (3) tuning session.timeout.ms and heartbeat.interval.ms to avoid false timeout-triggered rebalances.
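The difference between the eager and cooperative protocols comes down to which partitions each consumer has to give up. The helper below is a hypothetical sketch (revoked_partitions is not a client API) that compares the two given an old and a new assignment.

```python
def revoked_partitions(
    old: dict[str, set[int]],
    new: dict[str, set[int]],
    protocol: str,
) -> dict[str, set[int]]:
    """Partitions each existing consumer must stop processing during a rebalance."""
    if protocol == "eager":
        # Eager: everyone gives up everything, then waits for the new assignment.
        return {consumer: set(parts) for consumer, parts in old.items()}
    if protocol == "cooperative":
        # Cooperative: only partitions that move to another consumer are revoked.
        return {consumer: parts - new.get(consumer, set()) for consumer, parts in old.items()}
    raise ValueError(f"unknown protocol: {protocol}")

# Consumer c3 joins a group that previously had two members.
old = {"c1": {0, 1, 2}, "c2": {3, 4, 5}}
new = {"c1": {0, 1}, "c2": {3, 4}, "c3": {2, 5}}

eager = revoked_partitions(old, new, "eager")
cooperative = revoked_partitions(old, new, "cooperative")
```

In this example the eager protocol pauses all six partitions, while the cooperative protocol pauses only partitions 2 and 5, the ones handed to c3.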

Question 5

How does partition key selection affect message ordering and parallelism?

Accepted Answer

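Messages with the same key are routed to the same partition, ensuring ordered delivery for that key. Messages with different keys may land in different partitions, which enables parallel consumption, but there is no ordering guarantee across partitions, and parallelism within a consumer group is capped at the partition count. A high-cardinality key (e.g., user_id) tends to distribute load evenly, provided no single key dominates the traffic.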
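The effect of key cardinality on load can be checked with a quick count of messages per partition. The sketch below is illustrative only: partition_load is a hypothetical helper, the key names are made up, and CRC32 is a stand-in for the broker's real hash.

```python
import zlib
from collections import Counter

def partition_load(keys: list[str], num_partitions: int) -> list[int]:
    """Count how many messages land in each partition for the given keys."""
    counts = Counter(zlib.crc32(k.encode()) % num_partitions for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]

NUM_PARTITIONS = 12

# Low cardinality: only three distinct keys, so at most three partitions get traffic.
low = partition_load([f"status-{i % 3}" for i in range(3000)], NUM_PARTITIONS)

# High cardinality: 3000 distinct keys spread over many partitions.
high = partition_load([f"user-{i}" for i in range(3000)], NUM_PARTITIONS)

busy_low = sum(1 for n in low if n > 0)
busy_high = sum(1 for n in high if n > 0)
```

With three distinct keys, at least nine of the twelve partitions (and the consumers assigned to them) sit idle no matter how many messages are produced.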

Question 6

What are the tradeoffs of producer ack levels?

Accepted Answer

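acks=0 (fire-and-forget) maximizes throughput but can lose messages silently, since the producer never learns whether a write reached the broker; acks=1 (leader ack) balances throughput and durability but loses acknowledged messages if the leader fails before followers replicate them; acks=all (ISR ack) maximizes durability but adds latency waiting for all in-sync replicas to confirm.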
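The latency side of this trade-off can be expressed as a toy model. The function below is a hypothetical sketch: it assumes each time is measured from the moment the request reaches the leader, and it ignores network round trips and batching.

```python
def ack_latency_ms(
    acks: str,
    leader_write_ms: float,
    follower_replicated_ms: list[float],
) -> float:
    """Time the producer waits for an acknowledgement under each acks level.

    follower_replicated_ms holds, for each ISR follower, the time at which
    that follower has the record (toy model, measured from request arrival).
    """
    if acks == "0":
        return 0.0                       # producer does not wait at all
    if acks == "1":
        return leader_write_ms           # only the leader's append counts
    if acks in ("all", "-1"):
        # The slowest in-sync replica sets the pace.
        return max([leader_write_ms, *follower_replicated_ms])
    raise ValueError(f"unknown acks value: {acks}")

fire_and_forget = ack_latency_ms("0", 2.0, [5.0, 9.0])
leader_only = ack_latency_ms("1", 2.0, [5.0, 9.0])
all_isr = ack_latency_ms("all", 2.0, [5.0, 9.0])
```

Under acks=all a single slow follower raises latency for every write until it is dropped from the ISR, which is why the lag threshold and ISR size matter for tail latency.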

Question 7

How does consumer group rebalancing work?

Accepted Answer

When a consumer joins or leaves, the group coordinator triggers a rebalance; partitions are redistributed among active consumers using a partition assignment strategy (range, round-robin, sticky); during the rebalance, consumption pauses for all partitions under the eager protocol, or only for the partitions that move under the cooperative protocol.
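Two of the assignment strategies named above are simple enough to sketch for a single topic. These functions are hypothetical illustrations of the idea, not the client library's assignor classes, and they assume consumers are ordered by member name.

```python
def range_assign(consumers: list[str], num_partitions: int) -> dict[str, list[int]]:
    """Give each consumer a contiguous block; the first ones absorb the remainder."""
    members = sorted(consumers)
    per_member, extra = divmod(num_partitions, len(members))
    assignment, start = {}, 0
    for i, member in enumerate(members):
        count = per_member + (1 if i < extra else 0)
        assignment[member] = list(range(start, start + count))
        start += count
    return assignment

def round_robin_assign(consumers: list[str], num_partitions: int) -> dict[str, list[int]]:
    """Deal partitions out one at a time, like cards."""
    members = sorted(consumers)
    assignment = {member: [] for member in members}
    for partition in range(num_partitions):
        assignment[members[partition % len(members)]].append(partition)
    return assignment

by_range = range_assign(["c1", "c2", "c3"], 7)
by_round_robin = round_robin_assign(["c1", "c2", "c3"], 7)
```

A sticky strategy would additionally take the previous assignment as input and try to keep each partition with its current owner, which reduces the state that has to be rebuilt after a rebalance.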

Question 8

How is log retention implemented?

Accepted Answer

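Each topic has a configurable retention.ms (time-based) and/or retention.bytes (size-based, applied per partition); a background retention task periodically deletes the oldest closed segment files that fall entirely outside the retention boundary, and never deletes the active segment. In Kafka this task is separate from the component named the log cleaner, which handles compaction.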
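The segment-level deletion rule can be sketched as a pure function over segment metadata. The Segment fields and segments_to_delete are hypothetical names; the sketch assumes segments are listed oldest first and that the last one is the active segment.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Segment:
    base_offset: int
    size_bytes: int
    max_timestamp_ms: int   # timestamp of the newest record in the segment

def segments_to_delete(
    segments: list[Segment],
    now_ms: int,
    retention_ms: Optional[int] = None,
    retention_bytes: Optional[int] = None,
) -> list[Segment]:
    """Pick the oldest closed segments that violate time or size retention."""
    closed = segments[:-1]              # the active segment is never deleted
    doomed: list[Segment] = []
    if retention_ms is not None:
        for seg in closed:
            # A segment expires only when its newest record is too old.
            if now_ms - seg.max_timestamp_ms > retention_ms:
                doomed.append(seg)
            else:
                break
    if retention_bytes is not None:
        total = sum(s.size_bytes for s in segments) - sum(s.size_bytes for s in doomed)
        for seg in closed[len(doomed):]:
            # Delete only while the remaining log still meets the size budget.
            if total - seg.size_bytes < retention_bytes:
                break
            doomed.append(seg)
            total -= seg.size_bytes
    return doomed

segments = [
    Segment(0, 100, 1000),
    Segment(100, 100, 2000),
    Segment(200, 100, 3000),
    Segment(300, 50, 4000),   # active segment
]
by_time = segments_to_delete(segments, now_ms=5000, retention_ms=2500)
by_size = segments_to_delete(segments, now_ms=5000, retention_bytes=200)
```

Because deletion works on whole segments, a partition can hold somewhat more data than the configured limit: records stay until every record in their segment is past the boundary.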

Message Broker Low-Level Design: Topic Partitioning, Consumer Groups, Offset Management, and Durability

Topic and Partition Model

Message Format

Producer: Partition Assignment and Acks

Consumer Group and Offset Management

Log-Structured Storage

Replication: Leader and ISR

SQL DDL: Metadata Tables

Python: Core Operations

Design Considerations Summary