System Design: Apache Kafka — Architecture, Partitions, and Stream Processing

What Is Apache Kafka?

Apache Kafka is a distributed event streaming platform built for high-throughput, fault-tolerant, real-time data pipelines. Unlike traditional message queues (RabbitMQ, SQS), Kafka retains messages durably on disk even after they are consumed, so consumers can rewind and replay from any offset — making it both a message queue and a durable event log.

Core Architecture

Topics, Partitions, Offsets

  • Topic: logical stream of records (e.g., order-events). Append-only log.
  • Partition: a topic is split into N partitions, each an ordered, immutable sequence of records. Partitions enable parallelism — producers write to different partitions simultaneously.
  • Offset: integer position of a record within a partition. Consumer tracks its own offset per partition (stored in __consumer_offsets topic). Consumers can seek backward and replay.
  • Key-based partitioning: records with the same key always land in the same partition (consistent hashing on key), guaranteeing per-key ordering. Round-robin partitioning for keyless records.
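The routing rule above can be sketched in a few lines. Kafka's real default partitioner uses murmur2 (and, since 2.4, sticky partitioning for keyless records); this sketch substitutes MD5 as a deterministic stand-in to show the key property: the same key always maps to the same partition.

```python
import hashlib
from itertools import count

NUM_PARTITIONS = 6

def partition_for(key, rr=count()):
    """Route a record to a partition. Keyed records hash to a stable
    partition; keyless records fall back to round-robin. (MD5 stands in
    for Kafka's murmur2 here; the routing property is the same.)"""
    if key is None:
        return next(rr) % NUM_PARTITIONS
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# Same key -> same partition, so per-key ordering is preserved.
assert partition_for(b"user-42") == partition_for(b"user-42")
```

Because the mapping is a pure function of the key, adding partitions to an existing topic changes where keys land, which is why partition counts are hard to change after the fact.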

Brokers and Replication

  • A Kafka cluster has N brokers (three is a common minimum). Each partition has one leader replica and RF−1 follower replicas, where RF is the replication factor (typically 3).
  • Producers write to the leader; followers pull and replicate. Leader tracks which followers are in-sync (ISR — In-Sync Replicas).
  • acks=0: fire and forget (no durability). acks=1: leader acknowledgment only (fast, small risk of loss on leader crash). acks=all: all ISR must acknowledge (strong durability, higher latency).
  • If a leader crashes, the controller elects a new leader from the ISR. Unclean leader election (electing out-of-sync follower) can cause data loss — disabled by default.
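The acks rules interact with min.insync.replicas: with acks=all, the broker rejects writes outright when the ISR has shrunk below that threshold. This hypothetical helper (not broker code) captures the decision logic:

```python
def write_accepted(acks, isr_size, acked, min_insync=2):
    """Decide whether a produce request succeeds, mimicking broker rules.

    acks: 0, 1, or "all"; isr_size: current in-sync replica count;
    acked: number of replicas that have acknowledged the write.
    """
    if acks == 0:
        return True                      # fire and forget: never waits
    if acks == 1:
        return acked >= 1                # leader acknowledgment only
    # acks=all: reject up front if the ISR shrank below min.insync.replicas
    if isr_size < min_insync:
        return False                     # broker returns NotEnoughReplicas
    return acked >= isr_size             # every in-sync replica must ack

assert write_accepted("all", isr_size=3, acked=3)
assert not write_accepted("all", isr_size=1, acked=1)   # ISR below min.insync
assert write_accepted(1, isr_size=1, acked=1)           # acks=1 still succeeds
```

This is why acks=all alone is not enough for durability: without min.insync.replicas=2, the ISR can shrink to just the leader and "all" means "one".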

Zookeeper vs. KRaft

Historically, Kafka required ZooKeeper for cluster metadata (leader election, broker registration). KRaft (Kafka Raft) replaces it: Kafka manages its own metadata via an internal Raft consensus log, eliminating the ZooKeeper dependency. KRaft became production-ready in Kafka 3.3 and is the only mode in Kafka 4.0; it scales to millions of partitions and simplifies operations.

Producers

  • Batching: producers accumulate records into batches (batch.size) before sending. Larger batches improve throughput (fewer network round-trips) at the cost of latency (linger.ms adds wait time to fill batches).
  • Compression: GZIP, Snappy, LZ4, ZSTD — applied at the batch level. LZ4 is fastest for most workloads; ZSTD best for ratio.
  • Idempotent producer: set enable.idempotence=true. Broker deduplicates retried records using producer ID + sequence number. Prevents duplicates from network retries.
  • Transactions: atomic writes across multiple partitions/topics. Begin transaction → produce → commit. Consumer reads only committed records with isolation.level=read_committed.
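The idempotent-producer mechanism above can be simulated end to end: the broker remembers the highest sequence number seen per producer ID, so a network retry of an already-appended batch is recognized and dropped. A toy model, not broker internals:

```python
class Broker:
    """Toy broker-side dedup for the idempotent producer: each producer
    has an ID (pid), and each batch carries a sequence number. A retry
    resends the same (pid, seq), which the broker detects and drops."""
    def __init__(self):
        self.log = []
        self.last_seq = {}   # producer id -> highest sequence appended

    def append(self, pid, seq, record):
        if self.last_seq.get(pid, -1) >= seq:
            return "duplicate"           # retry of an already-appended batch
        self.log.append(record)
        self.last_seq[pid] = seq
        return "appended"

b = Broker()
assert b.append(pid=7, seq=0, record="order-1") == "appended"
assert b.append(pid=7, seq=0, record="order-1") == "duplicate"  # network retry
assert len(b.log) == 1   # the retry did not create a second copy
```

Note the guarantee is per producer session and per partition; it does not deduplicate application-level resends with a new producer ID.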

Consumers and Consumer Groups

  • Consumer group: set of consumers sharing a group.id. Kafka assigns each partition to exactly one consumer in the group. With 6 partitions and 3 consumers: each consumer reads 2 partitions.
  • Rebalancing: when a consumer joins or leaves the group, Kafka reassigns partitions. Eager rebalance revokes all partitions first (causes pause); cooperative (incremental) rebalance moves only affected partitions.
  • Offset commit: consumers periodically commit their current offset. Auto-commit (enable.auto.commit=true) can both duplicate records (crash after processing but before commit) and lose them (crash after commit but before processing finishes). Manual commit after processing ensures at-least-once. Exactly-once requires transactional producers + consumers.
  • Lag: difference between latest offset and consumer’s committed offset. High lag → consumer falling behind. Monitor with kafka-consumer-groups.sh --describe or Burrow.
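The "6 partitions, 3 consumers → 2 each" arithmetic comes from the group assignor. A minimal round-robin sketch (Kafka's real range/sticky assignors are more involved, but the one-owner-per-partition invariant is the same):

```python
def assign_partitions(partitions, consumers):
    """Spread partitions across a consumer group round-robin style,
    assigning each partition to exactly one consumer."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

a = assign_partitions(list(range(6)), ["c1", "c2", "c3"])
assert all(len(parts) == 2 for parts in a.values())   # 6 / 3 = 2 each

# A 7th consumer sits idle: only 6 partitions to hand out.
a = assign_partitions(list(range(6)), [f"c{i}" for i in range(7)])
assert [] in a.values()
```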

Retention and Compaction

  • Time-based retention: delete records older than log.retention.hours (default 168h/7 days).
  • Size-based retention: delete oldest segments when partition exceeds log.retention.bytes.
  • Log compaction: for topics configured with cleanup.policy=compact, Kafka retains only the latest record per key. Used for changelog topics (materialized views, database CDC). Deleted records use tombstone values (null).
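Compaction is easy to model: replay the log keeping only the last value per key, then drop tombstoned keys. A conceptual sketch (real compaction works segment by segment in the background):

```python
def compact(log):
    """Keep only the latest record per key; a tombstone (None value)
    marks the key for removal once compaction runs."""
    latest = {}
    for key, value in log:            # later records overwrite earlier ones
        latest[key] = value
    return [(k, v) for k, v in latest.items() if v is not None]

log = [("user-1", "a@x.com"), ("user-2", "b@x.com"),
       ("user-1", "a@y.com"),        # update: supersedes the first record
       ("user-2", None)]             # tombstone: delete user-2
assert compact(log) == [("user-1", "a@y.com")]
```

This is why a compacted topic can serve as a full snapshot of current state: a consumer reading it from the beginning sees exactly one live record per key.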

Kafka Streams and KSQL

  • Kafka Streams: Java library for stream processing directly within Kafka. No separate cluster needed. Stateful operations (joins, aggregations) use RocksDB-backed state stores replicated to changelog topics.
  • ksqlDB: SQL interface over Kafka Streams. Create streams and tables with SQL; push queries for real-time results.
  • Flink/Spark: for heavy stateful processing (complex joins, windowed aggregations across large state), use Flink or Spark Structured Streaming reading from Kafka.
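Conceptually, a Kafka Streams groupByKey().count() maintains per-key state that is updated one event at a time. This sketch uses a plain dict standing in for the RocksDB-backed state store (assumed simplification; the real library also handles repartitioning, fault tolerance, and changelog replication):

```python
from collections import defaultdict

def count_by_key(events, state=None):
    """Fold a stream of (key, value) events into a per-key count, the
    way a Streams KTable materializes groupByKey().count(). `state`
    stands in for the RocksDB-backed state store."""
    state = defaultdict(int) if state is None else state
    for key, _value in events:
        state[key] += 1
        # In real Streams, each update is also written to a changelog
        # topic so the store can be rebuilt after a crash.
    return state

counts = count_by_key([("page:/home", 1), ("page:/about", 1), ("page:/home", 1)])
assert counts["page:/home"] == 2
```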

Common Design Patterns

  • Event sourcing: every state change is an event. Kafka is the event log; services rebuild state by replaying. Pairs with CQRS — write path = Kafka producer, read path = consumer updating a read model.
  • Outbox pattern: in a distributed transaction, write to DB + write to an outbox table atomically. A CDC tool (Debezium) reads the outbox and publishes to Kafka. Avoids dual-write inconsistency.
  • Fan-out: one producer → multiple consumer groups, each processing independently (analytics, notifications, search indexing).
  • Dead letter queue (DLQ): failed messages are sent to a separate DLQ topic after N retries. A separate consumer processes DLQ for alerting and manual reprocessing.
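The DLQ flow above, sketched with a hypothetical consume_with_dlq helper (topic names and retry policy are assumptions, not a fixed Kafka API):

```python
def consume_with_dlq(records, handler, max_retries=3):
    """Process each record, retrying transient failures; after
    max_retries the record is routed to a dead-letter topic instead
    of blocking the partition."""
    dlq = []
    for rec in records:
        for attempt in range(max_retries):
            try:
                handler(rec)
                break
            except Exception:
                if attempt == max_retries - 1:
                    dlq.append(rec)   # would produce to <topic>.DLQ in practice
    return dlq

def handler(rec):
    if rec == "poison":
        raise ValueError("cannot parse")

assert consume_with_dlq(["ok", "poison", "ok"], handler) == ["poison"]
```

The key property: one unparseable record does not stall the partition, and healthy records behind it keep flowing.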

Throughput and Scaling

  • Kafka handles millions of messages/second. LinkedIn’s cluster processes 7 trillion messages/day.
  • Increase partitions to scale throughput (more parallelism). Rule: partitions ≥ peak_throughput / per_partition_throughput (typically 10–50 MB/s per partition depending on disk).
  • Scale consumers: add consumers up to the number of partitions. Beyond that, consumers sit idle.
  • Multi-region: MirrorMaker 2 replicates topics between clusters. Active-active is complex due to offset divergence — use different topic namespaces per region.
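The partition-sizing rule above as arithmetic (the headroom multiplier is an assumption added here, since you want slack for growth and hot keys):

```python
import math

def partitions_needed(peak_mb_s, per_partition_mb_s, headroom=1.5):
    """partitions >= peak_throughput / per_partition_throughput,
    with a headroom multiplier for growth and uneven key distribution."""
    return math.ceil(peak_mb_s * headroom / per_partition_mb_s)

# 600 MB/s peak at ~20 MB/s per partition, 1.5x headroom -> 45 partitions
assert partitions_needed(600, 20) == 45
```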

Interview Questions: Kafka

Q: Why is Kafka faster than traditional message queues?

Sequential disk I/O (append-only log) is faster than random I/O. Zero-copy: Kafka uses sendfile() system call to transfer data from disk to network socket without copying to userspace. Batching and compression reduce network overhead. Consumer pull model avoids push-related backpressure issues.

Q: How do you guarantee exactly-once delivery in Kafka?

Three requirements: (1) Idempotent producer (enable.idempotence=true) — deduplicates retried produces. (2) Transactional producer — atomically commits across partitions. (3) Consumer with isolation.level=read_committed — ignores uncommitted records. Together these form Kafka’s exactly-once semantics (EOS). Exactly-once end-to-end also requires idempotent consumers (e.g., database upsert with unique constraint).
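The "idempotent consumer" half of end-to-end exactly-once can be shown with a database upsert: a unique key makes reprocessing the same Kafka record a no-op. A minimal sketch using SQLite (table and event names are illustrative):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (event_id TEXT PRIMARY KEY, status TEXT)")

def apply_event(event_id, status):
    """UPSERT keyed on event_id: redelivering the same record
    updates the same row instead of inserting a duplicate."""
    db.execute(
        "INSERT INTO orders (event_id, status) VALUES (?, ?) "
        "ON CONFLICT(event_id) DO UPDATE SET status = excluded.status",
        (event_id, status),
    )

apply_event("evt-1", "shipped")
apply_event("evt-1", "shipped")        # redelivery: same row, no duplicate
assert db.execute("SELECT COUNT(*) FROM orders").fetchone()[0] == 1
```

With this in place, at-least-once delivery from Kafka becomes effectively-once in the read model, even without broker-side transactions.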

Q: What happens when a consumer is slower than the producer?

Consumer lag grows. Kafka retains data per retention policy, so as long as lag doesn’t exceed retention, the consumer can catch up. If lag exceeds retention, the oldest messages expire before the consumer reads them — lost for that consumer, even though the data existed. Solutions: increase consumer parallelism (add partitions + consumers), optimize per-record processing, or route high-priority data to its own topic with a dedicated consumer group.
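Lag is just per-partition subtraction, the same number kafka-consumer-groups.sh --describe reports:

```python
def consumer_lag(latest_offsets, committed_offsets):
    """Per-partition lag = log-end offset minus committed offset."""
    return {p: latest_offsets[p] - committed_offsets.get(p, 0)
            for p in latest_offsets}

lag = consumer_lag({0: 1_000, 1: 1_500}, {0: 990, 1: 800})
assert lag == {0: 10, 1: 700}   # partition 1 is falling behind
```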

Q: What is the difference between a Kafka topic partition and a consumer group?

A partition is a unit of parallelism for producers — data is written to partitions (using key-based or round-robin assignment). A consumer group is a set of consumers that collectively read a topic — Kafka assigns each partition to exactly one consumer in the group. With 6 partitions and 3 consumers, each consumer reads 2 partitions. Adding a 7th consumer leaves it idle — you can’t have more active consumers than partitions. Multiple independent consumer groups each receive all messages from the topic (fan-out).

Q: What does acks=all mean in Kafka and when should you use it?

acks=all (or acks=-1) means the producer waits for all in-sync replicas (ISR) to acknowledge the write before returning success. This provides the strongest durability guarantee — no data loss even if the leader crashes immediately after the write, as long as at least one ISR survives. Trade-off: higher write latency (must wait for all ISR round-trips). Use acks=all for financial transactions, audit logs, or any data where loss is unacceptable. Combine with min.insync.replicas=2 to ensure the ISR has at least 2 replicas.

Q: How does Kafka guarantee message ordering?

Kafka guarantees ordering within a partition — messages are appended sequentially and consumed in order. Across partitions, there is no ordering guarantee. To enforce per-entity ordering (e.g., all events for a user), use the entity ID as the message key — Kafka hashes the key to consistently route all messages for that key to the same partition. With a single partition per topic you get global order, but lose parallelism. For most use cases, per-key ordering within a partition is sufficient.

Q: What is log compaction in Kafka and when do you use it?

Log compaction retains only the latest record per key in a partition, rather than all records. It’s configured per topic with cleanup.policy=compact. Use it for: change data capture (CDC) topics where you only need the latest state of each entity, materialized views, or configuration topics where earlier values are superseded. Deleted records use tombstone messages (null value) — compaction eventually removes them. Compacted topics allow consumers to rebuild the complete current state by reading all retained records, even after many updates.

Q: What is the outbox pattern and why is it used with Kafka?

The outbox pattern solves the dual-write problem: atomically writing to a database AND publishing to Kafka. Without it, if you write to the DB and then the process crashes before publishing to Kafka, the event is lost. Solution: write to both the main table AND an outbox table in the same database transaction. A CDC tool (Debezium) reads the outbox table’s changes via the database’s write-ahead log and publishes to Kafka. The database transaction is the atomicity boundary — either both writes happen or neither does. The outbox is then consumed and cleared asynchronously.
