Apache Kafka processes trillions of events per day at companies like LinkedIn, Netflix, Uber, and Airbnb. Designing Kafka itself — the distributed commit log, partition replication, consumer group protocol, and exactly-once semantics — tests deep distributed systems knowledge. This guide covers the internal architecture for senior infrastructure interviews.
The Distributed Commit Log
Kafka is fundamentally an immutable, ordered, append-only log. A topic is divided into partitions; each partition is an independent log stored on a broker, and messages are assigned sequential offsets (0, 1, 2, ...) within a partition.

- Immutable: once written, messages cannot be modified or deleted (until retention expires or compaction removes old versions).
- Ordered: messages within a partition are strictly ordered by offset. Across partitions there is no ordering guarantee.
- Append-only: new messages are always written to the end. Reads can start from any offset (replay from the beginning, or resume from a specific point).

Storage: each partition is stored as a sequence of segment files on disk. Each segment consists of a .log file (message data) and a .index file (offset-to-byte-position mapping for fast seeks). Segments are immutable once full (rolled by size or time). The active segment is the one being written to; old segments become eligible for deletion (retention policy) or compaction.

Sequential I/O: Kafka achieves high throughput because all writes are sequential appends, the fastest I/O pattern on any storage device. Reads for recent data hit the page cache (the OS caches recently written files); reads for old data are sequential scans of segment files. This design avoids random I/O entirely.

Zero-copy: Kafka uses sendfile() to transfer data directly from the page cache to the network socket (kernel-to-kernel, no user-space copy). This is why Kafka can serve 100+ MB/sec per partition with low CPU.
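The segment-and-seek design above can be sketched in a few lines. This is a minimal in-memory illustration, not Kafka's actual on-disk format: real brokers store segments as files, index by byte position, and roll segments by bytes or time. The names (PartitionLog, SEGMENT_SIZE) are invented for this sketch.

```python
import bisect

class PartitionLog:
    """Toy model of one partition: append-only segments with base offsets."""
    SEGMENT_SIZE = 4  # roll a new segment every 4 records (stand-in for bytes)

    def __init__(self):
        self.base_offsets = [0]   # first offset of each segment (the "index")
        self.segments = [[]]      # each segment: list of (offset, payload)
        self.next_offset = 0

    def append(self, payload):
        if len(self.segments[-1]) == self.SEGMENT_SIZE:
            self.base_offsets.append(self.next_offset)  # old segment is frozen
            self.segments.append([])                    # new active segment
        self.segments[-1].append((self.next_offset, payload))
        self.next_offset += 1
        return self.next_offset - 1

    def read_from(self, offset, max_records=10):
        # Seek: binary-search for the segment whose base offset <= requested
        # offset, then scan forward sequentially (no random I/O).
        seg = max(bisect.bisect_right(self.base_offsets, offset) - 1, 0)
        out = []
        for segment in self.segments[seg:]:
            for off, payload in segment:
                if off >= offset:
                    out.append((off, payload))
                if len(out) == max_records:
                    return out
        return out
```

A consumer replaying from offset 5 calls `read_from(5)`: the index lookup finds the segment containing offset 5, and the scan from there is purely sequential, which is the property the page-cache and sendfile() optimizations rely on.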
Partitions and Replication
Partitions enable horizontal scaling: a topic with 12 partitions can be served by up to 12 brokers, each owning some of the partitions. More partitions mean more parallelism for both producers and consumers.

Partition assignment: the producer assigns each message to a partition by key hash (hash(key) % num_partitions, so messages with the same key go to the same partition, preserving per-key ordering) or round-robin when there is no key (even distribution, no per-key ordering).

Replication: each partition has one leader and N-1 follower replicas (replication factor N, typically 3). The leader handles all reads and writes; followers replicate from the leader. The ISR (In-Sync Replicas) is the set of followers caught up with the leader, i.e. those that have fetched within replica.lag.time.max.ms (the older message-count setting, replica.lag.max.messages, was removed in Kafka 0.9). With acks=all, a write is acknowledged to the producer only after all ISR replicas have written it. A follower that falls too far behind is removed from the ISR.

Leader election: if the leader broker fails, a new leader is elected from the ISR. The controller (a single broker elected via ZooKeeper or KRaft) manages partition leadership.

KRaft: introduced in Kafka 2.8 and production-ready since 3.3, KRaft replaces ZooKeeper with a Raft-based consensus protocol built into Kafka. A controller quorum of 3-5 nodes manages metadata, which simplifies deployment (no separate ZooKeeper cluster).

Durability guarantee: with replication factor 3 and acks=all, a message is durable as long as 2 of 3 replicas survive. Losing 2 of 3 replicas simultaneously loses data (extremely unlikely with rack-aware placement).
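The key-hash and round-robin assignment rules can be sketched as below. Two hedges: Kafka's Java client actually hashes keys with murmur2 (crc32 is a stand-in here), and modern clients use sticky partitioning for keyless messages rather than strict per-message round-robin; the Partitioner class is illustrative, not a client API.

```python
import zlib
from itertools import count

class Partitioner:
    """Toy producer partitioner: keyed messages are stable, keyless rotate."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._rr = count()  # round-robin counter for keyless messages

    def partition(self, key):
        if key is None:
            # No key: spread evenly, no ordering guarantee.
            return next(self._rr) % self.num_partitions
        # Keyed: hash(key) % num_partitions, so the same key always lands
        # on the same partition, preserving per-key ordering.
        return zlib.crc32(key.encode()) % self.num_partitions
```

Note the implication for scaling: because the mapping depends on num_partitions, adding partitions to an existing topic changes where keys land and breaks per-key ordering across the resize.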
Consumer Groups and Offset Management
Consumer groups enable parallel consumption: each partition is assigned to exactly one consumer in the group. A group with 4 consumers and 12 partitions gives each consumer 3 partitions. To scale, add more consumers, up to the partition count; any consumers beyond that sit idle.

Group coordination: when a consumer joins, leaves, or crashes, a rebalance occurs and partitions are redistributed among the active consumers. The group coordinator (a specific broker) manages this. Two rebalance protocols exist: eager (stop all consumers and reassign all partitions, causing downtime during the rebalance) and cooperative/incremental (reassign only the affected partitions, with minimal disruption; the modern approach).

Offset management: each consumer tracks its position (offset) per partition. After processing a batch of messages, the consumer commits the offset ("I have processed up to offset 1234 in partition 5") to a special Kafka topic (__consumer_offsets). If the consumer crashes and restarts, or a new consumer takes over after a rebalance, it resumes from the last committed offset.

Delivery semantics follow from when you commit:
- At-least-once: commit after processing. If the consumer crashes after processing but before committing, it reprocesses those messages on restart (duplicates).
- At-most-once: commit before processing. If the consumer crashes after committing but before processing, those messages are lost (skipped).
- Exactly-once: see below.
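The partition-to-consumer math (4 consumers, 12 partitions, 3 each) is what a range-style assignor computes during a rebalance. A simplified sketch, assuming sorted member IDs and a flat list of partition numbers:

```python
def assign(partitions, consumers):
    """Range-style assignment: split partitions as evenly as possible,
    exactly one owner per partition. Earlier consumers absorb any remainder."""
    consumers = sorted(consumers)
    per, extra = divmod(len(partitions), len(consumers))
    assignment, start = {}, 0
    for i, consumer in enumerate(consumers):
        size = per + (1 if i < extra else 0)  # first `extra` get one more
        assignment[consumer] = partitions[start:start + size]
        start += size
    return assignment

groups = assign(list(range(12)), ["c1", "c2", "c3", "c4"])
# 12 partitions / 4 consumers: each consumer owns 3 partitions
```

When the counts do not divide evenly (say 7 partitions, 3 consumers), the first consumers get one extra partition; real assignors additionally try to minimize partition movement between rebalances, which this sketch ignores.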
Exactly-Once Semantics (EOS)
Kafka 0.11+ supports exactly-once semantics through three mechanisms:

(1) Idempotent producer: the producer assigns a sequence number to each message, and the broker detects and deduplicates retried messages (same sequence number). Even with retries after a network timeout, each message is written exactly once to the partition. Enabled with enable.idempotence=true.

(2) Transactions: the producer can atomically write to multiple partitions and commit consumer offsets in a single transaction; either all writes succeed (commit) or none do (abort). Use this for consume-transform-produce pipelines: read from topic A, process, write to topic B, and commit the offset on A, all atomically. If processing crashes, the transaction aborts and the consumer reprocesses from the last committed offset; the partial writes to topic B remain invisible to consumers.

(3) Consumer isolation: consumers with isolation.level=read_committed only see messages from committed transactions; uncommitted or aborted messages are invisible.

Together these provide end-to-end exactly-once within Kafka (read from one topic, process, write to another). For external systems (for example, writing to a database), exactly-once requires the consumer to be idempotent, deduplicating writes to the external system using a unique message ID. Kafka EOS does not extend to external systems; it covers only Kafka-to-Kafka processing.
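The idempotent-producer mechanism in (1) can be sketched as a broker that remembers the last sequence number seen per producer. This is a deliberate simplification: real brokers track sequences per producer-partition and per batch, and reject genuinely out-of-order sequences with an error rather than silently skipping them. The Broker class and its return values are invented for illustration.

```python
class Broker:
    """Toy broker-side dedup for an idempotent producer."""

    def __init__(self):
        self.log = []
        self.last_seq = {}  # producer_id -> highest sequence number accepted

    def produce(self, producer_id, seq, payload):
        # A retried message carries the same sequence number it had the
        # first time, so "seq we have already accepted" means duplicate.
        if self.last_seq.get(producer_id, -1) >= seq:
            return "duplicate"  # already written; do not append again
        self.last_seq[producer_id] = seq
        self.log.append(payload)
        return "ok"
```

The key point: the producer retries blindly on a network timeout (it cannot know whether the first attempt landed), and the sequence number lets the broker make the retry safe.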
Log Compaction and Retention
Two retention strategies:

(1) Time/size-based deletion: delete segments older than retention.ms (7 days by default) or once the partition exceeds retention.bytes. Simple: old data is gone. Use for event streams where historical events lose value over time (logs, metrics, clickstream).

(2) Log compaction: retain only the latest value for each key. The compactor runs in the background: for each key, it keeps only the message with the highest offset (the most recent update) and deletes older messages for that key. The result is a topic containing a snapshot of the latest state for every key. Use for changelog topics (database CDC), state stores (KTable materialization), and any topic where you need "the current value for key X."

Compaction + retention can be combined: keep the latest value for each key, and delete entries older than 7 days (even if they are the latest for their key). Tombstones: to delete a key from a compacted topic, produce a message with that key and a null value; the compactor removes the key entirely after a configurable delay (delete.retention.ms).

Kafka Connect and Kafka Streams

Kafka Connect is a framework for importing and exporting data between Kafka and external systems. Source connectors read from databases (Debezium CDC), files, and APIs and produce to Kafka; sink connectors consume from Kafka and write to databases, search indexes (Elasticsearch), and data lakes (S3).

Kafka Streams is a library for stream processing applications: stateful transformations (join, aggregate, windowing) on Kafka topics. It runs as a regular Java application (no separate cluster needed) and uses Kafka topics for state storage (compacted changelog topics).
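The keep-latest-per-key compaction described in this section, including tombstone handling, can be sketched as a single pass over a log of (key, value) records. A simplification: the real compactor works segment by segment, never touches the active segment, and delays tombstone removal by delete.retention.ms; this sketch compacts everything at once.

```python
def compact(log):
    """Keep only the highest-offset record per key; a record whose value is
    None is a tombstone that removes the key entirely. Input is a list of
    (key, value) pairs in offset order; output preserves offset order."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)  # later records overwrite earlier ones
    compacted = []
    for key, (offset, value) in sorted(latest.items(), key=lambda kv: kv[1][0]):
        if value is not None:          # tombstone: drop the key completely
            compacted.append((offset, key, value))
    return compacted

log = [("k1", "v1"), ("k2", "v1"), ("k1", "v2"), ("k2", None)]
# compact(log) keeps k1's latest value and deletes k2 via its tombstone
```

This is exactly the shape a changelog topic needs: replaying the compacted topic from the beginning rebuilds "the current value for key X" without replaying every historical update.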