System Design Interview: Apache Kafka Architecture Deep Dive

Apache Kafka underpins event streaming at LinkedIn, Uber, Netflix, Airbnb, and thousands of other companies. Understanding Kafka architecture deeply — not just “it is a message queue” — separates strong system design candidates from the rest.

Core Architecture

"""
Kafka is a distributed commit log, not just a message queue.

Key properties that make it unique:
  1. Persistent storage: messages stored on disk (log segments)
     → Consumers can re-read any offset; enables replay
  2. Sequential I/O: append-only log writes avoid random seeks → sequential disk throughput approaches memory speed
  3. Zero-copy: sendfile() syscall sends data from disk to network socket
     without copying through user space
  4. Pull-based: consumers pull at their own rate (no back-pressure issues)
  5. Consumer groups: N consumers share work across partitions

Log anatomy:
  Topic: events → partitioned into N partitions
  Partition: ordered, immutable sequence of records
  Segment: physical file containing records (default max 1GB or 7 days)
  Offset: unique sequential ID within a partition

  Partition 0: [offset 0][offset 1][offset 2]...[offset 1M]
  Partition 1: [offset 0][offset 1]...
  Partition 2: [offset 0]...

  Consumer tracks offset per partition — reads at its own pace.
  Multiple consumer groups → each reads ALL messages independently.
"""

Replication and Durability

"""
Replication: each partition has 1 leader + (RF-1) followers.
  RF=3: leader on broker 1, followers on brokers 2 and 3.
  Writes go to leader → leader replicates to followers.
  ISR (In-Sync Replicas): followers within lag threshold.
  min.insync.replicas=2: write acknowledged only when 2 ISR confirm.

Durability settings:
  acks=0:   Producer does not wait (fastest, may lose data)
  acks=1:   Leader acknowledges (leader fail → data loss)
  acks=all: All ISR acknowledge (safe; combine with min.insync.replicas=2)

Replication factor vs durability:
  RF=1: no redundancy — single point of failure
  RF=2: can survive 1 broker failure, but risky during rolling restart
  RF=3: production standard — survive 1 failure with room to spare

Leader election:
  KRaft mode (Kafka 3.3+ native, replaces ZooKeeper):
    KRaft controller quorum handles metadata and leader election.
  Legacy (ZooKeeper):
    Controller watches ZooKeeper for broker health; elects new leaders.

Unclean leader election:
  unclean.leader.election.enable=false (default, recommended):
    Only ISR can become leader → durability preserved.
  =true: out-of-sync replica can become leader → possible data loss.
    Use only when availability beats consistency.
"""

Partitioning Strategy

from confluent_kafka import Producer

"""
Partitioning determines which partition a message goes to:
  No key: round-robin across partitions (even distribution, no ordering)
  Key provided: hash(key) % num_partitions → same key always same partition
    → ordering guaranteed per key
    → watch for hot partitions if key distribution is skewed

Choosing partition count:
  More partitions → more parallelism → higher throughput
  But: each partition has overhead (file handles, replication traffic)
  Rule of thumb: target_throughput / throughput_per_partition
  Start with 3-6 per topic; scale when needed (adding partitions
  changes key→partition mapping, can break ordering guarantees for
  existing consumers)

Compacted topics:
  log.cleanup.policy=compact
  Kafka retains only the LATEST record per key (like a KV store)
  Good for: database change data capture (CDC), materialized views
  Old offsets: records with same key replaced, old ones garbage collected
"""

import zlib

def pick_partition(key, num_partitions):
    """Priority-based routing: VIP messages always land on partition 0."""
    if key == b"vip":
        return 0
    # zlib.crc32 is stable across processes; Python's built-in hash() is
    # randomized per interpreter run and would scatter keys on restart.
    return zlib.crc32(key) % (num_partitions - 1) + 1  # Partitions 1..N-1

# Note: confluent-kafka-python's "partitioner" config only accepts
# librdkafka's built-in partitioner names (e.g. "murmur2_random"), not a
# Python callable — so compute the partition in application code and pass
# it explicitly to produce():
producer = Producer({"bootstrap.servers": "kafka:9092"})
producer.produce("events", key=b"vip", value=b"...",
                 partition=pick_partition(b"vip", num_partitions=6))

Exactly-Once Semantics

"""
Idempotent Producer (enable.idempotence=true):
  Each message gets a sequence number per producer epoch.
  Broker deduplicates retried messages with same sequence number.
  Prevents duplicate writes from producer retries.
  Cost: small overhead per message.

Transactional API (transactional.id=...):
  Allows atomic writes across multiple partitions/topics.
  Use case: read-process-write pipelines (consume, transform, produce)
    → either all writes committed or none.
  2-phase commit: producer begins transaction, all brokers either
    commit or abort together.
  Cost: latency (100ms+ per transaction); use for critical paths only.

Kafka Streams exactly-once:
  processing.guarantee=exactly_once_v2 (Kafka 3.0+; introduced as
    exactly_once_beta in 2.5)
  Internally uses transactional API for read-process-write loops.
"""

from confluent_kafka import Producer, Consumer

producer = Producer({
    "bootstrap.servers":    "kafka:9092",
    "enable.idempotence":   True,
    "transactional.id":     "my-service-001",  # Unique per producer instance
    "acks":                 "all",
})

consumer = Consumer({
    "bootstrap.servers":            "kafka:9092",
    "group.id":                     "my-consumer-group",
    "isolation.level":              "read_committed",  # Skip uncommitted tx records
    "enable.auto.commit":           False,
})

# Exactly-once read-process-write loop
producer.init_transactions()

def process_batch(messages):
    producer.begin_transaction()
    try:
        for msg in messages:
            record = process(msg)
            producer.produce("output-topic", key=record["key"],
                             value=record["value"])

        # Commit consumer offsets as part of the transaction
        from confluent_kafka import TopicPartition
        offsets = [TopicPartition(m.topic(), m.partition(), m.offset() + 1)
                   for m in messages]
        producer.send_offsets_to_transaction(
            offsets, consumer.consumer_group_metadata())
        producer.commit_transaction()
    except Exception:
        producer.abort_transaction()
        raise

Kafka Streams vs Flink vs Spark Streaming

Kafka Streams:
  Deployment:       library (no cluster needed)
  State management: RocksDB (local)
  Latency:          milliseconds
  Throughput:       medium
  Exactly-once:     yes (Kafka-to-Kafka)
  SQL support:      limited (ksqlDB)
  Best for:         simple stateful transforms, microservices

Apache Flink:
  Deployment:       separate cluster
  State management: RocksDB + remote (configurable)
  Latency:          milliseconds
  Throughput:       very high
  Exactly-once:     yes (end-to-end)
  SQL support:      Flink SQL
  Best for:         complex event processing, low latency

Spark Structured Streaming:
  Deployment:       Spark cluster
  State management: checkpointed state store
  Latency:          100ms+ (micro-batch)
  Throughput:       very high
  Exactly-once:     yes (with supported sinks)
  SQL support:      Spark SQL
  Best for:         unified batch + streaming

When to Use Kafka vs Alternatives

Use Kafka when:
  High throughput: millions of events/second
  Message replay needed (re-read history)
  Multiple independent consumer groups
  Ordered processing per key
  Event sourcing / audit log
  Data pipeline with multiple sinks

Consider alternatives when:
  Simple job queues: SQS, RabbitMQ are simpler
  Short retention (hours): SQS with visibility timeout
  Request/reply pattern: use gRPC or HTTP
  At-most-once IoT telemetry: MQTT is lighter weight
  Pub/sub with fan-out filters: SNS + SQS
  Simple task queues: Celery/Redis Queue

Frequently Asked Questions

What guarantees does Kafka provide for message delivery?

Kafka offers three delivery semantics. At-most-once: producer fires and forgets, consumer commits before processing — messages can be lost. At-least-once (default): producer retries on failure, consumer commits after processing — duplicates possible. Exactly-once: requires idempotent producer (enable.idempotence=true) plus transactional API on producer, and read_committed isolation on consumer. Exactly-once across Kafka-to-Kafka pipelines is supported natively; Kafka-to-external-system exactly-once requires idempotent writes at the destination.
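The difference between at-most-once and at-least-once comes down to when the offset commit happens relative to processing. The toy run below makes that concrete by crashing a consumer mid-batch; the function and message names are invented for illustration.

```python
def run(order, crash_at):
    """Simulate a consumer that dies mid-batch.

    order='commit-first'  -> commit BEFORE processing (at-most-once)
    order='process-first' -> commit AFTER processing (at-least-once)
    crash_at: index of the message during which the consumer dies mid-step.
    Returns (committed offset, messages actually processed).
    """
    msgs = ["m0", "m1", "m2"]
    committed, processed = 0, []
    for i, m in enumerate(msgs):
        if order == "commit-first":
            committed = i + 1
            if i == crash_at:           # crash after commit, before processing
                break
            processed.append(m)
        else:
            processed.append(m)
            if i == crash_at:           # crash after processing, before commit
                break
            committed = i + 1
    return committed, processed

# At-most-once: offset 2 is committed but m1 was never processed -> m1 is lost.
assert run("commit-first", crash_at=1) == (2, ["m0"])
# At-least-once: m1 was processed but offset is still 1 -> restart replays m1.
assert run("process-first", crash_at=1) == (1, ["m0", "m1"])
```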

How does Kafka achieve high throughput?

Kafka combines several I/O optimizations: (1) sequential disk writes — producers append to the end of a log file, avoiding random seeks, so even spinning disks sustain high throughput; (2) zero-copy transfers — sendfile() syscall moves data from page cache to socket without copying to user space; (3) batching — producer batches messages before sending, amortizing network RTT; (4) compression — batches are compressed (snappy/lz4/zstd) reducing network and disk I/O; (5) partition parallelism — multiple partitions allow parallel reads and writes across brokers.
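The batching effect in point (3) is easy to see with back-of-envelope arithmetic. The numbers below are illustrative assumptions (1 ms broker round trip, serialization cost ignored), not benchmarks:

```python
# Assumptions for illustration: 1 ms broker round trip per produce request,
# serialization and broker-side work ignored.
rtt_s = 0.001
msgs_per_batch = 500

# One message per request: every message pays the full round trip.
unbatched_msgs_per_sec = 1 / rtt_s                  # 1,000 msg/s

# 500-message batches: the round trip is amortized across the whole batch.
batched_msgs_per_sec = msgs_per_batch / rtt_s       # 500,000 msg/s

assert unbatched_msgs_per_sec == 1_000
assert batched_msgs_per_sec == 500_000
```

In the real producer this trade-off is tuned with batch.size and linger.ms: a small linger delay lets batches fill, trading a few milliseconds of latency for a large multiple of throughput.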

What is Kafka consumer group lag and why does it matter?

Consumer group lag is the difference between the latest offset produced to a partition and the committed offset of the consumer group. Lag = 0 means consumers are caught up; high lag means consumers are falling behind producers. Lag matters because it quantifies data freshness — high lag means downstream systems are processing stale data. Monitor lag per partition using kafka-consumer-groups.sh or Burrow. Remedies: increase consumer parallelism (add instances up to partition count), optimize consumer processing, or increase retention so consumers can catch up without data loss.
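The lag arithmetic itself is simple enough to sketch. In production the offsets would come from kafka-consumer-groups.sh or the admin API; here they are hand-written numbers chosen to show the per-partition calculation:

```python
def group_lag(log_end_offsets, committed_offsets):
    """Per-partition lag = log-end offset minus the group's committed offset.

    A partition the group has never committed counts as lagging from offset 0.
    """
    return {p: log_end_offsets[p] - committed_offsets.get(p, 0)
            for p in log_end_offsets}

# Illustrative numbers for a 3-partition topic named "orders".
log_end   = {"orders-0": 1_500, "orders-1": 1_520, "orders-2": 900}
committed = {"orders-0": 1_500, "orders-1":   980, "orders-2": 900}

lag = group_lag(log_end, committed)
assert lag == {"orders-0": 0, "orders-1": 540, "orders-2": 0}
assert sum(lag.values()) == 540   # only partition 1 is falling behind
```

Note that lag concentrated in one partition (as above) usually points at a hot key or a slow consumer instance, while uniform lag across partitions points at overall under-provisioning.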

