Apache Kafka underpins event streaming at LinkedIn, Uber, Netflix, Airbnb, and thousands of other companies. Understanding Kafka architecture deeply — not just “it is a message queue” — separates strong system design candidates from the rest.
Core Architecture
"""
Kafka is a distributed commit log, not just a message queue.
Key properties that make it unique:
1. Persistent storage: messages stored on disk (log segments)
→ Consumers can re-read any offset; enables replay
2. Sequential I/O: appends to log files → sequential disk access avoids seeks
   and is served through the OS page cache, rivaling memory-speed throughput
3. Zero-copy: sendfile() syscall sends data from disk to network socket
without copying through user space
4. Pull-based: consumers pull at their own rate (no back-pressure issues)
5. Consumer groups: N consumers share work across partitions
Log anatomy:
Topic: events → partitioned into N partitions
Partition: ordered, immutable sequence of records
Segment: physical file containing records (default max 1GB or 7 days)
Offset: unique sequential ID within a partition
Partition 0: [offset 0][offset 1][offset 2]...[offset 1M]
Partition 1: [offset 0][offset 1]...
Partition 2: [offset 0]...
Consumer tracks offset per partition — reads at its own pace.
Multiple consumer groups → each reads ALL messages independently.
"""
Replication and Durability
"""
Replication: each partition has 1 leader + (RF-1) followers.
RF=3: leader on broker 1, followers on brokers 2 and 3.
Writes go to leader → leader replicates to followers.
ISR (In-Sync Replicas): followers within lag threshold.
min.insync.replicas=2: write acknowledged only when 2 ISR confirm.
Durability settings:
acks=0: Producer does not wait (fastest, may lose data)
acks=1: Leader acknowledges (leader fail → data loss)
acks=all: All ISR acknowledge (safe; combine with min.insync.replicas=2)
Replication factor vs durability:
RF=1: no redundancy — single point of failure
RF=2: can survive 1 broker failure, but risky during rolling restart
RF=3: production standard — survive 1 failure with room to spare
Leader election:
KRaft mode (Kafka 3.3+ native, replaces ZooKeeper):
KRaft controller quorum handles metadata and leader election.
Legacy (ZooKeeper):
Controller watches ZooKeeper for broker health; elects new leaders.
Unclean leader election:
unclean.leader.election.enable=false (default, recommended):
Only ISR can become leader → durability preserved.
=true: out-of-sync replica can become leader → possible data loss.
Use only when availability beats consistency.
"""
Partitioning Strategy
from confluent_kafka import Producer
"""
Partitioning determines which partition a message goes to:
No key: round-robin across partitions (even distribution, no ordering)
Key provided: murmur2(key) % num_partitions → same key always same partition
→ ordering guaranteed per key
→ watch for hot partitions if key distribution is skewed
Choosing partition count:
More partitions → more parallelism → higher throughput
But: each partition has overhead (file handles, replication traffic)
Rule of thumb: target_throughput / throughput_per_partition
Start with 3-6 per topic; scale when needed (adding partitions
changes key→partition mapping, can break ordering guarantees for
existing consumers)
Compacted topics:
log.cleanup.policy=compact
Kafka retains only the LATEST record per key (like a KV store)
Good for: database change data capture (CDC), materialized views
Old offsets: records with same key replaced, old ones garbage collected
"""
import zlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    """Priority-based routing: VIP messages always land on partition 0."""
    if key == b"vip":
        return 0
    # zlib.crc32 is stable across processes; Python's hash() is salted per run
    return zlib.crc32(key) % (num_partitions - 1) + 1  # Avoid partition 0 for non-VIP

producer = Producer({"bootstrap.servers": "kafka:9092"})

# confluent-kafka's "partitioner" config only accepts built-in strategies
# (e.g. "murmur2_random"), so custom routing is done per message via partition=
producer.produce("events", key=b"vip", value=b"...",
                 partition=choose_partition(b"vip", num_partitions=6))
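The compacted-topic behavior described above (log.cleanup.policy=compact) can be sketched as "latest value per key". This toy version also drops tombstones (None values) immediately, whereas real brokers retain them for delete.retention.ms before removal:

```python
def compact(records):
    """Toy log compaction: keep only the latest value per key.
    A None value is a tombstone that deletes the key."""
    latest = {}
    for key, value in records:
        latest[key] = value
    return {k: v for k, v in latest.items() if v is not None}

log = [("user-1", "v1"), ("user-2", "v1"), ("user-1", "v2"), ("user-2", None)]
assert compact(log) == {"user-1": "v2"}
```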
Exactly-Once Semantics
"""
Idempotent Producer (enable.idempotence=true):
Each message gets a sequence number per producer epoch.
Broker deduplicates retried messages with same sequence number.
Prevents duplicate writes from producer retries.
Cost: small overhead per message.
Transactional API (transactional.id=...):
Allows atomic writes across multiple partitions/topics.
Use case: read-process-write pipelines (consume, transform, produce)
→ either all writes committed or none.
2-phase commit: producer begins transaction, all brokers either
commit or abort together.
Cost: latency (100ms+ per transaction); use for critical paths only.
Kafka Streams exactly-once:
processing.guarantee=exactly_once_v2 (Kafka 2.5+)
Internally uses transactional API for read-process-write loops.
"""
from confluent_kafka import Consumer, Producer, TopicPartition

producer = Producer({
    "bootstrap.servers": "kafka:9092",
    "enable.idempotence": True,
    "transactional.id": "my-service-001",  # Unique per producer instance
    "acks": "all",
})

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "my-consumer-group",
    "isolation.level": "read_committed",  # Skip uncommitted tx records
    "enable.auto.commit": False,
})

# Exactly-once read-process-write loop
producer.init_transactions()

def process_batch(messages):
    producer.begin_transaction()
    try:
        for msg in messages:
            record = process(msg)  # process() is application-defined
            producer.produce("output-topic", key=record["key"],
                             value=record["value"])
        # Commit consumer offsets as part of the same transaction
        offsets = [TopicPartition(m.topic(), m.partition(), m.offset() + 1)
                   for m in messages]
        producer.send_offsets_to_transaction(
            offsets, consumer.consumer_group_metadata())
        producer.commit_transaction()
    except Exception:
        producer.abort_transaction()
        raise
Kafka Streams vs Flink vs Spark Streaming
| Feature | Kafka Streams | Apache Flink | Spark Structured Streaming |
|---|---|---|---|
| Deployment | Library (no cluster needed) | Separate cluster | Spark cluster |
| State management | RocksDB (local) | RocksDB + remote (configurable) | Spark RDD + checkpoints |
| Latency | Milliseconds | Milliseconds | 100ms+ (micro-batch) |
| Throughput | Medium | Very high | Very high |
| Exactly-once | Yes (Kafka-to-Kafka) | Yes (end-to-end) | Yes (with supported sinks) |
| SQL support | Limited (KSQL) | Flink SQL | Spark SQL |
| Best for | Simple stateful transforms, microservices | Complex event processing, low latency | Batch + streaming unified |
When to Use Kafka vs Alternatives
| Use Kafka when | Consider alternatives when |
|---|---|
| High throughput: millions of events/second | Simple job queues: SQS, RabbitMQ are simpler |
| Message replay needed (re-read history) | Short retention (hours): SQS with visibility timeout |
| Multiple independent consumer groups | Request/reply pattern: use gRPC or HTTP |
| Ordered processing per key | At-most-once IoT telemetry: MQTT is lighter weight |
| Event sourcing / audit log | Pub/sub with fan-out filters: SNS + SQS |
| Data pipeline with multiple sinks | Simple task queues: Celery/Redis Queue |
Companies That Ask This Question
- LinkedIn Engineering Interview Guide
- Databricks Interview Guide
- Airbnb Engineering Interview Guide
- Twitter/X Engineering Interview Guide
- Uber Engineering Interview Guide
Frequently Asked Questions
What guarantees does Kafka provide for message delivery?
Kafka offers three delivery semantics. At-most-once: producer fires and forgets, consumer commits before processing — messages can be lost. At-least-once (default): producer retries on failure, consumer commits after processing — duplicates possible. Exactly-once: requires idempotent producer (enable.idempotence=true) plus transactional API on producer, and read_committed isolation on consumer. Exactly-once across Kafka-to-Kafka pipelines is supported natively; Kafka-to-external-system exactly-once requires idempotent writes at the destination.
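The difference between at-most-once and at-least-once comes down to whether the consumer commits before or after processing. A toy model (not Kafka API code) of one record and a possible crash between the two steps makes the outcomes concrete:

```python
def deliveries(commit_before_processing, crash_between_steps):
    """Count how many times one record is processed, given commit ordering
    and a crash between the two steps. On restart, the consumer re-polls
    anything not yet committed."""
    processed = 0
    committed = False

    # First attempt
    if commit_before_processing:
        committed = True            # at-most-once: commit first
        if not crash_between_steps:
            processed += 1          # crash before processing → record lost
    else:
        processed += 1              # at-least-once: process first
        if not crash_between_steps:
            committed = True        # crash before commit → redelivery

    # Restart: uncommitted records are redelivered and processed again
    if not committed:
        processed += 1
        committed = True
    return processed

assert deliveries(commit_before_processing=True,  crash_between_steps=True)  == 0  # lost
assert deliveries(commit_before_processing=False, crash_between_steps=True)  == 2  # duplicate
assert deliveries(commit_before_processing=False, crash_between_steps=False) == 1
```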
How does Kafka achieve high throughput?
Kafka combines several I/O optimizations: (1) sequential disk writes — producers append to the end of a log file; sequential access avoids seeks and flows through the OS page cache, so it can rival random memory access even on spinning disks; (2) zero-copy transfers — sendfile() syscall moves data from page cache to socket without copying to user space; (3) batching — producer batches messages before sending, amortizing network RTT; (4) compression — batches are compressed (snappy/lz4/zstd) reducing network and disk I/O; (5) partition parallelism — multiple partitions allow parallel reads and writes across brokers.
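The batching and compression knobs translate to producer config like the following — a sketch with illustrative starting values and an assumed broker address:

```python
from confluent_kafka import Producer

high_throughput = Producer({
    "bootstrap.servers": "kafka:9092",
    "compression.type": "lz4",  # compress whole batches, not single messages
    "linger.ms": 20,            # wait up to 20 ms to fill a batch
    "batch.size": 131072,       # up to 128 KB per batch
})
```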
What is Kafka consumer group lag and why does it matter?
Consumer group lag is the difference between the latest offset produced to a partition and the committed offset of the consumer group. Lag = 0 means consumers are caught up; high lag means consumers are falling behind producers. Lag matters because it quantifies data freshness — high lag means downstream systems are processing stale data. Monitor lag per partition using kafka-consumer-groups.sh or Burrow. Remedies: increase consumer parallelism (add instances up to partition count), optimize consumer processing, or increase retention so consumers can catch up without data loss.
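The lag arithmetic itself is simple; a small helper makes the edge case of a never-committed group explicit. With confluent-kafka you would fetch the high watermark via `consumer.get_watermark_offsets(tp)` and the committed offset via `consumer.committed([tp])`; the values below are illustrative:

```python
def partition_lag(high_watermark, committed_offset):
    """Lag for one partition. The high watermark is the next offset to be
    produced; a committed offset of -1 means the group has never committed,
    so the entire partition counts as lag."""
    if committed_offset < 0:
        return high_watermark
    return high_watermark - committed_offset

assert partition_lag(1_000_000, 999_990) == 10
assert partition_lag(500, -1) == 500
```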