Exactly-once delivery means that every message is processed precisely once: never lost (the failure mode of at-most-once) and never delivered multiple times (the failure mode of at-least-once). True exactly-once is difficult and expensive in distributed systems: the producer, broker, and consumer must all coordinate to prevent duplicates while also preventing message loss. Understanding the spectrum of delivery guarantees is critical for designing reliable data pipelines.
The Three Delivery Semantics
At-most-once: messages may be lost but are never duplicated. Simplest model: fire and forget. Use when message loss is acceptable (UDP metrics, non-critical notifications).
At-least-once: messages are never lost but may be duplicated. The producer retries until acknowledged, and the broker redelivers until the consumer commits, so duplicates must be handled by idempotent consumers. Most messaging systems default to at-least-once.
Exactly-once: no loss, no duplicates. The hardest guarantee, achieved by combining idempotent producers, transactional APIs, and idempotent consumers.
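The difference between the first two semantics can be seen in a toy lossy-channel simulation (a sketch; the `deliver` function and its parameters are invented for illustration): with no retries you get zero or one copies, while retry-until-acked guarantees at least one copy but can produce duplicates when an acknowledgment is lost.

```python
import random

def deliver(drop_rate, retries):
    """Toy lossy channel: each send (and each ack) succeeds with
    probability 1 - drop_rate. Returns how many copies the receiver gets."""
    received = 0
    for _ in range(retries):
        if random.random() >= drop_rate:      # message arrived
            received += 1
            if random.random() >= drop_rate:  # ack arrived: sender stops
                break
            # ack lost: an at-least-once sender retries, duplicating the message
    return received

random.seed(0)
# At-most-once: one attempt, never more than one copy, possibly zero.
assert deliver(drop_rate=0.5, retries=1) in (0, 1)
# At-least-once: enough retries make loss vanishingly unlikely,
# but lost acks cause duplicate deliveries.
copies = [deliver(drop_rate=0.5, retries=50) for _ in range(200)]
assert all(c >= 1 for c in copies)
assert any(c > 1 for c in copies)
```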
Kafka Exactly-Once
Kafka achieves exactly-once through two mechanisms: idempotent producers and transactional APIs.
Idempotent producers: each producer is assigned a producer ID (PID). Each message sent by the producer has a monotonically increasing sequence number per partition. The broker deduplicates based on (PID, partition, sequence_number): if it receives the same sequence number twice (due to producer retry), it acknowledges without writing again. This eliminates duplicate writes from producer retries, providing exactly-once producer semantics without application changes.
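The broker-side dedup check can be sketched as a toy model (class and method names are illustrative, not Kafka internals; the real broker tracks a small window of recent sequence numbers per producer):

```python
class DedupBroker:
    """Toy model of idempotent-producer dedup by (PID, partition, sequence)."""

    def __init__(self):
        self.log = {}        # partition -> appended messages
        self.last_seq = {}   # (pid, partition) -> highest sequence accepted

    def produce(self, pid, partition, seq, msg):
        key = (pid, partition)
        if seq <= self.last_seq.get(key, -1):
            return "ack-duplicate"   # already written: ack without appending
        self.last_seq[key] = seq
        self.log.setdefault(partition, []).append(msg)
        return "ack"

broker = DedupBroker()
broker.produce(pid=7, partition=0, seq=0, msg="order-created")
# The producer times out waiting for the ack and retries the same batch:
status = broker.produce(pid=7, partition=0, seq=0, msg="order-created")
assert status == "ack-duplicate"
assert broker.log[0] == ["order-created"]   # written exactly once
```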
Transactional API: a producer can write atomically to multiple partitions: producer.beginTransaction(); producer.send(topicA, …); producer.send(topicB, …); producer.commitTransaction(). Either all writes commit or none do. The transaction coordinator (a module running on a Kafka broker) manages the two-phase commit across partitions. Consumers configured with isolation.level=read_committed see only messages from committed transactions.
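The visibility rule can be illustrated with a toy model (a sketch only: the real protocol uses a transaction coordinator, transaction markers in the log, and aborted-transaction indexes, none of which are modeled here). Writes are tagged with a transaction ID, and a read_committed reader filters out anything whose transaction has not committed:

```python
import itertools

class TxnBroker:
    """Toy model of atomic multi-partition writes with read_committed reads."""

    def __init__(self):
        self.partitions = {}    # partition name -> list of (txn_id, msg)
        self.committed = set()  # txn ids whose writes are visible
        self._ids = itertools.count()

    def begin(self):
        return next(self._ids)

    def send(self, txn_id, partition, msg):
        self.partitions.setdefault(partition, []).append((txn_id, msg))

    def commit(self, txn_id):
        self.committed.add(txn_id)  # makes all of the txn's writes visible at once

    def read_committed(self, partition):
        return [m for t, m in self.partitions.get(partition, [])
                if t in self.committed]

broker = TxnBroker()
t1 = broker.begin()
broker.send(t1, "topicA", "debit")
broker.send(t1, "topicB", "credit")
# Before commit, a read_committed consumer sees neither write:
assert broker.read_committed("topicA") == []
broker.commit(t1)
# After commit, both writes become visible together:
assert broker.read_committed("topicA") == ["debit"]
assert broker.read_committed("topicB") == ["credit"]
```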
End-to-End Exactly-Once in Stream Processing
For stream processing (Kafka Streams, Flink), exactly-once requires: (1) exactly-once reads from the input topic (consumer commits offset only after processing completes), (2) exactly-once writes to the output topic (transactional producer), and (3) atomic commit of the consumer offset and the producer transaction. Kafka Streams achieves this by committing consumer offsets as part of the producer transaction — both succeed or both fail. Flink checkpoints save state and consumer offsets together; on failure, both are restored to the last checkpoint, replaying messages from that offset without duplicating output if the sink is idempotent or transactional.
Exactly-Once vs. Idempotent Processing
Exactly-once delivery at the infrastructure layer is expensive. An alternative: use at-least-once delivery with idempotent consumers. If processing a message twice produces the same result as processing it once, duplicates are harmless. Design consumers to be naturally idempotent: use UPSERT instead of INSERT (INSERT … ON CONFLICT DO NOTHING), use event_id as a deduplication key, or make processing a pure function with no side effects beyond a database write. In most practical systems, idempotent consumers + at-least-once delivery is simpler and more cost-effective than true exactly-once infrastructure.
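An idempotent handler along these lines, keyed on event_id, might look like the following sketch (the schema is invented; the point is that a redelivered message conflicts on the primary key and writes nothing):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE ledger (event_id TEXT PRIMARY KEY, account TEXT, delta INTEGER)"
)

def handle(event_id, account, delta):
    # INSERT ... ON CONFLICT DO NOTHING makes the handler idempotent:
    # a duplicate delivery hits the event_id primary key and is a no-op.
    db.execute(
        "INSERT INTO ledger VALUES (?, ?, ?) ON CONFLICT(event_id) DO NOTHING",
        (event_id, account, delta),
    )
    db.commit()

handle("evt-42", "alice", 100)
handle("evt-42", "alice", 100)   # at-least-once redelivery of the same event
rows = db.execute("SELECT count(*), sum(delta) FROM ledger").fetchone()
assert rows == (1, 100)          # applied once despite two deliveries
```

With this in place, plain at-least-once delivery gives end-to-end exactly-once *effects*, which is usually what the business logic actually needs.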
Cost of Exactly-Once
Kafka’s exactly-once has real performance costs: transactional produce has ~2x latency vs. non-transactional (extra coordination round trips), throughput decreases because the transaction coordinator is a bottleneck, and idempotent producer sequence tracking uses memory proportional to the number of active partitions. Benchmark: Confluent reports ~20-30% throughput reduction for exactly-once vs. at-least-once at similar latency targets. Evaluate whether the use case truly requires exactly-once or whether idempotent processing at the consumer layer is sufficient — the latter is almost always cheaper.