System Design: Message Queues — RabbitMQ vs SQS vs Kafka, Dead Letter Queues, Exactly-Once, Ordering

Message queues are the backbone of asynchronous communication in distributed systems. Choosing the right message queue and understanding its guarantees (ordering, delivery, durability) is critical for system design. This guide compares RabbitMQ, Amazon SQS, and Apache Kafka, and covers production patterns like dead letter queues, exactly-once processing, and idempotent consumers — essential knowledge for system design interviews.

RabbitMQ: Traditional Message Broker

RabbitMQ is an AMQP-based message broker designed for traditional message queuing.
Architecture: producers send messages to exchanges, which route messages to queues based on routing rules (direct, topic, fanout, headers). Consumers subscribe to queues and receive messages.
Delivery model: push-based — RabbitMQ pushes messages to consumers.
Consumer acknowledgment: after processing a message, the consumer sends an ACK. If no ACK is received (e.g., the consumer crashes), RabbitMQ redelivers the message to another consumer.
Ordering: messages within a single queue are delivered in FIFO order to a single consumer. With multiple consumers on one queue, ordering is not guaranteed (messages are distributed round-robin).
Durability: messages can be persisted to disk (durable queues plus persistent messages).
Performance: RabbitMQ handles roughly 10,000-50,000 messages per second per node; clustering adds capacity horizontally.
Use cases: task queues (background job processing), RPC (request-reply pattern), and complex routing workflows (exchange routing rules).
Not ideal for: event streaming (messages are deleted after consumption) and high-throughput event processing (Kafka's append-only log is faster).
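To make the exchange routing model concrete, here is a minimal sketch of topic-exchange binding-key matching in pure Python (not against a live broker): `*` matches exactly one dot-separated word, `#` matches zero or more. The function name and examples are illustrative, not part of any RabbitMQ API.

```python
def topic_match(binding_key: str, routing_key: str) -> bool:
    """Match an AMQP topic binding key against a message routing key.
    '*' matches exactly one word; '#' matches zero or more words."""
    def match(pattern, words):
        if not pattern:
            return not words
        head, rest = pattern[0], pattern[1:]
        if head == "#":
            # '#' may consume zero or more of the remaining words
            return any(match(rest, words[i:]) for i in range(len(words) + 1))
        if words and (head == "*" or head == words[0]):
            return match(rest, words[1:])
        return False
    return match(binding_key.split("."), routing_key.split("."))

# A direct exchange is the special case of an exact key match;
# a fanout exchange ignores the routing key entirely.
print(topic_match("orders.*.created", "orders.eu.created"))   # True
print(topic_match("orders.#", "orders.eu.created.v2"))        # True
print(topic_match("orders.*", "orders.eu.created"))           # False
```

A queue bound with `orders.#` receives every order event regardless of sub-topic, which is how fan-in routing patterns are built on a topic exchange.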

Amazon SQS: Managed Queue Service

Amazon SQS is a fully managed message queue that requires no infrastructure management. Two queue types:
(1) Standard queue — nearly unlimited throughput, at-least-once delivery (messages may be delivered more than once), best-effort ordering (messages may arrive out of order).
(2) FIFO queue — exactly-once processing (deduplication within a 5-minute window), strict ordering within a message group (messages with the same MessageGroupId are ordered), throughput limited to 300 messages per second without batching, or 3,000 with batching.
Key features: automatic scaling (SQS scales transparently with message volume), message visibility timeout (after a consumer receives a message, it is invisible to other consumers for a configurable period; if not deleted in time, it becomes visible again for reprocessing), dead letter queue (messages that fail processing N times are moved to a DLQ for investigation), and long polling (the consumer waits up to 20 seconds for a message, reducing empty responses and API costs).
Use cases: decoupling microservices (fire-and-forget), task queues (Lambda triggered by SQS), and any workload where managed infrastructure is preferred over operating RabbitMQ or Kafka clusters. SQS is the default choice on AWS when you need a simple queue without event streaming.
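The visibility-timeout mechanics can be sketched with a small in-memory model. `MiniQueue` is a made-up illustration of the semantics, not the AWS SDK: a received message is hidden for the timeout window, and an undeleted message reappears, which is exactly why Standard queues are at-least-once.

```python
import time
import itertools

class MiniQueue:
    """In-memory sketch of SQS-style receive semantics: a received message
    becomes invisible for `visibility_timeout` seconds; if it is not
    deleted before the timeout expires, it becomes receivable again."""
    def __init__(self, visibility_timeout=30.0):
        self.visibility_timeout = visibility_timeout
        self._messages = {}            # id -> [body, visible_at, receive_count]
        self._ids = itertools.count()

    def send(self, body):
        msg_id = next(self._ids)
        self._messages[msg_id] = [body, 0.0, 0]
        return msg_id

    def receive(self, now=None):
        now = time.monotonic() if now is None else now
        for msg_id, rec in self._messages.items():
            if rec[1] <= now:                          # message is visible
                rec[1] = now + self.visibility_timeout  # hide it again
                rec[2] += 1
                return msg_id, rec[0], rec[2]
        return None  # no visible messages

    def delete(self, msg_id):
        self._messages.pop(msg_id, None)

q = MiniQueue(visibility_timeout=30.0)
q.send("resize-image-42")
msg_id, body, count = q.receive(now=0.0)   # first delivery
assert q.receive(now=10.0) is None         # invisible during the timeout
msg_id, body, count = q.receive(now=31.0)  # timeout expired: redelivered
print(count)  # 2: at-least-once delivery in action
q.delete(msg_id)                           # processing done, remove for good
```

The consumer must call delete within the visibility window; a consumer that crashes mid-processing simply lets the timeout expire, and another worker picks the message up.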

Apache Kafka: Event Streaming Platform

Kafka is fundamentally different from RabbitMQ and SQS: it is a distributed, durable, append-only log. Key differences:
(1) Messages are not deleted after consumption — they are retained for a configurable period (7 days by default) or indefinitely. Multiple consumer groups can read the same topic independently, each at its own offset.
(2) Pull-based consumption — consumers pull messages at their own pace, enabling backpressure (a slow consumer does not affect others).
(3) Partitioned topics — a topic is split into partitions, each an ordered log. Producers assign messages to partitions by key (hash) or round-robin. Each partition is consumed by exactly one consumer per consumer group, which enables parallel consumption: 12 partitions with 4 consumers = 3 partitions per consumer.
(4) Throughput — Kafka handles millions of messages per second on a modest cluster; the append-only, sequential I/O design uses disk bandwidth efficiently.
Use cases: event streaming (event-driven architecture), log aggregation, change data capture (CDC), and real-time data pipelines. Kafka is the default for event-driven systems where message replay, multiple consumers, and high throughput are required.
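The partitioning arithmetic can be sketched in a few lines. Kafka's default partitioner actually hashes keys with murmur2; `md5` stands in here purely for illustration, and `assign` is a simplified round-robin-style assignment, not Kafka's real rebalance protocol.

```python
import hashlib

NUM_PARTITIONS = 12

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Hash the message key to a partition, so all messages with the
    same key land on the same partition and therefore stay ordered.
    (Kafka's default partitioner uses murmur2; md5 stands in here.)"""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Same key -> same partition, every time: per-key ordering is preserved.
assert partition_for("user-123") == partition_for("user-123")

def assign(partitions: int, consumers: list) -> dict:
    """Spread partitions across the consumers of one group:
    each partition goes to exactly one consumer."""
    return {c: [p for p in range(partitions) if p % len(consumers) == i]
            for i, c in enumerate(consumers)}

assignment = assign(12, ["c0", "c1", "c2", "c3"])
print({c: len(ps) for c, ps in assignment.items()})
# {'c0': 3, 'c1': 3, 'c2': 3, 'c3': 3}: 12 partitions / 4 consumers = 3 each
```

Adding a fifth consumer would trigger a rebalance and shrink each consumer's share; adding a thirteenth would leave one consumer idle, which is why partition count caps consumer-group parallelism.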

Dead Letter Queues and Error Handling

A dead letter queue (DLQ) captures messages that cannot be processed after multiple attempts. Without a DLQ, a poison message (one that always fails processing) blocks the queue: the consumer receives it, fails, the message is redelivered, the consumer fails again — an infinite loop.
DLQ configuration: after N failed processing attempts (maxReceiveCount in SQS; in RabbitMQ, a delivery limit on quorum queues or application logic reading the x-death header), the message is moved to the DLQ, a separate queue where failed messages are stored for inspection.
Operational workflow: (1) Monitor the DLQ — alert when messages appear. (2) Inspect failed messages — examine the message content and the processing error. (3) Fix the consumer bug or data issue. (4) Replay messages from the DLQ back to the original queue for reprocessing.
SQS: configure a redrive policy with maxReceiveCount and deadLetterTargetArn. RabbitMQ: use the x-dead-letter-exchange and x-dead-letter-routing-key queue arguments. Kafka has no built-in DLQ support; implement it at the application level: when a consumer fails to process a message after retries, publish it to a dead-letter topic.
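The application-level pattern for Kafka can be sketched as follows. This is a hypothetical helper, not a library API: `process` is the business handler, and `publish_dlq` stands in for a producer writing to a dead-letter topic.

```python
def consume_with_dlq(message, process, publish_dlq, max_attempts=3):
    """Application-level dead-lettering for a log-based system like Kafka:
    retry processing up to max_attempts, then route the poison message
    to a dead-letter topic instead of blocking the partition."""
    for attempt in range(1, max_attempts + 1):
        try:
            process(message)
            return "processed"
        except Exception as exc:
            last_error = exc  # remember why the last attempt failed
    # All attempts exhausted: capture payload + error context for inspection.
    publish_dlq({"payload": message,
                 "error": str(last_error),
                 "attempts": max_attempts})
    return "dead-lettered"

dead_letters = []  # stands in for a producer to the dead-letter topic

def always_fails(msg):
    raise ValueError("cannot parse payload")

status = consume_with_dlq("corrupt-event", always_fails, dead_letters.append)
print(status, len(dead_letters))  # dead-lettered 1
```

Recording the error message and attempt count alongside the payload makes step (2) of the operational workflow (inspecting failed messages) much easier.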

Exactly-Once Processing

Message delivery guarantees:
(1) At-most-once — the message is delivered zero or one times. No retries. If the consumer crashes before processing, the message is lost.
(2) At-least-once — the message is delivered one or more times. The broker retries delivery until acknowledgment. If the consumer processes a message but crashes before ACKing, the message is redelivered — potentially processing it twice.
(3) Exactly-once — the message is processed exactly one time. This is the hardest guarantee to achieve in distributed systems: true exactly-once delivery is impossible in the general case (the Two Generals Problem). Practical exactly-once processing combines at-least-once delivery with idempotent consumers.
Idempotent consumer pattern: assign each message a unique ID (message_id or idempotency_key). Before processing, check whether this ID exists in a processed_messages table. If it does, skip the message (already processed). If not, process it and insert the ID in the same transaction as the business operation. This ensures that redelivered messages are safely ignored.
Kafka 0.11+ supports exactly-once semantics within Kafka (producer idempotency + transactional consumers), but end-to-end exactly-once (Kafka to an external database) still requires idempotent consumers.
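The idempotent consumer pattern above can be sketched with SQLite, where the dedup insert and the business update commit in one transaction. Table and message names are illustrative.

```python
import sqlite3

# One connection doubles as the "business" DB and the dedup store,
# so the idempotency check commits atomically with the side effect.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processed_messages (message_id TEXT PRIMARY KEY)")
db.execute("CREATE TABLE account_balance (amount INTEGER)")
db.execute("INSERT INTO account_balance VALUES (0)")

def handle(message_id: str, amount: int) -> bool:
    """Process an at-least-once-delivered payment message; returns False
    if this message_id was already processed (duplicate delivery)."""
    try:
        with db:  # single transaction: dedup insert + business update
            db.execute("INSERT INTO processed_messages VALUES (?)",
                       (message_id,))
            db.execute("UPDATE account_balance SET amount = amount + ?",
                       (amount,))
        return True
    except sqlite3.IntegrityError:
        return False  # PRIMARY KEY violation: already processed, skip

assert handle("msg-1", 100) is True    # first delivery: applied
assert handle("msg-1", 100) is False   # redelivery: safely ignored
balance = db.execute("SELECT amount FROM account_balance").fetchone()[0]
print(balance)  # 100, not 200
```

Because the ID insert and the balance update share one transaction, a crash between them rolls both back, and the redelivered message is processed cleanly from scratch.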

Choosing the Right Message Queue

Decision framework:
(1) Simple task queue with managed infrastructure — Amazon SQS (Standard for most cases, FIFO when ordering matters). No operational overhead.
(2) Complex routing requirements (topic-based routing, fanout, RPC) — RabbitMQ. Its flexible exchange-queue routing model handles complex message patterns.
(3) Event streaming, multiple consumers, event replay — Apache Kafka. The append-only log model supports event sourcing, CDC, and high-throughput streaming.
(4) Real-time messaging with low latency — Redis Streams or NATS. Lighter weight than Kafka for simpler streaming use cases.
(5) Multi-cloud or on-premise deployment with Kafka compatibility — Apache Pulsar. It combines Kafka-like streaming with traditional queue features and multi-tenancy.
In system design interviews, name the queue choice and justify it: “I chose Kafka because multiple services need to independently consume the same events, and we need event replay for rebuilding read models.” This shows deeper understanding than “I chose Kafka because it is popular.”
