System Design Interview: Design a Distributed Message Queue (SQS / RabbitMQ)

What Is a Distributed Message Queue?

A message queue decouples producers (services that generate work) from consumers (services that process work). Producers publish messages without waiting for consumers to process them. Consumers pull messages at their own pace. This provides backpressure handling (consumers can fall behind without blocking producers), fault tolerance (messages persist even if consumers are down), and horizontal scalability (add consumers to increase throughput).

You’re designing something like Amazon SQS, RabbitMQ, or a simplified Kafka. The core difference from Kafka: a traditional message queue deletes messages after consumption (each message consumed by exactly one consumer), while Kafka retains messages as a log (multiple consumer groups can read independently).

    System Requirements

    Functional

    • Producers enqueue messages to named queues
    • Consumers dequeue messages (one consumer receives each message)
    • At-least-once delivery: no message lost, possible duplicate on failure
    • Visibility timeout: dequeued message is hidden from others for N seconds; if not acknowledged, becomes visible again
    • Dead letter queue: messages that fail repeatedly move to a DLQ
    • Message ordering: FIFO queues preserve order within a message group

    Non-Functional

    • High throughput: 100K messages/second per queue
    • Low latency: enqueue <10ms p99, dequeue <20ms p99
    • Durability: messages persisted to disk (replicated) before acknowledging to producer

    Core Data Model

    queues: id, name, visibility_timeout_sec, max_receive_count, dlq_id
    messages: id, queue_id, body, status (available/invisible/deleted), receive_count,
              enqueued_at, visible_at, receipt_handle
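
    The data model above can be sketched as SQLite DDL (column types, defaults, and the CHECK constraint are illustrative assumptions; a production system would use a distributed store):

```python
import sqlite3

# Minimal sketch of the data model, using SQLite as a stand-in store.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE queues (
    id                     TEXT PRIMARY KEY,
    name                   TEXT UNIQUE NOT NULL,
    visibility_timeout_sec INTEGER NOT NULL DEFAULT 30,
    max_receive_count      INTEGER NOT NULL DEFAULT 5,
    dlq_id                 TEXT REFERENCES queues(id)
);
CREATE TABLE messages (
    id             TEXT PRIMARY KEY,
    queue_id       TEXT NOT NULL REFERENCES queues(id),
    body           TEXT NOT NULL,
    status         TEXT NOT NULL CHECK (status IN ('available','invisible','deleted')),
    receive_count  INTEGER NOT NULL DEFAULT 0,
    enqueued_at    REAL NOT NULL,
    visible_at     REAL NOT NULL,
    receipt_handle TEXT
);
-- Composite index that makes the dequeue scan (status + visible_at) efficient.
CREATE INDEX idx_dequeue ON messages (queue_id, status, visible_at);
""")
```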
    

    Enqueue (Producer)

    1. Producer sends message to broker via HTTP or gRPC
    2. Broker assigns message_id (UUID), sets status=available, visible_at=now
    3. Broker writes to WAL (Write-Ahead Log) on disk — durability before acknowledgment
    4. Broker replicates to N-1 follower nodes (synchronous replication for durability)
    5. Broker acknowledges to producer with message_id
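
    The five enqueue steps can be sketched in a few lines of Python (an in-memory, single-node stand-in: the `wal` list represents the durable write-ahead log, and follower replication is omitted; all names are illustrative):

```python
import json
import time
import uuid

class Broker:
    """In-memory sketch of the enqueue path; not a real broker."""

    def __init__(self):
        self.wal = []        # stand-in for the on-disk, replicated WAL
        self.messages = {}   # message_id -> record

    def enqueue(self, queue_id, body):
        msg = {
            "id": str(uuid.uuid4()),       # step 2: assign message_id
            "queue_id": queue_id,
            "body": body,
            "status": "available",
            "receive_count": 0,
            "enqueued_at": time.time(),
            "visible_at": time.time(),     # visible immediately
        }
        self.wal.append(json.dumps(msg))   # step 3: WAL write before ack
        # step 4 (synchronous replication to followers) omitted in this sketch
        self.messages[msg["id"]] = msg
        return msg["id"]                   # step 5: ack with message_id
```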

    Dequeue (Consumer)

    1. Consumer sends dequeue request (with max_messages count, e.g., 10)
    2. Broker selects up to 10 messages with status=available AND visible_at <= now
    3. Broker sets status=invisible, visible_at=now + visibility_timeout, generates receipt_handle
    4. Broker returns messages with their receipt handles
    5. Consumer processes messages, sends Acknowledge(receipt_handle) for each
    6. Broker marks message as deleted

    If consumer crashes without acknowledging: when visibility_timeout expires, the message becomes visible again and a different consumer picks it up. This is the at-least-once delivery guarantee — duplicates are possible on failure.
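
    The dequeue and acknowledge steps can be sketched over an in-memory store (a dict mapping message_id to record; field names follow the data model, everything else is illustrative):

```python
import time
import uuid

def dequeue(store, queue_id, max_messages=10, visibility_timeout=30):
    """Pick up to max_messages available messages and hide them (steps 2-4)."""
    now = time.time()
    picked = []
    for msg in store.values():
        if len(picked) >= max_messages:
            break
        if (msg["queue_id"] == queue_id
                and msg["status"] == "available"
                and msg["visible_at"] <= now):
            msg["status"] = "invisible"                   # hide from others
            msg["visible_at"] = now + visibility_timeout  # redelivery deadline
            msg["receive_count"] += 1
            msg["receipt_handle"] = str(uuid.uuid4())     # one handle per delivery
            picked.append(msg)
    return picked

def acknowledge(store, receipt_handle):
    """Step 6: delete only if the handle matches the current delivery."""
    for msg in store.values():
        if msg.get("receipt_handle") == receipt_handle and msg["status"] == "invisible":
            msg["status"] = "deleted"
            return True
    return False  # stale handle: message was already redelivered or deleted
```

    Acknowledging by receipt handle rather than message ID matters: if the visibility timeout expired and the message was redelivered, the old handle no longer matches, so a slow consumer cannot delete a message another consumer is now processing.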

    Visibility Timeout and Redelivery

    The visibility_timeout is critical. If too short: slow-processing consumers cause redeliveries (duplicates). If too long: failed messages stay hidden too long, causing delays. Best practice: set visibility_timeout to 3-6x the expected processing time. Consumers that need more time can extend their visibility lease via an ExtendVisibility API call.

    Dead letter queue: after max_receive_count deliveries (e.g., 5), the message is moved to the DLQ instead of becoming visible again. DLQ messages can be inspected for debugging and replayed after fixing the bug.
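
    The redeliver-or-DLQ decision on visibility expiry can be sketched as (field names follow the data model; the function and the "dlq" status are illustrative):

```python
def on_visibility_expiry(msg, max_receive_count=5):
    """When a message's visibility timeout expires unacknowledged, either
    make it visible again or route it to the dead letter queue."""
    if msg["receive_count"] >= max_receive_count:
        msg["status"] = "dlq"  # in practice: re-enqueue onto the queue's dlq_id
    else:
        msg["status"] = "available"  # becomes visible; another consumer retries
    return msg["status"]
```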

    Storage Layer

    Two options depending on scale:

    Database-backed (SQS approach)

    Store messages in a distributed database (DynamoDB, Cassandra). Simple, proven. Query for available messages: WHERE status='available' AND visible_at <= now LIMIT 10. Use optimistic locking or conditional writes to atomically claim messages. Index on (queue_id, status, visible_at) for efficient dequeue queries. Works well up to millions of messages in flight.
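
    The atomic-claim pattern can be sketched with a conditional UPDATE: the write succeeds only if the row is still available, so two competing consumers can never claim the same message (SQLite stands in for DynamoDB/Cassandra conditional writes; the schema here is trimmed for illustration):

```python
import sqlite3
import time
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id TEXT PRIMARY KEY, status TEXT,"
             " visible_at REAL, receipt_handle TEXT)")
conn.execute("INSERT INTO messages VALUES ('m1', 'available', 0, NULL)")

def try_claim(message_id, visibility_timeout=30):
    """Conditional write: claim the message only if it is still available."""
    handle = str(uuid.uuid4())
    cur = conn.execute(
        "UPDATE messages SET status='invisible', visible_at=?, receipt_handle=? "
        "WHERE id=? AND status='available' AND visible_at <= ?",
        (time.time() + visibility_timeout, handle, message_id, time.time()),
    )
    return handle if cur.rowcount == 1 else None  # None: lost the race

first = try_claim("m1")   # succeeds: row was available
second = try_claim("m1")  # fails: the WHERE clause no longer matches
```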

    Log-based (Kafka-style)

    Append messages to an ordered log per queue. Track consumer offset (next message index to deliver). No deletion — mark consumed via offset advancement. More efficient (sequential writes), but tracking per-message visibility state is complex. Better for high-throughput, ordered scenarios.

    Ensuring FIFO Ordering

    Basic queues don’t guarantee order (multiple consumers, concurrent dequeues). For FIFO:

    • Single-consumer constraint: only one consumer processes a given message group at a time
    • Message group ID: producers tag messages with a group_id. Messages with the same group_id are locked: only one consumer can have an in-flight message from a group at once. Others wait.
    • Implementation: group_id → consumer_id lock in Redis with TTL = visibility_timeout. Released on acknowledgment or timeout.
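
    The group-lock idea above can be sketched by mimicking Redis SET-with-NX-and-TTL semantics with an in-memory dict (a real implementation would issue the equivalent atomic SET against Redis; all names here are illustrative):

```python
import time

locks = {}  # group_id -> (consumer_id, expires_at); stand-in for Redis keys

def acquire_group(group_id, consumer_id, ttl=30):
    """Lock a message group for one consumer, with a lease of ttl seconds."""
    now = time.time()
    holder = locks.get(group_id)
    if holder is None or holder[1] <= now:      # free, or previous lease expired
        locks[group_id] = (consumer_id, now + ttl)
        return True
    return holder[0] == consumer_id             # re-entrant for the holder

def release_group(group_id, consumer_id):
    """Called on acknowledgment; timeout expiry releases implicitly via TTL."""
    if locks.get(group_id, (None,))[0] == consumer_id:
        del locks[group_id]
```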

    Scaling

    • Partitioning: split a high-volume queue across multiple broker partitions. Producer hashes message key to partition. Consumers assigned to partitions. Ordering guaranteed within a partition, not across partitions.
    • Consumer scaling: add consumers to process faster. Automatically distributed across partitions via a group coordinator (similar to Kafka’s consumer group protocol).
    • Broker scaling: add broker nodes. Reassign partitions to new brokers. Leader election via Raft or ZooKeeper.
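
    Producer-side partition selection can be sketched as a stable hash of the message key, so all messages with the same key land on (and stay ordered within) one partition:

```python
import hashlib

def partition_for(message_key: str, num_partitions: int) -> int:
    """Map a message key to a partition. MD5 keeps the mapping stable
    across processes and restarts, unlike Python's salted built-in hash()."""
    digest = hashlib.md5(message_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

    Note the trade-off baked into the modulo: changing num_partitions remaps most keys, which is why partition counts are usually fixed up front or changed only with a coordinated migration.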

    Interview Tips

    • The visibility timeout mechanism is the key idea — explain it carefully. It’s what enables at-least-once delivery without a central lock.
    • Distinguish traditional queue (each message consumed once, then deleted) from Kafka log (messages retained, multiple consumer groups). Know which you’re designing.
    • Dead letter queue is a real production concern — mention max_receive_count and DLQ routing to show operational maturity.
    • FIFO queues: the group_id locking pattern is the right answer — prevents two consumers from processing group messages out of order.

    FAQ

    What is a visibility timeout and how does it enable at-least-once delivery?

    When a consumer dequeues a message, the message is hidden from other consumers for visibility_timeout seconds, but it is not yet deleted. The consumer must explicitly acknowledge (delete) the message after processing. If the consumer crashes, times out, or fails without acknowledging, the message becomes visible again when visibility_timeout expires, and another consumer can pick it up. This guarantees at-least-once delivery: the message is processed at least once even if consumers fail. The trade-off: if a consumer processes a message successfully but crashes before acknowledging, the message is redelivered and processed twice. Consumers must be idempotent (safe to process the same message twice) or use idempotency keys to detect and skip duplicates.

    What is a dead letter queue and when should you use it?

    A dead letter queue (DLQ) receives messages that repeatedly fail processing. Configure max_receive_count (e.g., 5): if a message is dequeued and not acknowledged 5 times, it moves to the DLQ instead of becoming visible again. Why use a DLQ: (1) it prevents poison-pill messages, since a message that always crashes the consumer would otherwise loop forever, blocking the queue with repeated failures; (2) debugging, since DLQ messages can be inspected to understand why they failed; (3) replayability, since after fixing the bug, DLQ messages can be moved back to the source queue for reprocessing. Without a DLQ, one bad message can permanently degrade a queue's throughput. Set up monitoring alerts on DLQ depth: messages in the DLQ indicate processing failures that need investigation.

    What is the difference between a message queue (SQS) and an event streaming platform (Kafka)?

    Message queue (SQS, RabbitMQ): each message is consumed by exactly one consumer (competing-consumers model) and deleted after successful acknowledgment. There is no ordering guarantee across messages (FIFO queues maintain order only within message groups). Best for task distribution and work queues with multiple independent workers. Event streaming (Kafka): messages are retained in a log for a configurable period (days or weeks). Multiple consumer groups can each read all messages independently at their own offsets, with total ordering within a partition. Best for event sourcing, audit logs, multiple services each needing their own view of all events, stream processing pipelines, and replay or reprocessing. Rule of thumb: if each task should be done once by one worker, use SQS; if every event should be seen by multiple independent subscribers, use Kafka.
