System Design Interview: Design a Distributed Message Queue (SQS / RabbitMQ)

What Is a Distributed Message Queue?

A message queue decouples producers (services that generate work) from consumers (services that process work). Producers publish messages without waiting for consumers to process them. Consumers pull messages at their own pace. This provides backpressure handling (consumers can fall behind without blocking producers), fault tolerance (messages persist even if consumers are down), and horizontal scalability (add consumers to increase throughput).

You’re designing something like Amazon SQS, RabbitMQ, or a simplified Kafka. The core difference from Kafka: a traditional message queue deletes messages after consumption (each message consumed by exactly one consumer), while Kafka retains messages as a log (multiple consumer groups can read independently).

    System Requirements

    Functional

    • Producers enqueue messages to named queues
    • Consumers dequeue messages (one consumer receives each message)
    • At-least-once delivery: no message lost, possible duplicate on failure
    • Visibility timeout: dequeued message is hidden from others for N seconds; if not acknowledged, becomes visible again
    • Dead letter queue: messages that fail repeatedly move to a DLQ
    • Message ordering: FIFO queues preserve order within a message group

    Non-Functional

    • High throughput: 100K messages/second per queue
    • Low latency: enqueue <10ms p99, dequeue <20ms p99
    • Durability: messages persisted to disk (replicated) before acknowledging to producer

    Core Data Model

    queues: id, name, visibility_timeout_sec, max_receive_count, dlq_id
    messages: id, queue_id, body, status (available/invisible/deleted), receive_count,
              enqueued_at, visible_at, receipt_handle
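
    The data model above can be sketched as SQLite DDL (column types, defaults, and the CHECK constraint are illustrative assumptions; a production system would use a distributed store):

```python
import sqlite3

# Minimal sketch of the data model, using SQLite as a stand-in store.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE queues (
    id                     TEXT PRIMARY KEY,
    name                   TEXT UNIQUE NOT NULL,
    visibility_timeout_sec INTEGER NOT NULL DEFAULT 30,
    max_receive_count      INTEGER NOT NULL DEFAULT 5,
    dlq_id                 TEXT REFERENCES queues(id)
);
CREATE TABLE messages (
    id             TEXT PRIMARY KEY,
    queue_id       TEXT NOT NULL REFERENCES queues(id),
    body           TEXT NOT NULL,
    status         TEXT NOT NULL CHECK (status IN ('available','invisible','deleted')),
    receive_count  INTEGER NOT NULL DEFAULT 0,
    enqueued_at    REAL NOT NULL,
    visible_at     REAL NOT NULL,
    receipt_handle TEXT
);
-- Composite index that makes the dequeue scan (status + visible_at) efficient.
CREATE INDEX idx_dequeue ON messages (queue_id, status, visible_at);
""")
```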
    

    Enqueue (Producer)

    1. Producer sends message to broker via HTTP or gRPC
    2. Broker assigns message_id (UUID), sets status=available, visible_at=now
    3. Broker writes to WAL (Write-Ahead Log) on disk — durability before acknowledgment
    4. Broker replicates to N-1 follower nodes (synchronous replication for durability)
    5. Broker acknowledges to producer with message_id
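
    The five enqueue steps can be sketched in a few lines of Python (an in-memory, single-node stand-in: the `wal` list represents the durable write-ahead log, and follower replication is omitted; all names are illustrative):

```python
import json
import time
import uuid

class Broker:
    """In-memory sketch of the enqueue path; not a real broker."""

    def __init__(self):
        self.wal = []        # stand-in for the on-disk, replicated WAL
        self.messages = {}   # message_id -> record

    def enqueue(self, queue_id, body):
        msg = {
            "id": str(uuid.uuid4()),       # step 2: assign message_id
            "queue_id": queue_id,
            "body": body,
            "status": "available",
            "receive_count": 0,
            "enqueued_at": time.time(),
            "visible_at": time.time(),     # visible immediately
        }
        self.wal.append(json.dumps(msg))   # step 3: WAL write before ack
        # step 4 (synchronous replication to followers) omitted in this sketch
        self.messages[msg["id"]] = msg
        return msg["id"]                   # step 5: ack with message_id
```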

    Dequeue (Consumer)

    1. Consumer sends dequeue request (with max_messages count, e.g., 10)
    2. Broker selects up to 10 messages with status=available AND visible_at <= now
    3. Broker sets status=invisible, visible_at=now + visibility_timeout, generates receipt_handle
    4. Broker returns messages with their receipt handles
    5. Consumer processes messages, sends Acknowledge(receipt_handle) for each
    6. Broker marks message as deleted

    If consumer crashes without acknowledging: when visibility_timeout expires, the message becomes visible again and a different consumer picks it up. This is the at-least-once delivery guarantee — duplicates are possible on failure.
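
    The dequeue and acknowledge steps can be sketched over an in-memory store (a dict mapping message_id to record; field names follow the data model, everything else is illustrative):

```python
import time
import uuid

def dequeue(store, queue_id, max_messages=10, visibility_timeout=30):
    """Pick up to max_messages available messages and hide them (steps 2-4)."""
    now = time.time()
    picked = []
    for msg in store.values():
        if len(picked) >= max_messages:
            break
        if (msg["queue_id"] == queue_id
                and msg["status"] == "available"
                and msg["visible_at"] <= now):
            msg["status"] = "invisible"                   # hide from others
            msg["visible_at"] = now + visibility_timeout  # redelivery deadline
            msg["receive_count"] += 1
            msg["receipt_handle"] = str(uuid.uuid4())     # one handle per delivery
            picked.append(msg)
    return picked

def acknowledge(store, receipt_handle):
    """Step 6: delete only if the handle matches the current delivery."""
    for msg in store.values():
        if msg.get("receipt_handle") == receipt_handle and msg["status"] == "invisible":
            msg["status"] = "deleted"
            return True
    return False  # stale handle: message was already redelivered or deleted
```

    Acknowledging by receipt handle rather than message ID matters: if the visibility timeout expired and the message was redelivered, the old handle no longer matches, so a slow consumer cannot delete a message another consumer is now processing.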

    Visibility Timeout and Redelivery

    The visibility_timeout is critical. If too short: slow-processing consumers cause redeliveries (duplicates). If too long: failed messages stay hidden too long, causing delays. Best practice: set visibility_timeout to 3-6x the expected processing time. Consumers that need more time can extend their visibility lease via an ExtendVisibility API call.

    Dead letter queue: after max_receive_count deliveries (e.g., 5), the message is moved to the DLQ instead of becoming visible again. DLQ messages can be inspected for debugging and replayed after fixing the bug.
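
    The redeliver-or-DLQ decision on visibility expiry can be sketched as (field names follow the data model; the function and the "dlq" status are illustrative):

```python
def on_visibility_expiry(msg, max_receive_count=5):
    """When a message's visibility timeout expires unacknowledged, either
    make it visible again or route it to the dead letter queue."""
    if msg["receive_count"] >= max_receive_count:
        msg["status"] = "dlq"  # in practice: re-enqueue onto the queue's dlq_id
    else:
        msg["status"] = "available"  # becomes visible; another consumer retries
    return msg["status"]
```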

    Storage Layer

    Two options depending on scale:

    Database-backed (SQS approach)

    Store messages in a distributed database (DynamoDB, Cassandra). Simple, proven. Query for available messages: WHERE status='available' AND visible_at <= now LIMIT 10. Use optimistic locking or conditional writes to atomically claim messages. Index on (queue_id, status, visible_at) for efficient dequeue queries. Works well up to millions of messages in flight.
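
    The atomic-claim pattern can be sketched with a conditional UPDATE: the write succeeds only if the row is still available, so two competing consumers can never claim the same message (SQLite stands in for DynamoDB/Cassandra conditional writes; the schema here is trimmed for illustration):

```python
import sqlite3
import time
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id TEXT PRIMARY KEY, status TEXT,"
             " visible_at REAL, receipt_handle TEXT)")
conn.execute("INSERT INTO messages VALUES ('m1', 'available', 0, NULL)")

def try_claim(message_id, visibility_timeout=30):
    """Conditional write: claim the message only if it is still available."""
    handle = str(uuid.uuid4())
    cur = conn.execute(
        "UPDATE messages SET status='invisible', visible_at=?, receipt_handle=? "
        "WHERE id=? AND status='available' AND visible_at <= ?",
        (time.time() + visibility_timeout, handle, message_id, time.time()),
    )
    return handle if cur.rowcount == 1 else None  # None: lost the race

first = try_claim("m1")   # succeeds: row was available
second = try_claim("m1")  # fails: the WHERE clause no longer matches
```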

    Log-based (Kafka-style)

    Append messages to an ordered log per queue. Track consumer offset (next message index to deliver). No deletion — mark consumed via offset advancement. More efficient (sequential writes), but tracking per-message visibility state is complex. Better for high-throughput, ordered scenarios.

    Ensuring FIFO Ordering

    Basic queues don’t guarantee order (multiple consumers, concurrent dequeues). For FIFO:

    • Single-consumer constraint: only one consumer processes a given message group at a time
    • Message group ID: producers tag messages with a group_id. Messages with the same group_id are locked: only one consumer can have an in-flight message from a group at once. Others wait.
    • Implementation: group_id → consumer_id lock in Redis with TTL = visibility_timeout. Released on acknowledgment or timeout.
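
    The group-lock idea above can be sketched by mimicking Redis SET-with-NX-and-TTL semantics with an in-memory dict (a real implementation would issue the equivalent atomic SET against Redis; all names here are illustrative):

```python
import time

locks = {}  # group_id -> (consumer_id, expires_at); stand-in for Redis keys

def acquire_group(group_id, consumer_id, ttl=30):
    """Lock a message group for one consumer, with a lease of ttl seconds."""
    now = time.time()
    holder = locks.get(group_id)
    if holder is None or holder[1] <= now:      # free, or previous lease expired
        locks[group_id] = (consumer_id, now + ttl)
        return True
    return holder[0] == consumer_id             # re-entrant for the holder

def release_group(group_id, consumer_id):
    """Called on acknowledgment; timeout expiry releases implicitly via TTL."""
    if locks.get(group_id, (None,))[0] == consumer_id:
        del locks[group_id]
```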

    Scaling

    • Partitioning: split a high-volume queue across multiple broker partitions. Producer hashes message key to partition. Consumers assigned to partitions. Ordering guaranteed within a partition, not across partitions.
    • Consumer scaling: add consumers to process faster. Automatically distributed across partitions via a group coordinator (similar to Kafka’s consumer group protocol).
    • Broker scaling: add broker nodes. Reassign partitions to new brokers. Leader election via Raft or ZooKeeper.
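
    Producer-side partition selection can be sketched as a stable hash of the message key, so all messages with the same key land on (and stay ordered within) one partition:

```python
import hashlib

def partition_for(message_key: str, num_partitions: int) -> int:
    """Map a message key to a partition. MD5 keeps the mapping stable
    across processes and restarts, unlike Python's salted built-in hash()."""
    digest = hashlib.md5(message_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

    Note the trade-off baked into the modulo: changing num_partitions remaps most keys, which is why partition counts are usually fixed up front or changed only with a coordinated migration.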

    Interview Tips

    • The visibility timeout mechanism is the key idea — explain it carefully. It’s what enables at-least-once delivery without a central lock.
    • Distinguish traditional queue (each message consumed once, then deleted) from Kafka log (messages retained, multiple consumer groups). Know which you’re designing.
    • Dead letter queue is a real production concern — mention max_receive_count and DLQ routing to show operational maturity.
    • FIFO queues: the group_id locking pattern is the right answer — prevents two consumers from processing group messages out of order.

    FAQ

    What is a visibility timeout and how does it enable at-least-once delivery?

    When a consumer dequeues a message, the message is hidden from other consumers for visibility_timeout seconds, but it is not yet deleted. The consumer must explicitly acknowledge (delete) the message after processing. If the consumer crashes, times out, or fails without acknowledging, the message becomes visible again when visibility_timeout expires, and another consumer can pick it up. This guarantees at-least-once delivery: the message is processed at least once even if consumers fail. The trade-off: if a consumer processes a message successfully but crashes before acknowledging, the message is redelivered and processed twice. Consumers must be idempotent (safe to process the same message twice) or use idempotency keys to detect and skip duplicates.

    What is a dead letter queue and when should you use it?

    A dead letter queue (DLQ) receives messages that repeatedly fail processing. Configure max_receive_count (e.g., 5): if a message is dequeued and not acknowledged 5 times, it moves to the DLQ instead of becoming visible again. Why use a DLQ: (1) it prevents poison-pill messages, since a message that always crashes the consumer would otherwise loop forever, blocking the queue with repeated failures; (2) debugging, since DLQ messages can be inspected to understand why they failed; (3) replayability, since after fixing the bug, DLQ messages can be moved back to the source queue for reprocessing. Without a DLQ, one bad message can permanently degrade a queue's throughput. Set up monitoring alerts on DLQ depth: messages in the DLQ indicate processing failures that need investigation.

    What is the difference between a message queue (SQS) and an event streaming platform (Kafka)?

    Message queue (SQS, RabbitMQ): each message is consumed by exactly one consumer (competing-consumers model) and deleted after successful acknowledgment. There is no ordering guarantee across messages (FIFO queues maintain order only within message groups). Best for task distribution and work queues with multiple independent workers. Event streaming (Kafka): messages are retained in a log for a configurable period (days or weeks). Multiple consumer groups can each read all messages independently at their own offsets, with total ordering within a partition. Best for event sourcing, audit logs, multiple services each needing their own view of all events, stream processing pipelines, and replay or reprocessing. Rule of thumb: if each task should be done once by one worker, use SQS; if every event should be seen by multiple independent subscribers, use Kafka.
