What is a Dead Letter Queue?
A Dead Letter Queue (DLQ) holds messages that could not be processed successfully after a maximum number of retry attempts. Without a DLQ, failed messages are either lost or stuck in a queue retrying forever, blocking subsequent messages. A DLQ captures these “dead” messages for inspection, debugging, and manual or automated reprocessing. DLQs are used extensively with Kafka, SQS, RabbitMQ, and other message-driven systems.
Why Messages Fail
- Poison pill messages: malformed payload, unexpected format, schema mismatch — will fail every retry
- Transient failures: downstream service unavailable, DB timeout — may succeed on retry
- Logic bugs: consumer code has a bug that causes a crash on specific inputs
- Missing dependencies: message references a resource (order_id, user_id) that was deleted
- Timeout: processing takes too long — consumer times out and returns a failure
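In practice, the categories above usually collapse into one decision: retry or do not retry. Below is a minimal sketch of that decision. The exception classes TransientError and PermanentError and the helper classify are hypothetical names, and the mapping of built-in exceptions is illustrative rather than exhaustive.

```python
import json


class TransientError(Exception):
    """Failure that may succeed on retry (hypothetical class)."""


class PermanentError(Exception):
    """Failure that will fail the same way on every retry (hypothetical class)."""


def classify(exc: Exception) -> type:
    """Map a raw exception to TransientError or PermanentError."""
    # Timeouts and connection problems are worth retrying.
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return TransientError
    # Malformed payloads and missing fields fail identically on every retry.
    if isinstance(exc, (json.JSONDecodeError, KeyError, ValueError, TypeError)):
        return PermanentError
    # Unknown errors are treated as transient here; the retry limit still
    # routes them to the DLQ if they keep failing.
    return TransientError
```

Treating unknown errors as transient is a judgment call. A stricter consumer might send them straight to the DLQ to avoid wasted retries.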
Architecture
Producer → Main Queue (Kafka topic / SQS queue)
    → Consumer (attempts processing, max N retries)
        → On success: ACK message (delete from queue)
        → On transient failure: NACK, delay and retry (exponential backoff)
        → On permanent failure or max retries exceeded: route to DLQ
DLQ → DLQ Monitor (alerts, dashboards)
    → DLQ Processor (replay or discard)
    → Human investigation
Retry Strategy (Exponential Backoff)
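import time

# Delay in seconds before each retry: 1s, 5s, 30s, 2min, 10min
RETRY_DELAYS = [1, 5, 30, 120, 600]
MAX_RETRIES = len(RETRY_DELAYS)  # 5 retries after the first attempt

class TransientError(Exception):
    """May succeed on retry (downstream unavailable, DB timeout)."""

class PermanentError(Exception):
    """Will fail on every retry (poison pill)."""

class ConsumerWithDLQ:
    # handle(), ack(), and self.dlq are supplied by the concrete consumer.
    def process(self, message):
        for attempt in range(MAX_RETRIES + 1):
            try:
                self.handle(message)
                self.ack(message)
                return
            except TransientError as e:
                if attempt < MAX_RETRIES:
                    time.sleep(RETRY_DELAYS[attempt])
                    continue
                # Retries exhausted: park the message in the DLQ
                self.dlq.send(message, error=str(e), attempts=attempt + 1)
                self.ack(message)  # remove from main queue
                return
            except PermanentError as e:
                # No retry for permanent failures (poison pill)
                self.dlq.send(message, error=str(e), attempts=attempt + 1)
                self.ack(message)
                return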
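The schedule in this section is a fixed table of delays. A common alternative is to compute each delay, doubling it per attempt and adding random jitter so that many consumers failing at once do not all retry at the same moment. The sketch below shows that variant ("full jitter"), not the table above. The function name backoff_delay and the base and cap defaults are assumptions chosen for illustration.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 600.0,
                  rng: Optional[random.Random] = None) -> float:
    """Return the delay in seconds before retry number `attempt` (0-based)."""
    rng = rng or random.Random()
    # Exponential growth, capped so late retries do not wait unboundedly.
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: pick uniformly in [0, ceiling] to spread out retries.
    return rng.uniform(0.0, ceiling)
```

With these defaults the upper bound doubles from 1 second until it reaches the 600-second cap, which matches the 10-minute maximum in the table.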
DLQ Message Schema
DLQMessage(dlq_message_id UUID, source_topic VARCHAR, source_partition INT,
source_offset BIGINT, original_payload JSONB,
error_type ENUM(TRANSIENT,PERMANENT,TIMEOUT,SCHEMA),
error_message TEXT, attempt_count INT,
first_failed_at TIMESTAMP, last_failed_at TIMESTAMP,
status ENUM(PENDING,REPLAYED,DISCARDED),
resolution_note TEXT)
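The same schema can be mirrored in application code. The dataclass below is one possible sketch: the field names come from the schema above, while the defaults (timestamps set at creation, status PENDING, generated UUID) are assumptions.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Any, Optional


class ErrorType(Enum):
    TRANSIENT = "TRANSIENT"
    PERMANENT = "PERMANENT"
    TIMEOUT = "TIMEOUT"
    SCHEMA = "SCHEMA"


class Status(Enum):
    PENDING = "PENDING"
    REPLAYED = "REPLAYED"
    DISCARDED = "DISCARDED"


def _now() -> datetime:
    """Current time in UTC, used as the default failure timestamp."""
    return datetime.now(timezone.utc)


@dataclass
class DLQMessage:
    source_topic: str
    source_partition: int
    source_offset: int
    original_payload: Any          # stored unmodified for debugging and replay
    error_type: ErrorType
    error_message: str
    attempt_count: int
    first_failed_at: datetime = field(default_factory=_now)
    last_failed_at: datetime = field(default_factory=_now)
    status: Status = Status.PENDING
    resolution_note: Optional[str] = None
    dlq_message_id: uuid.UUID = field(default_factory=uuid.uuid4)
```

A new entry starts as PENDING and moves to REPLAYED or DISCARDED once someone resolves it, with resolution_note recording why.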
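DLQ in Kafka and SQS
Kafka brokers and the plain consumer client have no native DLQ support. Implementation: when a consumer exhausts retries, it produces the message to a separate DLQ topic (e.g., orders-dlq) and includes metadata headers: original-topic, original-partition, original-offset, failure-reason, attempt-count. A separate DLQ consumer reads from orders-dlq for monitoring and reprocessing. Some frameworks built on Kafka provide this out of the box: Spring Kafka has a DeadLetterPublishingRecoverer, and Kafka Connect can route failed sink-connector records to a DLQ topic. Kafka Streams has traditionally exposed exception handlers rather than a ready-made DLQ, so check what your version offers.
For SQS: attach a redrive policy to the source queue that names the dead-letter queue and sets maxReceiveCount; SQS automatically moves a message to the dead-letter queue once it has been received more than maxReceiveCount times without being deleted.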
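The sketch below shows roughly what both setups look like as data. The header names are the ones listed above. The helper names (build_dlq_headers, dlq_topic_for, sqs_redrive_policy) are hypothetical, and the "-dlq" suffix is only a naming convention. Headers are built as a list of (str, bytes) tuples, a shape that common Python Kafka clients accept, but check your client's documentation. The SQS function only builds the JSON string for the queue's RedrivePolicy attribute and does not call AWS.

```python
import json
from typing import List, Tuple

Headers = List[Tuple[str, bytes]]


def dlq_topic_for(topic: str) -> str:
    """Derive the DLQ topic name from the source topic (naming convention)."""
    return f"{topic}-dlq"


def build_dlq_headers(topic: str, partition: int, offset: int,
                      reason: str, attempts: int) -> Headers:
    """Build the metadata headers to attach to a record sent to the DLQ topic."""
    # Kafka header values are raw bytes, so every value is encoded as UTF-8 text.
    return [
        ("original-topic", topic.encode("utf-8")),
        ("original-partition", str(partition).encode("utf-8")),
        ("original-offset", str(offset).encode("utf-8")),
        ("failure-reason", reason.encode("utf-8")),
        ("attempt-count", str(attempts).encode("utf-8")),
    ]


def sqs_redrive_policy(dlq_arn: str, max_receive_count: int = 5) -> str:
    """Return the JSON string for the source queue's RedrivePolicy attribute."""
    return json.dumps({
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": max_receive_count,
    })
```

Keeping the original topic, partition, and offset in headers leaves the payload untouched, so a replay tool can re-publish it byte for byte.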
DLQ Processing and Replay
Options for DLQ messages:
- Manual investigation: engineers inspect DLQ messages via a dashboard, fix the bug in the consumer, then replay.
- Automated replay: after fixing a consumer bug, use a replay tool to re-publish DLQ messages back to the main queue in batches.
- Discard: if the message is truly unprocessable (corrupted, references deleted data), mark it DISCARDED with a note explaining why.
Replay idempotency: the consumer must be idempotent. Replaying a message that was partially processed (side effects before the crash) must not cause double-processing. Use an idempotency key (message_id) to detect and skip already-processed messages.
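A minimal sketch of idempotent replay follows. It assumes each message is a dict carrying a message_id. IdempotentConsumer and replay are hypothetical names, and the in-memory set stands in for a durable store.

```python
from typing import Callable, Dict, Iterable, Set


class IdempotentConsumer:
    def __init__(self, handler: Callable[[Dict], None]) -> None:
        self._handler = handler
        # Stand-in for a durable store, for example a table with a unique
        # constraint on message_id.
        self._processed: Set[str] = set()

    def handle(self, message: Dict) -> bool:
        """Process the message once; return False if it was a duplicate."""
        message_id = message["message_id"]
        if message_id in self._processed:
            return False
        self._handler(message)
        # Recorded only after success, so a failed attempt can be replayed.
        self._processed.add(message_id)
        return True


def replay(dlq_messages: Iterable[Dict], consumer: IdempotentConsumer) -> int:
    """Re-submit DLQ messages; return how many were actually processed."""
    replayed = 0
    for message in dlq_messages:
        if consumer.handle(message):
            replayed += 1
    return replayed
```

This sketch skips only fully processed messages. To cover partially processed ones as well, the handler's side effects and the processed marker would need to be committed together, for example in one database transaction.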
Alerting and Monitoring
DLQ depth is a key SLI (Service Level Indicator). Alert on: DLQ depth > 0 (any failure is worth knowing), DLQ depth growing (failures accumulating faster than they are being resolved), DLQ depth > threshold (SLO violation). Dashboard metrics: DLQ depth over time by topic, failure reason breakdown (schema vs transient vs permanent), time-to-resolution (how long messages stay in DLQ before being replayed or discarded).
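The three alert conditions can be expressed as a small check over recent depth samples. This is a sketch only: dlq_alerts is a hypothetical helper, the alert labels are made up for illustration, and "growing" is simplified to "the latest sample exceeds the oldest one in the window".

```python
from typing import List, Sequence


def dlq_alerts(depth_samples: Sequence[int], threshold: int) -> List[str]:
    """Evaluate alert conditions on DLQ depth samples, ordered oldest first."""
    alerts: List[str] = []
    if not depth_samples:
        return alerts
    current = depth_samples[-1]
    # Any message in the DLQ is worth knowing about.
    if current > 0:
        alerts.append("non_empty")
    # Growing: failures are accumulating faster than they are resolved.
    if len(depth_samples) >= 2 and current > depth_samples[0]:
        alerts.append("growing")
    # Over threshold: treated as an SLO violation.
    if current > threshold:
        alerts.append("over_threshold")
    return alerts
```

In a real setup these conditions would more likely live as rules in the monitoring system than in application code, but the logic is the same.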
Key Design Decisions
- Separate transient from permanent errors — transient retries; permanent goes directly to DLQ
- Preserve original message and metadata in DLQ — essential for debugging
- Idempotent consumers — enables safe replay without duplicate processing
- Alert on DLQ depth > 0 — failures are never silently swallowed
- Replay tooling — makes fixing bugs and recovering from failures fast