Dead Letter Queue (DLQ) System Low-Level Design

What is a Dead Letter Queue?

A Dead Letter Queue (DLQ) holds messages that could not be processed successfully after a maximum number of retry attempts. Without a DLQ, failed messages are either lost or stuck in a queue retrying forever, blocking subsequent messages. A DLQ captures these “dead” messages for inspection, debugging, and manual or automated reprocessing. DLQs are used extensively with Kafka, SQS, RabbitMQ, and other message-driven architectures.

Why Messages Fail

  • Poison pill messages: malformed payload, unexpected format, schema mismatch — will fail every retry
  • Transient failures: downstream service unavailable, DB timeout — may succeed on retry
  • Logic bugs: consumer code has a bug that causes a crash on specific inputs
  • Missing dependencies: message references a resource (order_id, user_id) that was deleted
  • Timeout: processing takes too long — consumer times out and returns a failure
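
The failure categories above usually collapse into one decision: retry, or send straight to the DLQ. A minimal sketch of that decision follows. The `classify` function and the mapping from built-in exception types to categories are assumptions made for illustration, not part of any particular queue client.

```python
import json


class TransientError(Exception):
    """Failure that may succeed on a later attempt."""


class PermanentError(Exception):
    """Failure that will recur on every attempt."""


# Assumed mapping from built-in exception types to a retry decision.
TRANSIENT_TYPES = (TimeoutError, ConnectionError)
PERMANENT_TYPES = (json.JSONDecodeError, KeyError, ValueError, TypeError)


def classify(exc: Exception) -> str:
    """Return "transient" or "permanent" for an exception raised by a handler."""
    if isinstance(exc, (TransientError,) + TRANSIENT_TYPES):
        return "transient"
    if isinstance(exc, (PermanentError,) + PERMANENT_TYPES):
        return "permanent"
    # Unknown errors are treated as permanent in this sketch so that a logic
    # bug cannot retry forever; a real system might allow a bounded retry.
    return "permanent"
```

Treating unknown exceptions as permanent is a design choice: it favors getting a buggy message out of the way quickly over giving it extra chances.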

Architecture

Producer → Main Queue (Kafka topic / SQS queue)
         → Consumer (attempts processing, max N retries)
           → On success: ACK message (delete from queue)
           → On transient failure: NACK, delay and retry (exponential backoff)
           → On max retries exceeded: route to DLQ

DLQ → DLQ Monitor (alerts, dashboards)
    → DLQ Processor (replay or discard)
    → Human investigation

Retry Strategy (Exponential Backoff)

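import time

RETRY_DELAYS = [1, 5, 30, 120, 600]  # seconds: 1s, 5s, 30s, 2min, 10min
MAX_RETRIES = len(RETRY_DELAYS)      # up to 5 retries after the first attempt

class TransientError(Exception):
    """May succeed on retry (downstream unavailable, DB timeout)."""

class PermanentError(Exception):
    """Will fail on every retry (poison pill)."""

class ConsumerWithDLQ:
    # handle(), ack(), and self.dlq are supplied by the concrete consumer.
    def process(self, message):
        for attempt in range(1, MAX_RETRIES + 2):  # first attempt + retries
            try:
                self.handle(message)
                self.ack(message)
                return
            except TransientError as e:
                if attempt <= MAX_RETRIES:
                    # Blocking sleep keeps the sketch simple; it stalls this
                    # consumer for the duration of the delay.
                    time.sleep(RETRY_DELAYS[attempt - 1])
                    continue
                self._dead_letter(message, e, attempt)
                return
            except PermanentError as e:
                # No retry for permanent failures (poison pill)
                self._dead_letter(message, e, attempt)
                return
            # Any other exception propagates to the caller unhandled.

    def _dead_letter(self, message, error, attempts):
        self.dlq.send(message, error=str(error), attempts=attempts)
        self.ack(message)  # remove from main queue only after the DLQ write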
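
The fixed delay list above can also be generated from a formula. The sketch below is one way to do that, not the schedule used above: `backoff_delay`, its `base`, `factor`, and `cap` parameters, and the optional jitter are all assumptions introduced for illustration. With the defaults it yields 1, 5, 25, 125, and 600 seconds, which only approximates the hand-written list.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 1.0, factor: float = 5.0,
                  cap: float = 600.0, jitter: float = 0.0,
                  rng: Optional[random.Random] = None) -> float:
    """Seconds to wait after failed attempt number `attempt` (1-based)."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    # Exponential growth, clamped so the wait never exceeds `cap`.
    delay = min(cap, base * factor ** (attempt - 1))
    if jitter:
        rng = rng or random.Random()
        # Subtract a random share so the result lies in
        # [delay * (1 - jitter), delay]; this spreads out consumers that
        # failed at the same moment.
        delay -= rng.uniform(0, delay * jitter)
    return delay
```

Jitter is not mentioned in the retry strategy above; it is included because many consumers retrying on the same schedule can hit a recovering dependency all at once.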

DLQ Message Schema

DLQMessage(dlq_message_id UUID, source_topic VARCHAR, source_partition INT,
           source_offset BIGINT, original_payload JSONB,
           error_type ENUM(TRANSIENT,PERMANENT,TIMEOUT,SCHEMA),
           error_message TEXT, attempt_count INT,
           first_failed_at TIMESTAMP, last_failed_at TIMESTAMP,
           status ENUM(PENDING,REPLAYED,DISCARDED),
           resolution_note TEXT)
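
For application code, the same record can be mirrored as a typed object. This is a minimal sketch: the field names and enum values come from the schema above, while the `resolve` helper and the use of UTC timestamps as defaults are assumptions added for illustration.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Any, Optional


class ErrorType(Enum):
    TRANSIENT = "TRANSIENT"
    PERMANENT = "PERMANENT"
    TIMEOUT = "TIMEOUT"
    SCHEMA = "SCHEMA"


class Status(Enum):
    PENDING = "PENDING"
    REPLAYED = "REPLAYED"
    DISCARDED = "DISCARDED"


def _now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class DLQMessage:
    source_topic: str
    source_partition: int
    source_offset: int
    original_payload: Any
    error_type: ErrorType
    error_message: str
    attempt_count: int
    first_failed_at: datetime = field(default_factory=_now)
    last_failed_at: datetime = field(default_factory=_now)
    status: Status = Status.PENDING
    resolution_note: Optional[str] = None
    dlq_message_id: uuid.UUID = field(default_factory=uuid.uuid4)

    def resolve(self, status: Status, note: str) -> None:
        """Close a PENDING message as REPLAYED or DISCARDED, with a note."""
        if self.status is not Status.PENDING:
            raise ValueError(f"already resolved as {self.status.value}")
        if status is Status.PENDING:
            raise ValueError("resolution must be REPLAYED or DISCARDED")
        self.status = status
        self.resolution_note = note
```

Requiring a note on every resolution keeps an audit trail of why a message was replayed or discarded.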

DLQ in Kafka

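Kafka does not have native DLQ support at the broker level. To implement one, have the consumer produce the message to a separate DLQ topic (e.g., orders-dlq) once retries are exhausted, and attach metadata headers: original-topic, original-partition, original-offset, failure-reason, attempt-count. A separate DLQ consumer reads from orders-dlq for monitoring and reprocessing. Some frameworks built on Kafka provide this out of the box: Kafka Connect can route failed records from sink connectors to a DLQ topic (errors.deadletterqueue.topic.name), and Spring Kafka ships a DeadLetterPublishingRecoverer.

SQS, by contrast, supports DLQs natively: set a redrive policy on the source queue that names the dead-letter queue (deadLetterTargetArn) and a maxReceiveCount. Once a message has been received more than maxReceiveCount times without being deleted, SQS moves it to the dead-letter queue automatically.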
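
The header set and the SQS setting described above can be sketched as follows. The helper names (`dlq_headers`, `send_to_dlq`, `sqs_redrive_policy`) are hypothetical. The `producer` argument is assumed to behave like a confluent-kafka `Producer` (a `produce(topic, value=..., key=..., headers=...)` method plus `flush()`), and the SQS helper only builds the `RedrivePolicy` attribute that would be passed to SetQueueAttributes on the source queue.

```python
import json
from typing import Any, Dict, List, Optional, Tuple

Headers = List[Tuple[str, bytes]]


def dlq_headers(topic: str, partition: int, offset: int,
                reason: str, attempts: int) -> Headers:
    """Build Kafka record headers describing where and why a message failed."""
    fields = {
        "original-topic": topic,
        "original-partition": partition,
        "original-offset": offset,
        "failure-reason": reason,
        "attempt-count": attempts,
    }
    # Kafka header values are raw bytes, so each value is encoded as UTF-8 text.
    return [(name, str(value).encode("utf-8")) for name, value in fields.items()]


def send_to_dlq(producer: Any, dlq_topic: str, key: Optional[bytes],
                value: bytes, headers: Headers) -> None:
    """Publish the original payload, unchanged, to the DLQ topic."""
    # Assumption: `producer` follows the confluent-kafka Producer interface.
    producer.produce(dlq_topic, value=value, key=key, headers=headers)
    # Flushing per message is slow but ensures the DLQ write has been
    # attempted before the caller commits the source offset.
    producer.flush()


def sqs_redrive_policy(dlq_arn: str, max_receive_count: int) -> Dict[str, str]:
    """Attributes for the source queue that enable the SQS dead-letter queue."""
    policy = {"deadLetterTargetArn": dlq_arn,
              "maxReceiveCount": str(max_receive_count)}
    return {"RedrivePolicy": json.dumps(policy)}
```

Keeping the payload untouched and putting failure details in headers means a replayed message looks exactly like the original to the consumer.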

DLQ Processing and Replay

Options for DLQ messages:

  • Manual investigation: engineers inspect DLQ messages via a dashboard, fix the bug in the consumer, then replay.
  • Automated replay: after fixing a consumer bug, use a replay tool to re-publish DLQ messages back to the main queue in batches.
  • Discard: if the message is truly unprocessable (corrupted, references deleted data), mark it DISCARDED with a note explaining why.

Replay requires idempotent consumers: replaying a message that was partially processed (side effects applied before the crash) must not cause double-processing. Use an idempotency key (message_id) to detect and skip already-processed messages.
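
Batch replay and the idempotency check can be sketched together. This is an in-memory illustration under stated assumptions: `IdempotentConsumer`, `replay`, and the dict-shaped DLQ records are hypothetical, and the `processed` set stands in for a durable store that a real system would update in the same transaction as the handler's side effects.

```python
import time
from typing import Callable, List, Set


class IdempotentConsumer:
    """Wraps a handler so each message_id is handled at most once."""

    def __init__(self, handler: Callable[[dict], object]) -> None:
        self.handler = handler
        self.processed: Set[str] = set()  # stand-in for a durable store

    def consume(self, message: dict) -> bool:
        message_id = message["message_id"]
        if message_id in self.processed:
            return False  # duplicate delivery or replay: skip
        self.handler(message)
        # Recorded only after success; a crash inside the handler leaves
        # the id unrecorded, so the handler itself must tolerate re-runs.
        self.processed.add(message_id)
        return True


def replay(dlq_records: List[dict], publish: Callable[[dict], object],
           batch_size: int = 100, pause_seconds: float = 0.0) -> int:
    """Re-publish PENDING DLQ records in batches and mark them REPLAYED."""
    pending = [r for r in dlq_records if r["status"] == "PENDING"]
    replayed = 0
    for start in range(0, len(pending), batch_size):
        for record in pending[start:start + batch_size]:
            publish(record["original_payload"])
            record["status"] = "REPLAYED"
            replayed += 1
        if pause_seconds and start + batch_size < len(pending):
            time.sleep(pause_seconds)  # throttle between batches
    return replayed
```

Pausing between batches is an optional throttle so a large replay does not overwhelm the consumer that just recovered.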

Alerting and Monitoring

DLQ depth is a key SLI (Service Level Indicator). Alert on: DLQ depth > 0 (any failure is worth knowing), DLQ depth growing (failures accumulating faster than they are being resolved), DLQ depth > threshold (SLO violation). Dashboard metrics: DLQ depth over time by topic, failure reason breakdown (schema vs transient vs permanent), time-to-resolution (how long messages stay in DLQ before being replayed or discarded).
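
The three alert conditions can be evaluated from a short series of depth samples. This is a sketch only: the function name, the alert labels, and the definition of "growing" as strictly increasing across all supplied samples are assumptions, and a real setup would normally express these rules in its monitoring system.

```python
from typing import List, Sequence


def dlq_alerts(depth_samples: Sequence[int], threshold: int) -> List[str]:
    """Evaluate alert conditions on DLQ depth samples, ordered oldest first."""
    if not depth_samples:
        return []
    alerts: List[str] = []
    current = depth_samples[-1]
    if current > 0:
        alerts.append("non_empty")
    # "Growing" here means every sample is larger than the one before it.
    pairs = zip(depth_samples, depth_samples[1:])
    if len(depth_samples) >= 2 and all(later > earlier for earlier, later in pairs):
        alerts.append("growing")
    if current > threshold:
        alerts.append("over_threshold")
    return alerts
```

For example, `dlq_alerts([1, 3, 7], threshold=10)` reports a non-empty and growing queue that has not yet crossed the threshold.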

Key Design Decisions

  • Separate transient from permanent errors — transient retries; permanent goes directly to DLQ
  • Preserve original message and metadata in DLQ — essential for debugging
  • Idempotent consumers — enables safe replay without duplicate processing
  • Alert on DLQ depth > 0 — failures are never silently swallowed
  • Replay tooling — makes fixing bugs and recovering from failures fast

