Question 1

What is a dead letter queue and why is it needed?

Accepted Answer

A dead letter queue (DLQ) holds messages that failed processing after a maximum number of retry attempts. Without a DLQ: (1) Failed messages are retried forever, blocking the queue for subsequent messages. (2) Poison pill messages (malformed payloads that always fail) stall queue processing indefinitely. (3) Failures are silently swallowed — no visibility into what went wrong. With a DLQ: failed messages are moved to a separate queue after N retry attempts. Normal queue processing continues unblocked. Engineers are alerted when messages land in the DLQ. The original payload, error message, and attempt count are preserved for debugging. After fixing the bug, messages can be replayed from the DLQ back to the main queue. DLQs are supported natively by AWS SQS, RabbitMQ, and Azure Service Bus. With Kafka, you implement DLQ manually by producing failed messages to a dlq topic.

Question 2

How do you distinguish transient failures from permanent failures (poison pills)?

Accepted Answer

Transient failure: a processing attempt failed due to a temporary condition — downstream service timeout, DB connection error, rate limit hit. The message may succeed on retry. Indicators: network errors, HTTP 429/503, DB timeout exceptions. Permanent failure (poison pill): the message will fail every time regardless of retries — malformed JSON, missing required field, schema mismatch, reference to a deleted record, business logic validation failure. Indicators: JSON parse errors, null pointer exceptions, validation exceptions, HTTP 400 responses from downstream. Strategy: (1) Catch specific exception types and classify them. Transient exceptions → retry with backoff. Permanent exceptions → route directly to DLQ (no retry). (2) For unknown failures: retry up to max_retries, then DLQ. (3) Include the exception class name in the DLQ message metadata — makes the failure reason searchable in dashboards. (4) Separate DLQ topics by failure type for easier filtering.

Question 3

What retry strategy should you use for transient failures before the DLQ?

Accepted Answer

Exponential backoff with jitter is the standard approach. Base delays: 1s, 5s, 30s, 2min, 10min (5 attempts over ~12 minutes total). Jitter: add ±20% random variation to each delay — prevents synchronized retries from thundering-herding a recovering downstream service. Max attempts: 3-5 for most use cases. Too few: legitimate transient failures don't recover. Too many: a dead downstream keeps your queue backed up for hours. For Kafka: implement retry topics (orders-retry-1, orders-retry-2) with delay achieved via a waiting consumer that re-publishes after the delay. This avoids blocking the main consumer. For SQS: set visibility timeout to the retry delay — the message reappears after the timeout. visibilityTimeout = exponential_delay(attempt_count). After MaxReceiveCount attempts, SQS moves the message to the DLQ automatically. Always alert immediately when a message hits the DLQ, not just when the DLQ depth exceeds a threshold.

Question 4

How do you replay messages from a DLQ after fixing a bug?

Accepted Answer

DLQ replay requires: (1) Identifying which messages to replay — filter by failure_reason, error_type, or time range. (2) Re-publishing to the main queue — read from DLQ, produce to the main queue. (3) Idempotency — the consumer must handle re-delivery of messages that were partially processed before the crash. Use an idempotency key (the original message_id) to detect and skip already-completed operations. Replay tooling: write a script that SELECTs messages from the DLQ table (or reads from the DLQ Kafka topic) and produces them back to the main topic. Add a header: X-DLQ-Replay: true, original-failure-reason: ... so the consumer can log it. Replay in batches: if 100K messages are in the DLQ, replay 1K at a time to avoid overwhelming the main queue. Monitor: watch the main queue depth and error rate during replay — if the fix didn't work, you'll see messages cycling back to the DLQ.

Question 5

How do you monitor a DLQ and alert on failures?

Accepted Answer

DLQ depth is a critical SLI. Monitoring strategy: (1) Alert on DLQ depth > 0 for high-priority queues (payment processing, order fulfillment). Any failure is worth immediate investigation. (2) Alert on DLQ depth growth rate — if messages are accumulating faster than they are being resolved, it indicates an ongoing incident. (3) Alert on time-in-DLQ — if messages have been in the DLQ for > 4 hours without being replayed or discarded, escalate. Dashboard metrics: DLQ depth over time by queue, messages added per hour, failure reason breakdown (pie chart by error_type), oldest message age, time-to-resolution histogram. Integration: Kafka DLQ topics → consume with a monitoring consumer → write metrics to Prometheus → Grafana dashboard. For SQS: CloudWatch metric QueueDepth on the DLQ → CloudWatch Alarm → SNS → PagerDuty. Never let DLQ messages sit silently. A DLQ with no alerts is a data loss waiting to happen.

Dead Letter Queue (DLQ) System Low-Level Design

What is a Dead Letter Queue?

Why Messages Fail

Architecture

Retry Strategy (Exponential Backoff)

DLQ Message Schema

DLQ in Kafka

DLQ Processing and Replay

Alerting and Monitoring

Key Design Decisions