Question 1

How do you ensure at-least-once delivery in a task queue?

Accepted Answer

Use the visibility timeout pattern: when a worker dequeues a task, the task is NOT deleted. Instead, it is moved to an inflight set with an expiry (the visibility timeout, e.g., 30 seconds). The worker must explicitly ACK (delete the task from inflight) after processing it successfully. If the worker crashes before ACKing, the expiry passes and a background job returns the task to the main queue. Redis implementation: ZPOPMIN task_queue returns the task_id; ZADD inflight {now + 30s} {task_id} records it as inflight. Run both commands in one Lua script so that a crash between them cannot lose the task. On success: ZREM inflight {task_id} and update status=DONE. On crash: a background job runs ZRANGEBYSCORE inflight 0 {now} every 30 seconds and returns timed-out tasks to the main queue. Because a slow worker can finish after its task has already been requeued, a task may run more than once, so handlers should be idempotent. This is the same idea as SQS's VisibilityTimeout mechanism.
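The flow above can be sketched without a Redis server. This is a minimal in-memory model, not a production implementation: the class name VisibilityQueue and its method names are hypothetical, a heap stands in for the task_queue sorted set, and a dict stands in for the inflight sorted set. The current time is passed in explicitly so the behavior is easy to follow.

```python
import heapq


class VisibilityQueue:
    """In-memory stand-in for the task_queue and inflight sorted sets (hypothetical)."""

    def __init__(self, visibility_timeout=30.0):
        self.visibility_timeout = visibility_timeout
        self.ready = []      # heap of (score, task_id), models ZSET task_queue
        self.inflight = {}   # task_id -> (expiry, score), models ZSET inflight

    def enqueue(self, task_id, score=0.0):
        heapq.heappush(self.ready, (score, task_id))

    def dequeue(self, now):
        """Models ZPOPMIN task_queue followed by ZADD inflight {now + timeout}."""
        if not self.ready:
            return None
        score, task_id = heapq.heappop(self.ready)
        self.inflight[task_id] = (now + self.visibility_timeout, score)
        return task_id

    def ack(self, task_id):
        """Models ZREM inflight. Returns False if the task already timed out."""
        return self.inflight.pop(task_id, None) is not None

    def requeue_expired(self, now):
        """Models the background job: ZRANGEBYSCORE inflight 0 {now}, then requeue."""
        expired = [t for t, (expiry, _) in self.inflight.items() if expiry <= now]
        for task_id in expired:
            _, score = self.inflight.pop(task_id)
            heapq.heappush(self.ready, (score, task_id))
        return expired
```

A worker that calls dequeue and then crashes never calls ack, so the next requeue_expired call at or after the expiry puts the task back in the ready heap for another worker.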

Question 2

How do you implement retry with exponential backoff in a task queue?

Accepted Answer

On task failure, increment the attempts counter. If attempts < max_attempts, calculate the next retry time as now + base_delay * 2^(attempts-1) (exponential backoff) and add the task back to the queue with score = next_retry_time (for scheduled execution). The task is retried when it is next dequeued at or after that time. Example: base_delay=10s, attempts=1→retry in 10s, attempts=2→20s, attempts=3→40s. Add jitter to prevent a thundering herd: next_retry_time = now + base_delay * 2^(attempts-1) + random(0, base_delay). If attempts >= max_attempts, move the task to a dead letter queue (a separate Redis key or DB table). Store failure_reason with the task for debugging, and alert when the DLQ length exceeds a threshold.
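As a rough sketch of the backoff arithmetic, the functions below compute the delay and decide between retry and dead-lettering. The function names, the dict-based task, and the "state" field are assumptions made for illustration; the exponent uses attempts-1 so the numbers match the 10s, 20s, 40s example.

```python
import random


def retry_delay(attempts, base_delay=10.0, jitter=True):
    """Delay in seconds before the next retry, after `attempts` failures so far."""
    delay = base_delay * 2 ** (attempts - 1)
    if jitter:
        # Random offset in [0, base_delay] so retries do not all fire together.
        delay += random.uniform(0, base_delay)
    return delay


def on_failure(task, now, max_attempts=5, base_delay=10.0):
    """Update a task dict (hypothetical shape) after a failed attempt."""
    task["attempts"] += 1
    if task["attempts"] >= max_attempts:
        task["state"] = "dead_letter"     # caller moves it to the DLQ
        task["next_retry_time"] = None
    else:
        task["state"] = "retry"           # caller re-adds with this score
        task["next_retry_time"] = now + retry_delay(task["attempts"], base_delay)
    return task
```

With jitter disabled, retry_delay returns exactly base_delay * 2^(attempts-1), which makes the schedule easy to check by hand.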

Question 3

How do you implement a priority queue for tasks?

Accepted Answer

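Two approaches: (1) Multiple queues by priority level: high_priority_queue, normal_queue, low_priority_queue. The worker checks high_priority_queue first; if it is empty, it checks normal_queue, then low_priority_queue. This is simple but risks starvation (low-priority tasks are never processed while the high-priority queue is non-empty). (2) Single sorted set with composite score: score = -priority * 10^12 + scheduled_at_unix. Higher priority = more negative score = ZPOPMIN returns it first. For example, a priority-5 task scheduled at time T has score -5*10^12+T, which is lower than that of a priority-4 task at time T (-4*10^12+T). Within one priority level, tasks are ordered by scheduled_at_unix, earliest first. Note that with timestamps in seconds (about 1.7*10^9 today), the priority term always dominates: a priority-5 task scheduled 10 minutes from now still sorts ahead of a priority-4 task due now, and ZPOPMIN does not check whether a task is due. This score alone therefore does not prevent starvation, and tasks scheduled for the future should be held in a separate scheduled set until they are due. One way to let long-waiting low-priority tasks overtake newer high-priority ones is a much smaller weight per priority level, such as a fixed number of seconds.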
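The composite score can be checked with a few lines of arithmetic. In this sketch the names PRIORITY_WEIGHT, composite_score, and pop_order are hypothetical, and sorting a list stands in for repeated ZPOPMIN calls. One property worth verifying: because Unix timestamps in seconds are far below 10^12, the priority term decides the order whenever priorities differ.

```python
PRIORITY_WEIGHT = 10**12  # weight from the formula above; assumes timestamps in seconds


def composite_score(priority, scheduled_at_unix):
    """Lower score pops first: higher priority, then earlier scheduled time."""
    return -priority * PRIORITY_WEIGHT + scheduled_at_unix


def pop_order(tasks):
    """Order in which repeated ZPOPMIN would return (task_id, priority, scheduled_at)."""
    ranked = sorted(tasks, key=lambda t: composite_score(t[1], t[2]))
    return [task_id for task_id, _, _ in ranked]


T = 1_700_000_000
tasks = [
    ("low_now", 4, T),
    ("high_later", 5, T + 600),
    ("high_now", 5, T),
]
print(pop_order(tasks))
```

Both priority-5 tasks come out before the priority-4 task, including the one scheduled 600 seconds later, so a due-time check has to happen somewhere other than the score.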

Question 4

How do you handle scheduled tasks (cron jobs) in a task queue?

Accepted Answer

Maintain a separate sorted set for scheduled tasks: key=scheduled_queue, score=execute_at (Unix timestamp). To schedule: ZADD scheduled_queue {execute_at} {task_id}. A scheduler process runs every 1 second and retrieves due tasks with ZRANGEBYSCORE scheduled_queue 0 {now} LIMIT 0 100 (LIMIT takes an offset and a count). For each due task: ZREM scheduled_queue {task_id}, then ZADD task_queue {priority_score} {task_id}. For recurring tasks (cron-style): after moving the task to the main queue, calculate the next occurrence and ZADD it back to scheduled_queue. The scheduler must be a single instance (use a distributed lock so that two schedulers do not move the same task twice). Alternatively, use a leaderless approach: wrap ZRANGEBYSCORE + ZREM in a Lua script for an atomic check-and-remove.
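One scheduler tick can be modeled in memory as follows. This is a sketch under stated assumptions: a dict of task_id to execute_at stands in for scheduled_queue, a plain list stands in for task_queue, and recurring tasks use a fixed interval in seconds instead of a parsed cron expression. The function names pop_due and scheduler_tick are hypothetical.

```python
def pop_due(scheduled, now, limit=100):
    """Remove and return up to `limit` due (task_id, execute_at) pairs, earliest first.

    Models ZRANGEBYSCORE scheduled_queue 0 {now} LIMIT 0 {limit} plus ZREM,
    done as one step (as the Lua script variant would do).
    """
    due = sorted((ts, tid) for tid, ts in scheduled.items() if ts <= now)[:limit]
    for _, tid in due:
        del scheduled[tid]
    return [(tid, ts) for ts, tid in due]


def scheduler_tick(scheduled, task_queue, intervals, now, limit=100):
    """Move due tasks to the main queue and re-schedule recurring ones."""
    moved = []
    for tid, execute_at in pop_due(scheduled, now, limit):
        task_queue.append(tid)                 # models ZADD task_queue ...
        if tid in intervals:
            # Next occurrence is based on the planned time, not on `now`,
            # so a late tick does not make the schedule drift.
            scheduled[tid] = execute_at + intervals[tid]
        moved.append(tid)
    return moved
```

Calling scheduler_tick once per second with the current time reproduces the loop described above; tasks not yet due stay in the scheduled dict untouched.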

Question 5

What is a dead letter queue and why do you need one?

Accepted Answer

A dead letter queue (DLQ) receives tasks that have exceeded their maximum retry count. Without a DLQ, permanently failing tasks would retry forever, consuming worker capacity. With a DLQ: failing tasks are quarantined after N attempts (typically 3-5). The DLQ stores the task payload, failure reason, last error, and attempt count. Operations: (1) Inspect: engineers examine why tasks failed (bug? dependency down?). (2) Manual retry: after fixing the bug, move tasks from DLQ back to the main queue. (3) Bulk discard: delete DLQ entries after confirming they should not be retried. (4) Alerting: DLQ length > threshold triggers a PagerDuty alert. The DLQ acts as a circuit breaker — it prevents a broken worker or dependency from causing unbounded retries that mask the real problem.
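The four operations can be sketched with an in-memory structure. The class DeadLetterQueue, its method names, and the entry fields below are hypothetical; in the Redis-based design a separate key or DB table would hold the same information, and should_alert would feed whatever paging system is in use.

```python
class DeadLetterQueue:
    """In-memory sketch of a DLQ holding tasks that exhausted their retries."""

    def __init__(self, alert_threshold=100):
        self.alert_threshold = alert_threshold
        self.entries = {}  # task_id -> stored details

    def quarantine(self, task_id, payload, failure_reason, attempts):
        self.entries[task_id] = {
            "payload": payload,
            "failure_reason": failure_reason,
            "attempts": attempts,
        }

    def inspect(self, task_id):
        """(1) Inspect: return stored details, or None if the task is not here."""
        return self.entries.get(task_id)

    def retry(self, task_ids, main_queue):
        """(2) Manual retry: move the given tasks back to the main queue."""
        moved = []
        for tid in task_ids:
            entry = self.entries.pop(tid, None)
            if entry is not None:
                main_queue.append((tid, entry["payload"]))
                moved.append(tid)
        return moved

    def discard(self, task_ids):
        """(3) Bulk discard: drop entries that should not be retried."""
        for tid in task_ids:
            self.entries.pop(tid, None)

    def should_alert(self):
        """(4) Alerting: True once the DLQ length exceeds the threshold."""
        return len(self.entries) > self.alert_threshold
```

A retried task would normally have its attempts counter reset before it re-enters the main queue; otherwise the first new failure would send it straight back to the DLQ.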

Task Queue System Low-Level Design

Requirements

Architecture

Data Model

Queue Backend: Redis-Based

At-Least-Once Delivery

Priority Queues

Scheduled Tasks (Cron Jobs)

Dead Letter Queue

Key Design Decisions