Requirements
- Producers enqueue tasks; workers process them asynchronously
- At-least-once delivery: tasks must not be lost on worker crash
- Retry with backoff on task failure
- Priority queues: high-priority tasks processed before low-priority
- Scheduled tasks: execute at a specific future time (cron-style)
- Dead letter queue: tasks that fail too many times go to a separate queue for inspection
Architecture
Producer → Task API → Queue Backend (Redis/SQS) → Worker Pool
                                    Worker Pool → Retry Queue (on failure)
                                    Worker Pool → Dead Letter Queue (max retries exceeded)
Data Model
Task(task_id UUID, type, payload JSON, priority INT, status ENUM(PENDING,RUNNING,DONE,FAILED),
attempts INT, max_attempts INT, scheduled_at TIMESTAMP, visible_at TIMESTAMP,
created_at, worker_id)
Queue Backend: Redis-Based
Use Redis sorted sets for priority + scheduling:
# Enqueue: score = priority * -1e12 + scheduled_at (lower score = higher priority)
ZADD task_queue {score} {task_id}
HSET task:{task_id} payload {json} type {type} attempts 0 max_attempts 3
# Dequeue (worker polling): pop the lowest-score task and move it to the
# inflight set. ZPOPMIN + ZADD are two separate commands, so wrap the pair
# in a Lua script (EVAL) to make dequeue-and-lock atomic:
ZPOPMIN task_queue 1 → returns [task_id, score]
# Visibility timeout: track the task in an inflight set with a deadline;
# if it is not acked before the deadline, a reaper returns it to the queue
ZADD task_inflight {now + 30s} {task_id}
Visibility timeout pattern (like SQS): when a worker dequeues a task, it’s moved to an inflight set with a timeout. If the worker crashes before completing, the task reappears when the timeout expires. The worker must ACK (delete from inflight) after successful completion.
At-Least-Once Delivery
Transaction flow:
1. Worker calls ZPOPMIN on the task queue.
2. Worker adds the task to the inflight set with score = now + visibility_timeout.
3. Worker processes the task.
4a. On success: ZREM the task from inflight and update status=DONE.
4b. On failure: increment attempts. If attempts < max_attempts, re-add the task to the queue with an exponential-backoff score (next run at now + base_delay * 2^(attempts - 1)); if attempts >= max_attempts, move it to the dead letter queue.
A background reaper runs every 30 seconds: it scans the inflight set for tasks with score < now (visibility timeout expired) and returns them to the task queue. This guarantees no task is lost when a worker crashes mid-processing.
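The failure path and the reaper can be sketched with in-memory stand-ins for the Redis structures (names and the `base_delay * 2^(attempts-1)` backoff convention are illustrative assumptions):

```python
import random

BASE_DELAY = 10.0    # seconds
MAX_ATTEMPTS = 3

task_queue: dict[str, float] = {}   # task_id -> score (stand-in for the sorted set)
inflight: dict[str, float] = {}     # task_id -> visibility deadline
dlq: list[str] = []
attempts: dict[str, int] = {}

def backoff_delay(attempt: int, jitter: bool = True) -> float:
    """Exponential backoff: 10s, 20s, 40s, ... plus optional jitter to
    avoid a thundering herd of synchronized retries."""
    delay = BASE_DELAY * 2 ** (attempt - 1)
    return delay + (random.uniform(0, BASE_DELAY) if jitter else 0.0)

def on_failure(task_id: str, now: float, jitter: bool = True) -> None:
    inflight.pop(task_id, None)
    attempts[task_id] = attempts.get(task_id, 0) + 1
    if attempts[task_id] >= MAX_ATTEMPTS:
        dlq.append(task_id)        # quarantine after max_attempts
    else:
        task_queue[task_id] = now + backoff_delay(attempts[task_id], jitter)

def reap(now: float) -> None:
    """Background job: return tasks whose visibility timeout expired."""
    for task_id, deadline in list(inflight.items()):
        if deadline < now:
            del inflight[task_id]
            task_queue[task_id] = now   # immediately visible again
```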
Priority Queues
Three approaches: (1) Multiple sorted sets, one per priority level. The worker checks the high-priority queue first and falls back to lower ones; simple, but low-priority tasks can starve while the high-priority queue stays non-empty. (2) A single sorted set with priority encoded in the score: score = -priority * 10^12 + scheduled_at_unix. Higher priority = more negative score = popped first. (3) Weighted random selection: pull from the high-priority queue 80% of the time and the normal queue 20%, which prevents starvation of low-priority tasks even when high-priority tasks are abundant.
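Approach (3) reduces to one weighted coin flip per dequeue; a minimal sketch (the 80/20 split and function name are illustrative):

```python
import random

def pick_queue(rng: random.Random, high_weight: float = 0.8) -> str:
    """Weighted selection: serve the high-priority queue most of the time,
    but keep draining the normal queue so it can never starve."""
    return "high" if rng.random() < high_weight else "normal"

# Over many dequeues the split converges to the configured weights.
rng = random.Random(42)          # fixed seed for a reproducible demo
draws = [pick_queue(rng) for _ in range(10_000)]
```

In practice the worker would fall back to the other queue when the chosen one is empty, so the weights only matter under contention.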
Scheduled Tasks (Cron Jobs)
Scheduled tasks use a separate sorted set: key=scheduled_tasks, score=execute_at timestamp. A scheduler process (running every 1s) calls ZRANGEBYSCORE scheduled_tasks 0 {now} LIMIT 0 100 to fetch due tasks. For each due task: ZREM it from scheduled_tasks and ZADD it to task_queue with its computed priority score; if the task is recurring, enqueue its next occurrence back into scheduled_tasks. The scheduler must run as a single instance (or hold a distributed lock) to prevent double-scheduling. Use a Redis lock: SET scheduler_lock {instance_id} NX PX 5000; release it at the end of each loop iteration.
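One scheduler iteration can be sketched in pure Python, with dicts standing in for the sorted sets and cron expressions simplified to fixed intervals (all names are illustrative):

```python
scheduled: dict[str, float] = {}   # task_id -> execute_at (scheduled_tasks zset)
task_queue: dict[str, float] = {}  # task_id -> score (the active queue)
recurring: dict[str, float] = {}   # task_id -> repeat interval in seconds

def scheduler_tick(now: float, batch: int = 100) -> int:
    """One loop iteration: move up to `batch` due tasks into the active
    queue, re-scheduling recurring ones. In Redis this is ZRANGEBYSCORE
    scheduled_tasks 0 {now} LIMIT 0 100, then ZREM + ZADD per task,
    performed while holding the scheduler lock."""
    due = sorted((ts, tid) for tid, ts in scheduled.items() if ts <= now)[:batch]
    for ts, task_id in due:
        del scheduled[task_id]
        task_queue[task_id] = now            # placeholder priority score
        if task_id in recurring:             # cron-style: enqueue next occurrence
            scheduled[task_id] = ts + recurring[task_id]
    return len(due)
```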
Dead Letter Queue
Tasks exceeding max_attempts are moved to a DLQ: key=dlq:{task_type}. The DLQ stores: task_id, payload, failure_reason, last_error_at, attempt_count. Operations on the DLQ: (1) Manual retry: move a task back to the main queue after investigating. (2) Bulk discard: delete all DLQ tasks of a given type after a fix is deployed. (3) Alerting: send an alert when the DLQ length exceeds a threshold, which indicates a systemic failure in that task type. DLQ tasks are retained for 7 days for debugging.
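The DLQ record and the manual-retry operation can be sketched as follows (the entry shape mirrors the fields above; the threshold value and function names are assumptions):

```python
from dataclasses import dataclass

@dataclass
class DLQEntry:
    task_id: str
    payload: dict
    failure_reason: str
    last_error_at: float
    attempt_count: int

dlq: dict[str, list[DLQEntry]] = {}   # task_type -> entries (dlq:{task_type})
task_queue: dict[str, float] = {}     # main queue stand-in
DLQ_ALERT_THRESHOLD = 100

def dead_letter(task_type: str, entry: DLQEntry) -> bool:
    """Quarantine a failed task; returns True when the DLQ length crosses
    the alert threshold (signals a systemic failure in this task type)."""
    dlq.setdefault(task_type, []).append(entry)
    return len(dlq[task_type]) >= DLQ_ALERT_THRESHOLD

def retry_from_dlq(task_type: str, task_id: str, now: float) -> None:
    """Manual retry: move one inspected entry back to the main queue."""
    entries = dlq.get(task_type, [])
    for i, e in enumerate(entries):
        if e.task_id == task_id:
            del entries[i]
            task_queue[task_id] = now
            return
    raise KeyError(task_id)
```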
Key Design Decisions
- Visibility timeout pattern (not DELETE on dequeue) — ensures at-least-once delivery on worker crash
- Exponential backoff on retry — prevents hammering a failing dependency
- Separate sorted sets for scheduled tasks and active queue — clean separation of concerns
- Dead letter queue — prevents repeatedly retrying a permanently broken task
- Single scheduler instance with lock — prevents duplicate cron job execution