Task Queue System Low-Level Design

Requirements

  • Producers enqueue tasks; workers process them asynchronously
  • At-least-once delivery: tasks must not be lost on worker crash
  • Retry with backoff on task failure
  • Priority queues: high-priority tasks processed before low-priority
  • Scheduled tasks: execute at a specific future time (cron-style)
  • Dead letter queue: tasks that fail too many times go to a separate queue for inspection

Architecture

Producer → Task API → Queue Backend (Redis/SQS) → Worker Pool
                                                  → Retry Queue (on failure)
                                                  → Dead Letter Queue (max retries exceeded)

Data Model

Task(task_id UUID, type, payload JSON, priority INT, status ENUM(PENDING,RUNNING,DONE,FAILED),
     attempts INT, max_attempts INT, scheduled_at TIMESTAMP, visible_at TIMESTAMP,
     created_at, worker_id)
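
As an illustration, the record above could be modelled in application code roughly as follows. This is a minimal sketch: the Python types, the default values, and the `_utcnow` helper are assumptions made for the example, not part of the schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Any, Dict, Optional
import uuid


class TaskStatus(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    DONE = "DONE"
    FAILED = "FAILED"


def _utcnow() -> datetime:
    # Hypothetical helper: timezone-aware "now" for timestamp defaults.
    return datetime.now(timezone.utc)


@dataclass
class Task:
    type: str
    payload: Dict[str, Any]
    priority: int = 0
    status: TaskStatus = TaskStatus.PENDING
    attempts: int = 0
    max_attempts: int = 3                      # assumed default
    scheduled_at: datetime = field(default_factory=_utcnow)
    visible_at: Optional[datetime] = None      # set while a worker holds the task
    created_at: datetime = field(default_factory=_utcnow)
    worker_id: Optional[str] = None            # set when a worker claims the task
    task_id: str = field(default_factory=lambda: str(uuid.uuid4()))


task = Task(type="send_email", payload={"to": "user@example.com"})
```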

Queue Backend: Redis-Based

Use Redis sorted sets for priority + scheduling:

# Enqueue: score = -priority * 1e12 + scheduled_at_unix (higher priority gives a lower score, which is popped first)
ZADD task_queue {score} {task_id}
HSET task:{task_id} payload {json} type {type} attempts 0 max_attempts 3

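# Dequeue (worker polling): pop the lowest-score task and record it as inflight.
# Both steps run in one Lua script so a worker crash cannot fall between them:
EVAL "local t = redis.call('ZPOPMIN', KEYS[1]); if #t == 0 then return false end; redis.call('ZADD', KEYS[2], ARGV[1], t[1]); return t[1]" 2 task_queue task_inflight {now + 30s}
# Returns the task_id, or nil if the queue is empty.
# The inflight score {now + 30s} is the visibility deadline: a task not acked by then is returned to task_queue.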

Visibility timeout pattern (like SQS): when a worker dequeues a task, it’s moved to an inflight set with a timeout. If the worker crashes before completing, the task reappears when the timeout expires. The worker must ACK (delete from inflight) after successful completion.
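
The behaviour of the visibility timeout can be seen in a small in-memory model. This is a sketch only: `InMemoryQueue` and its method names are hypothetical, it uses a heap and a dict in place of the two Redis sorted sets, and it takes `now` as an argument so the example is deterministic.

```python
import heapq
from typing import Dict, List, Optional, Tuple


class InMemoryQueue:
    """Toy model of task_queue + task_inflight; not a Redis client."""

    def __init__(self, visibility_timeout: float = 30.0) -> None:
        self.visibility_timeout = visibility_timeout
        self._ready: List[Tuple[float, str]] = []             # min-heap of (score, task_id)
        self._inflight: Dict[str, Tuple[float, float]] = {}   # task_id -> (deadline, score)

    def enqueue(self, task_id: str, score: float) -> None:
        heapq.heappush(self._ready, (score, task_id))

    def dequeue(self, now: float) -> Optional[str]:
        """Pop the lowest-score task and mark it inflight in one step."""
        if not self._ready:
            return None
        score, task_id = heapq.heappop(self._ready)
        self._inflight[task_id] = (now + self.visibility_timeout, score)
        return task_id

    def ack(self, task_id: str) -> bool:
        """Worker finished: drop the inflight entry so the task never reappears."""
        return self._inflight.pop(task_id, None) is not None

    def requeue_expired(self, now: float) -> List[str]:
        """Return tasks whose visibility deadline has passed to the ready heap."""
        expired = [tid for tid, (deadline, _) in self._inflight.items() if deadline <= now]
        for tid in expired:
            _, score = self._inflight.pop(tid)
            heapq.heappush(self._ready, (score, tid))
        return expired


q = InMemoryQueue(visibility_timeout=30.0)
q.enqueue("task-a", score=1.0)
first = q.dequeue(now=100.0)         # a worker takes the task, then crashes
hidden = q.dequeue(now=110.0)        # nothing visible while the timeout runs
expired = q.requeue_expired(now=131.0)
redelivered = q.dequeue(now=131.0)   # the same task is handed out again
```

Because the crashed worker never calls `ack`, the task is delivered a second time once its deadline has passed.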

At-Least-Once Delivery

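Transaction flow: (1) The worker pops the lowest-score task from the task queue and adds it to the inflight set with score = now + visibility_timeout. These two steps must run atomically (for example in a single Lua script); if ZPOPMIN and ZADD are issued separately, a crash between them loses the task. (2) The worker processes the task. (3a) On success: ZREM the task from the inflight set and update status=DONE. (3b) On failure: remove the task from the inflight set and increment attempts. If attempts < max_attempts, re-enqueue it with an exponential backoff delay of 2^attempts * base_delay; if attempts >= max_attempts, move it to the dead letter queue. Note that ZPOPMIN ignores the clock, so adding the delay to the score in the task queue only reorders the task. To actually wait, place the retry in a time-ordered set (such as the scheduled set described below) with execute_at = now + delay, or have workers skip tasks that are not yet due.

A background job runs every 30 seconds, scans the inflight set for entries with score < now (timed out), and returns them to the task queue. This keeps tasks from being lost when a worker crashes, but a slow worker can be mistaken for a dead one, so a task may run more than once and handlers should be idempotent.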
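
The failure branch reduces to a small decision function. The sketch below assumes times in seconds; the function names, the `base_delay` default, and the `max_delay` cap are illustrative choices that the text does not specify.

```python
from typing import Optional, Tuple


def backoff_delay(attempts: int, base_delay: float = 1.0,
                  max_delay: float = 3600.0) -> float:
    """Delay before the next attempt: base_delay * 2^attempts, capped at max_delay.

    The cap is an assumption added here so the delay cannot grow without bound.
    """
    return min(base_delay * (2 ** attempts), max_delay)


def on_failure(attempts: int, max_attempts: int, now: float,
               base_delay: float = 1.0) -> Tuple[str, int, Optional[float]]:
    """Decide what happens to a task after a failed attempt.

    Returns (action, new_attempts, retry_at); action is "retry" or "dead_letter".
    """
    attempts += 1
    if attempts >= max_attempts:
        return ("dead_letter", attempts, None)
    return ("retry", attempts, now + backoff_delay(attempts, base_delay))


# With max_attempts=3 and base_delay=1s: retry after 2s, then 4s, then dead-letter.
history = [on_failure(n, max_attempts=3, now=100.0) for n in range(3)]
```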

Priority Queues

Three approaches: (1) Multiple sorted sets, one per priority level. The worker checks the high-priority queue first and falls back to lower levels only when it is empty. (2) A single sorted set with priority encoded in the score: score = -priority * 10^12 + scheduled_at_unix. Higher priority = more negative score = popped first. (3) Weighted random selection: pull from the high-priority queue 80% of the time and the normal queue 20% of the time. Unlike the strict ordering in (1) and (2), this prevents starvation of low-priority tasks when high-priority tasks are abundant.
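
Approaches (2) and (3) can be sketched in a few lines. The names `queue_score` and `pick_queue` are hypothetical, timestamps are assumed to be Unix seconds, and the choice to skip empty queues is an assumption of this example.

```python
import random
from typing import Dict, Optional

PRIORITY_SPAN = 10 ** 12  # larger than any Unix timestamp in seconds


def queue_score(priority: int, scheduled_at_unix: int) -> int:
    """Approach (2): higher priority gives a more negative score, popped first.

    Within one priority level, earlier timestamps sort first.
    """
    return -priority * PRIORITY_SPAN + scheduled_at_unix


def pick_queue(lengths: Dict[str, int], weights: Dict[str, float],
               rng: random.Random) -> Optional[str]:
    """Approach (3): pick a non-empty queue at random, proportionally to its weight."""
    candidates = [name for name, n in lengths.items() if n > 0]
    if not candidates:
        return None
    return rng.choices(candidates, weights=[weights[c] for c in candidates], k=1)[0]


weights = {"high": 0.8, "normal": 0.2}
rng = random.Random(42)  # seeded only to make the example repeatable
picks = [pick_queue({"high": 5, "normal": 5}, weights, rng) for _ in range(10_000)]
high_share = picks.count("high") / len(picks)
```

Redis stores sorted-set scores as double-precision floats, which represent integers exactly only up to 2^53. With a multiplier of 10^12, priority values should therefore stay within a few thousand so the timestamp part of the score is not rounded away.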

Scheduled Tasks (Cron Jobs)

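Scheduled tasks use a separate sorted set: key=scheduled_tasks, score=execute_at timestamp. A scheduler process runs every 1s: ZRANGEBYSCORE scheduled_tasks 0 {now} LIMIT 0 100. For each due task: ZADD it to task_queue (computing the priority score), ZREM it from scheduled_tasks so it is not picked up again on the next tick, and, if it is recurring, ZADD the next occurrence to scheduled_tasks. The scheduler must run as a single instance, or take a distributed lock, to prevent double-scheduling. With Redis: SET scheduler_lock {instance_id} NX PX 5000 at the start of each loop, and release the lock when the loop completes. The release should delete the key only if it still holds this instance's {instance_id}; otherwise an instance whose lock has already expired could delete a lock now held by another instance. The 5000 ms expiry must also be longer than one loop iteration.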
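
One scheduler tick can be sketched against plain dictionaries standing in for the scheduled_tasks sorted set. This is an illustration under assumptions: `scheduler_tick` is a hypothetical name, recurring tasks are modelled with a fixed repeat interval rather than a cron expression, and locking is left out.

```python
from typing import Callable, Dict, List


def scheduler_tick(scheduled: Dict[str, float],
                   intervals: Dict[str, float],
                   now: float,
                   enqueue: Callable[[str], None],
                   limit: int = 100) -> List[str]:
    """Move due tasks to the active queue and re-schedule recurring ones.

    scheduled: task_id -> execute_at (stand-in for the scheduled_tasks sorted set)
    intervals: task_id -> repeat interval in seconds, for recurring tasks only
    """
    # Equivalent of ZRANGEBYSCORE scheduled_tasks 0 {now} LIMIT 0 {limit}.
    due = sorted((item for item in scheduled.items() if item[1] <= now),
                 key=lambda item: item[1])[:limit]
    moved = []
    for task_id, execute_at in due:
        del scheduled[task_id]      # a due task must leave the scheduled set
        enqueue(task_id)            # hand over to the active queue
        if task_id in intervals:
            # Base the next run on the planned time, not on `now`,
            # so a late tick does not make the schedule drift.
            scheduled[task_id] = execute_at + intervals[task_id]
        moved.append(task_id)
    return moved


scheduled = {"report": 10.0, "cleanup": 50.0, "email": 5.0}
active: List[str] = []
moved = scheduler_tick(scheduled, {"report": 60.0}, now=20.0, enqueue=active.append)
```

After this tick, "email" and "report" have moved to the active queue, "report" is scheduled again at 70.0, and "cleanup" is still waiting.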

Dead Letter Queue

Tasks exceeding max_attempts are moved to a DLQ: key=dlq:{task_type}. Each DLQ entry stores task_id, payload, failure_reason, last_error_at, and attempt_count. Operations on the DLQ: (1) Manual retry: move a task back to the main queue after investigating. (2) Bulk discard: delete all DLQ tasks of a given type after a fix is deployed. (3) Alerting: send an alert when DLQ length exceeds a threshold, which suggests a systemic failure in that task type. DLQ tasks are retained for 7 days for debugging.
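
The DLQ operations can be sketched with an in-memory structure keyed by task type. The class and method names are hypothetical, the alert threshold default is an arbitrary assumption, and timestamps are plain seconds so retention can be checked deterministically.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

RETENTION_SECONDS = 7 * 24 * 3600  # the 7-day retention period from the text


@dataclass
class DeadLetter:
    task_id: str
    payload: dict
    failure_reason: str
    last_error_at: float
    attempt_count: int


class DeadLetterQueue:
    """Toy stand-in for the dlq:{task_type} keys; not a Redis client."""

    def __init__(self, alert_threshold: int = 100) -> None:
        self.alert_threshold = alert_threshold   # assumed value
        self._by_type: Dict[str, List[DeadLetter]] = {}

    def add(self, task_type: str, entry: DeadLetter) -> bool:
        """Store a failed task; return True when the alert threshold is exceeded."""
        entries = self._by_type.setdefault(task_type, [])
        entries.append(entry)
        return len(entries) > self.alert_threshold

    def retry(self, task_type: str, task_id: str) -> Optional[DeadLetter]:
        """Manual retry: remove and return the entry so the caller can re-enqueue it."""
        entries = self._by_type.get(task_type, [])
        for i, entry in enumerate(entries):
            if entry.task_id == task_id:
                return entries.pop(i)
        return None

    def discard_all(self, task_type: str) -> int:
        """Bulk discard: drop every entry of one type; return how many were removed."""
        return len(self._by_type.pop(task_type, []))

    def purge_expired(self, now: float) -> int:
        """Drop entries older than the retention period; return how many were removed."""
        removed = 0
        for entries in self._by_type.values():
            kept = [e for e in entries if now - e.last_error_at < RETENTION_SECONDS]
            removed += len(entries) - len(kept)
            entries[:] = kept
        return removed
```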

Key Design Decisions

  • Visibility timeout pattern (not DELETE on dequeue) — ensures at-least-once delivery on worker crash
  • Exponential backoff on retry — prevents hammering a failing dependency
  • Separate sorted sets for scheduled tasks and active queue — clean separation of concerns
  • Dead letter queue — prevents repeatedly retrying a permanently broken task
  • Single scheduler instance with lock — prevents duplicate cron job execution

