Requirements
- Producers enqueue tasks; workers process them asynchronously
- At-least-once delivery: tasks must not be lost on worker crash
- Retry with backoff on task failure
- Priority queues: high-priority tasks processed before low-priority
- Scheduled tasks: execute at a specific future time (cron-style)
- Dead letter queue: tasks that fail too many times go to a separate queue for inspection
Architecture
Producer → Task API → Queue Backend (Redis/SQS) → Worker Pool
Worker Pool → Retry Queue (on failure)
Worker Pool → Dead Letter Queue (max retries exceeded)
Data Model
Task(task_id UUID, type, payload JSON, priority INT, status ENUM(PENDING,RUNNING,DONE,FAILED),
attempts INT, max_attempts INT, scheduled_at TIMESTAMP, visible_at TIMESTAMP,
created_at, worker_id)
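As an illustration, the record above could be modelled roughly as follows. This is a minimal sketch, not a prescribed schema: the Python types for `worker_id` and `visible_at`, the default `priority` of 0, and the helper `_utcnow` are assumptions made for the example, and `max_attempts=3` mirrors the value used in the Redis commands later in this article.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class TaskStatus(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    DONE = "DONE"
    FAILED = "FAILED"


def _utcnow() -> datetime:
    # Hypothetical helper: timezone-aware "now" for timestamp defaults.
    return datetime.now(timezone.utc)


@dataclass
class Task:
    type: str
    payload: dict                       # stored as JSON in the backend
    priority: int = 0                   # assumed default; higher runs earlier
    status: TaskStatus = TaskStatus.PENDING
    attempts: int = 0
    max_attempts: int = 3
    task_id: uuid.UUID = field(default_factory=uuid.uuid4)
    scheduled_at: datetime = field(default_factory=_utcnow)
    visible_at: Optional[datetime] = None   # set while a worker holds the task
    created_at: datetime = field(default_factory=_utcnow)
    worker_id: Optional[str] = None         # assumed to be a string identifier
```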
Queue Backend: Redis-Based
Use Redis sorted sets for priority + scheduling:
# Enqueue: score = -priority * 1e12 + scheduled_at (Unix seconds); the lowest score is dequeued first, so a higher priority value runs earlier
ZADD task_queue {score} {task_id}
HSET task:{task_id} payload {json} type {type} attempts 0 max_attempts 3
# Dequeue (worker polling): pop the lowest-score task and record it as in flight.
# Run both commands inside one Lua script (EVAL) so a crash between them cannot lose the task:
ZPOPMIN task_queue 1 → returns [task_id, score]
# Visibility timeout: the inflight score is the deadline after which an un-acked task is returned to task_queue
ZADD task_inflight {now + 30s} {task_id}
Visibility timeout pattern (like SQS): when a worker dequeues a task, the task is moved to an inflight set with a deadline. If the worker crashes before completing it, a background job returns the task to the queue once the deadline has passed. The worker must ACK (ZREM the task from the inflight set) after successful completion.
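The atomic dequeue described above might look roughly like this. It is a sketch under stated assumptions: `DEQUEUE_LUA` and `dequeue` are hypothetical names, and `client` is assumed to be a redis-py `Redis` instance (or anything exposing `eval(script, numkeys, *keys_and_args)`). The script pops the lowest-score member and writes it to the inflight set in one step, so no other command can run in between.

```python
# Lua runs atomically inside Redis: pop the lowest-score task, then
# record it in the inflight set with its ack deadline as the score.
DEQUEUE_LUA = """
local popped = redis.call('ZPOPMIN', KEYS[1], 1)
if #popped == 0 then
  return false
end
redis.call('ZADD', KEYS[2], ARGV[1], popped[1])
return popped[1]
"""


def dequeue(client, now, visibility_timeout=30.0):
    """Return the next task_id, or None if the queue is empty.

    `client` is assumed to behave like a redis-py client; `now` is a
    Unix timestamp in seconds.
    """
    deadline = now + visibility_timeout
    return client.eval(DEQUEUE_LUA, 2, "task_queue", "task_inflight", deadline)
```

Note that this script pops whatever has the lowest score, even if its time component is still in the future. A variant that honors delays would need to inspect the score before popping.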
At-Least-Once Delivery
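Worker flow:
(1) The worker pops the lowest-score task from the task queue with ZPOPMIN.
(2) The task is added to the inflight set with score = now + visibility_timeout. Steps 1 and 2 must run atomically (for example in one Lua script); otherwise a crash between them loses the task.
(3) The worker processes the task.
(4a) On success: ZREM the task from the inflight set and set status=DONE.
(4b) On failure: increment attempts. If attempts < max_attempts, re-add the task to the task queue with an exponential backoff delay added to its score (score += 2^attempts * base_delay). If attempts >= max_attempts, move it to the dead letter queue.
ZPOPMIN returns the lowest score regardless of the current time, so the backoff only delays a retry if the dequeue step skips tasks whose time component is still in the future.
A background job runs every 30 seconds, scans the inflight set for entries with score < now (timed out), and returns them to the task queue. With this in place a task is not lost when a worker crashes, but it may run more than once, so task handlers should be idempotent.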
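The flow can be exercised without Redis using a small in-memory stand-in. This is a toy sketch, not the Redis implementation: `InMemoryQueue` and its method names are hypothetical, plain dicts replace the sorted sets, priority is ignored, and `dequeue` is assumed to skip tasks whose run time is still in the future so that the backoff delay is visible.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryQueue:
    visibility_timeout: float = 30.0
    base_delay: float = 1.0
    max_attempts: int = 3
    ready: dict = field(default_factory=dict)     # task_id -> earliest run time
    inflight: dict = field(default_factory=dict)  # task_id -> ack deadline
    attempts: dict = field(default_factory=dict)  # task_id -> failed attempts
    dead: list = field(default_factory=list)      # dead letter queue

    def enqueue(self, task_id, run_at=0.0):
        self.ready[task_id] = run_at
        self.attempts.setdefault(task_id, 0)

    def dequeue(self, now):
        # Steps 1 and 2: take the earliest due task and mark it in flight.
        due = [(t, tid) for tid, t in self.ready.items() if t <= now]
        if not due:
            return None
        _, task_id = min(due)
        del self.ready[task_id]
        self.inflight[task_id] = now + self.visibility_timeout
        return task_id

    def ack(self, task_id):
        # Step 4a: success, so the task leaves the inflight set for good.
        self.inflight.pop(task_id, None)

    def fail(self, task_id, now):
        # Step 4b: retry with exponential backoff, or dead-letter the task.
        self.inflight.pop(task_id, None)
        self.attempts[task_id] += 1
        n = self.attempts[task_id]
        if n >= self.max_attempts:
            self.dead.append(task_id)
            return "dead"
        self.ready[task_id] = now + self.base_delay * 2 ** n
        return "retry"

    def reap(self, now):
        # Background job: return timed-out inflight tasks to the queue.
        expired = [tid for tid, deadline in self.inflight.items() if deadline < now]
        for tid in expired:
            del self.inflight[tid]
            self.ready[tid] = now
        return expired
```

In this sketch a task reclaimed by `reap` does not count as a failed attempt; whether a crash should consume an attempt is a design choice the flow above leaves open.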
Priority Queues
Three approaches:
(1) Multiple sorted sets (one per priority level). The worker checks the high-priority queue first and falls back to lower levels when it is empty.
(2) Single sorted set with priority encoded in the score: score = -priority * 10^12 + scheduled_at_unix. Higher priority = more negative score = popped first. This requires scheduled_at_unix in seconds: a millisecond timestamp exceeds 10^12 and would spill into the priority band.
(3) Weighted random selection across per-priority queues: pull from high-priority 80% of the time and from normal 20%. This prevents starvation of low-priority tasks when high-priority tasks are abundant.
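A short sketch of approaches (2) and (3) follows. The names `PRIORITY_STRIDE`, `encode_score`, and `pick_queue` are hypothetical, and the queue labels "high" and "normal" are placeholders. The score uses integer arithmetic and assumes timestamps in Unix seconds, which stay below the 10^12 stride.

```python
import random

PRIORITY_STRIDE = 10 ** 12  # must exceed any timestamp used in the score


def encode_score(priority: int, scheduled_at_unix: int) -> int:
    """Lower score pops first: higher priority wins, then earlier time."""
    return -priority * PRIORITY_STRIDE + scheduled_at_unix


def pick_queue(rng: random.Random, high_weight: float = 0.8) -> str:
    """Choose which per-priority queue to poll on this iteration."""
    return "high" if rng.random() < high_weight else "normal"


# A priority-2 task scheduled later still sorts ahead of a priority-1 task.
tasks = [
    ("low-early", encode_score(1, 1_600_000_000)),
    ("high-late", encode_score(2, 1_700_000_000)),
    ("low-late", encode_score(1, 1_700_000_000)),
]
ordered = [name for name, _ in sorted(tasks, key=lambda item: item[1])]
```

With a strict priority encoding, low-priority tasks wait as long as any higher-priority task is queued; the weighted pick trades strict ordering for a guaranteed share of worker time.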
Scheduled Tasks (Cron Jobs)
Scheduled tasks use a separate sorted set: key=scheduled_tasks, score=execute_at timestamp. A scheduler process runs every 1s and fetches due tasks with ZRANGEBYSCORE scheduled_tasks 0 {now} LIMIT 0 100. For each due task: ZADD it to task_queue (computing the priority score), ZREM it from scheduled_tasks, and enqueue the next occurrence if the task is recurring. The scheduler must run as a single instance (or use a distributed lock) to prevent double-scheduling. Use a Redis lock: SET scheduler_lock {instance_id} NX PX 5000. Release it at the end of each loop iteration, and only if the key still holds this instance's ID, so that an instance whose lock has already expired does not delete another instance's lock.
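One scheduler iteration can be sketched in memory as below. The function `scheduler_tick` and its parameters are hypothetical, dicts stand in for the two sorted sets, and `intervals` (task_id to period in seconds) is an assumed way of marking recurring tasks. Locking is left out.

```python
def scheduler_tick(scheduled, task_queue, intervals, now, limit=100):
    """Move due tasks from `scheduled` into `task_queue`.

    scheduled:  dict task_id -> execute_at (stand-in for scheduled_tasks)
    task_queue: dict task_id -> score      (stand-in for task_queue)
    intervals:  dict task_id -> period in seconds, for recurring tasks only
    Returns the task_ids that were enqueued, earliest first.
    """
    due = sorted((t, tid) for tid, t in scheduled.items() if t <= now)[:limit]
    moved = []
    for execute_at, task_id in due:
        del scheduled[task_id]             # ZREM from scheduled_tasks
        task_queue[task_id] = execute_at   # ZADD to task_queue
        if task_id in intervals:
            # Recurring task: schedule the next occurrence.
            scheduled[task_id] = execute_at + intervals[task_id]
        moved.append(task_id)
    return moved
```

Computing the next occurrence from `execute_at` rather than from `now` keeps a recurring task on its original cadence even when the scheduler runs late.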
Dead Letter Queue
Tasks that reach max_attempts are moved to a DLQ: key=dlq:{task_type}. Each DLQ entry stores task_id, payload, failure_reason, last_error_at, and attempt_count. Operations on the DLQ: (1) Manual retry: move a task back to the main queue after investigating. (2) Bulk discard: delete all DLQ tasks of a given type after a fix is deployed. (3) Alerting: send an alert when the DLQ length exceeds a threshold, which indicates a systemic failure for that task type. DLQ tasks are retained for 7 days for debugging.
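The entry layout, retention, and alerting rule could be sketched like this. `DeadLetter`, `purge_expired`, and `should_alert` are hypothetical names, a Python list stands in for the Redis key, and `last_error_at` is assumed to be a Unix timestamp in seconds.

```python
from dataclasses import dataclass

RETENTION_SECONDS = 7 * 24 * 3600  # 7-day retention window


@dataclass
class DeadLetter:
    task_id: str
    payload: dict
    failure_reason: str
    last_error_at: float  # Unix seconds (assumed representation)
    attempt_count: int


def purge_expired(dlq, now):
    """Keep only entries that are still inside the retention window."""
    return [e for e in dlq if now - e.last_error_at < RETENTION_SECONDS]


def should_alert(dlq, threshold):
    """Alert when the DLQ for one task type grows past the threshold."""
    return len(dlq) > threshold
```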
Key Design Decisions
- Visibility timeout pattern (not DELETE on dequeue) — ensures at-least-once delivery on worker crash
- Exponential backoff on retry — prevents hammering a failing dependency
- Separate sorted sets for scheduled tasks and active queue — clean separation of concerns
- Dead letter queue — prevents repeatedly retrying a permanently broken task
- Single scheduler instance with lock — prevents duplicate cron job execution