A distributed task queue decouples work producers from workers, enabling async processing, retries, scheduling, and horizontal scaling. Common use cases: sending transactional emails, processing image uploads, running nightly data exports, and fanning out webhook deliveries. Systems like Celery, Sidekiq, BullMQ, Temporal, and Amazon SQS implement variations of this pattern. Understanding the internals is a frequent system design interview topic at companies with significant async workloads.
Core Architecture
A task queue has three components: producers (enqueue tasks), brokers (store and route tasks), and workers (execute tasks). The broker is the central coordination point. Simple brokers use Redis (list-based queues with BRPOPLPUSH, superseded by BLMOVE since Redis 6.2, for reliable consumption). Production-grade brokers use Kafka (durable, replayable) or RabbitMQ (routing, priorities). The choice depends on durability requirements, delivery semantics (at-least-once vs. exactly-once), and whether tasks need replay.
-- Redis-based reliable queue (BRPOPLPUSH pattern)
-- Producer:
LPUSH queue:email '{"task":"send_email","to":"user@example.com","template":"welcome"}'
-- Worker (atomically moves a task from the queue to a processing list):
task = BRPOPLPUSH queue:email queue:processing 30  -- blocks up to 30s
-- On success, acknowledge by removing from the processing list:
LREM queue:processing 1 task
-- On failure or crash, the task remains in queue:processing
-- A reaper job moves stale items back to queue:email (store the dequeue
-- timestamp in the payload, since list entries carry no metadata):
-- LRANGE queue:processing 0 -1 → check timestamps → RPOPLPUSH back
Reliable Delivery: At-Least-Once Semantics
The key insight: task queues cannot guarantee exactly-once execution without distributed transactions. The standard approach is at-least-once delivery with idempotent task handlers. A task is not acknowledged until the worker successfully completes it. If the worker crashes mid-execution, the task becomes visible again after a visibility timeout (SQS) or is moved back by a reaper job (Redis). Task handlers must therefore be idempotent: running the same task twice produces the same end state. Use the task ID as an idempotency key: check it before processing, and store the result on completion.
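A minimal in-memory sketch of the idempotency-key check (names like ResultStore and RunIdempotent are illustrative; a production system would back this with Redis or a database keyed by task ID):

```go
package main

import (
	"fmt"
	"sync"
)

// ResultStore records completed task IDs and their results.
// In production this would be Redis or a database table, not a map.
type ResultStore struct {
	mu      sync.Mutex
	results map[string]string
}

func NewResultStore() *ResultStore {
	return &ResultStore{results: make(map[string]string)}
}

// RunIdempotent executes handler only if taskID has not completed before;
// on redelivery it returns the stored result, so the side effect runs once.
// Holding the lock across handler() serializes execution; a real system
// would mark the task in-progress instead.
func (s *ResultStore) RunIdempotent(taskID string, handler func() string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	if r, done := s.results[taskID]; done {
		return r // duplicate delivery: skip the handler
	}
	r := handler()
	s.results[taskID] = r
	return r
}

func main() {
	store := NewResultStore()
	calls := 0
	h := func() string { calls++; return "sent" }
	store.RunIdempotent("task-42", h)
	store.RunIdempotent("task-42", h) // redelivered; handler not re-run
	fmt.Println(calls)                // 1
}
```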
Retry Logic and Exponential Backoff
Failed tasks should retry with exponential backoff to avoid a thundering herd against a recovering downstream service. A typical schedule multiplies the delay by 5 each attempt: 1 minute, 5 minutes, 25 minutes, ~2 hours, ~10 hours. After max retries (typically 5), move the task to a dead-letter queue (DLQ) for manual inspection. Store the retry count and next retry time in the task payload. A scheduler process polls for tasks where next_retry_at <= NOW() and re-enqueues them. Alert on-call when DLQ items accumulate; that indicates a systematic failure needing investigation.
package main

import (
	"math"
	"math/rand"
	"time"
)

// retryDelay returns the backoff delay for a 1-based attempt number:
// base * 5^(attempt-1), plus up to 10% random jitter to spread retries.
func retryDelay(attempt int) time.Duration {
	if attempt < 1 {
		attempt = 1 // guard: Pow(5, -1) would shrink the delay
	}
	base := time.Minute
	delay := base * time.Duration(math.Pow(5, float64(attempt-1)))
	jitter := time.Duration(rand.Int63n(int64(delay) / 10))
	return delay + jitter
}

// attempt 1: ~1min, 2: ~5min, 3: ~25min, 4: ~2hr, 5: ~10hr
Priority Queues and Queue Routing
Not all tasks are equally urgent. Implement multiple queues with different priority levels: queue:critical (password reset emails), queue:normal (notification digests), queue:batch (report generation). Workers poll queues in strict priority order: always drain critical before normal, and normal before batch (note that sustained critical load can then starve batch entirely). With Redis, sorted sets (ZADD with a priority score, consumed via ZPOPMIN) give fine-grained priorities within a single queue. Alternatively, run dedicated worker pools per queue tier, say 20 workers for critical and 3 for batch, so slow batch tasks can never starve critical ones.
Scheduled and Recurring Tasks
For cron-style recurring tasks, use a separate scheduler process (not the workers). The scheduler reads a schedule table and enqueues tasks at the right time. Exactly-once scheduling is critical: if two scheduler instances both fire, the task runs twice. Solutions: a distributed lock (Redis SET with NX and an expiry, so a crashed leader's lock eventually releases) held by a single elected scheduler instance, or a database unique constraint on (task_name, scheduled_for) that rejects duplicate enqueue attempts. Temporal and Celery Beat implement variants of this pattern.
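The unique-constraint approach can be sketched in memory (ScheduleLedger and Claim are illustrative names; in production the map would be a database table with a unique index, or a Redis SET NX call):

```go
package main

import (
	"fmt"
	"sync"
)

// ScheduleLedger rejects duplicate enqueue attempts for the same
// (taskName, scheduledFor) pair, mimicking a unique constraint.
type ScheduleLedger struct {
	mu    sync.Mutex
	fired map[string]bool
}

func NewScheduleLedger() *ScheduleLedger {
	return &ScheduleLedger{fired: make(map[string]bool)}
}

// Claim returns true for exactly one caller per (task, slot) key, so only
// one scheduler instance actually enqueues the task for that slot.
func (l *ScheduleLedger) Claim(taskName, scheduledFor string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	key := taskName + "|" + scheduledFor
	if l.fired[key] {
		return false // another scheduler instance already fired this slot
	}
	l.fired[key] = true
	return true
}

func main() {
	ledger := NewScheduleLedger()
	// Two scheduler instances race to fire the same cron slot:
	fmt.Println(ledger.Claim("nightly_export", "2024-05-01T02:00")) // true
	fmt.Println(ledger.Claim("nightly_export", "2024-05-01T02:00")) // false
}
```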
Key Interview Discussion Points
- Visibility timeout tuning: too short and tasks re-execute before completion; too long and failed workers create long delays before retry
- Worker autoscaling: scale worker count from queue depth; target workers ≈ queue depth / (per-worker throughput × target drain time)
- Task serialization: JSON vs. MessagePack vs. Protobuf for payload; consider schema evolution when tasks may be queued before a deployment
- Workflow orchestration vs. simple queues: Temporal/Conductor for multi-step workflows with compensation logic; SQS/Redis for simple fire-and-forget tasks
- Poison pill messages: tasks that always fail (bad data) fill the DLQ; detect by tracking failure rate per task type
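The autoscaling arithmetic from the bullet list can be sketched as a small function (targetWorkers and its parameter names are illustrative; real autoscalers also smooth the queue-depth signal and clamp to min/max pool sizes):

```go
package main

import (
	"fmt"
	"math"
)

// targetWorkers estimates the worker count needed to drain queueDepth
// tasks within drainSeconds, given each worker processes perWorkerRate
// tasks per second: workers = depth / (rate * drain), rounded up.
func targetWorkers(queueDepth int, perWorkerRate, drainSeconds float64) int {
	if queueDepth == 0 {
		return 0
	}
	n := float64(queueDepth) / (perWorkerRate * drainSeconds)
	return int(math.Ceil(n))
}

func main() {
	// 10,000 queued tasks, 2 tasks/sec per worker, drain within 5 minutes:
	fmt.Println(targetWorkers(10000, 2, 300)) // 17
}
```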