Distributed Task Queue: Low-Level Design

A distributed task queue decouples work from workers: producers enqueue tasks, workers consume and execute them asynchronously. This enables horizontal scaling of compute-intensive work (image processing, email sending, ML inference), provides retry logic for transient failures, and prevents slow work from blocking the API request path. Celery, Sidekiq, and AWS SQS are common implementations.

Core Components

Broker: the message queue that stores tasks between production and consumption. Common brokers: Redis (fast, in-memory, supports pub/sub and sorted sets for delayed tasks), RabbitMQ (AMQP protocol, message acknowledgments, dead letter exchanges), Kafka (high-throughput, replay, but more complex). Producer: API server code that enqueues tasks when work should happen asynchronously (user uploads image → enqueue resize task). Worker: processes that continuously poll the broker, execute tasks, and acknowledge completion. Workers scale horizontally — add more workers to increase throughput. Result backend: optional store for task results that producers can poll (Redis, database). Required if the caller needs to know when the task completed or what it returned.
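The components above can be sketched in-process, as a minimal illustration only: `queue.Queue` stands in for the broker, a dict stands in for the result backend, and the `resize` task is a hypothetical example (a real deployment would use Redis, RabbitMQ, or SQS).

```python
import json
import queue
import uuid

# In-process stand-ins for the broker and the result backend.
broker = queue.Queue()
result_backend = {}

def enqueue(func_name, *args):
    """Producer: serialize the task payload and push it to the broker."""
    task_id = str(uuid.uuid4())
    broker.put(json.dumps({"task_id": task_id, "func": func_name, "args": args}))
    result_backend[task_id] = {"status": "PENDING", "result": None}
    return task_id

def worker_step(registry):
    """Worker: pull one task, execute it, record the result."""
    payload = json.loads(broker.get())
    result = registry[payload["func"]](*payload["args"])
    result_backend[payload["task_id"]] = {"status": "SUCCEEDED", "result": result}

# Hypothetical task: halve an image's dimensions.
registry = {"resize": lambda w, h: (w // 2, h // 2)}
tid = enqueue("resize", 800, 600)
worker_step(registry)
```

Because the payload is serialized (JSON here), the producer and worker can run in different processes or on different machines, which is the point of the broker in between.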

Task Lifecycle

(1) Producer enqueues task: serialize task payload (function name, arguments) and push to the broker queue. Record the task_id. (2) Worker polls for tasks: worker pulls task from the queue, acquires it (message is locked/invisible to other workers). (3) Worker executes: runs the task function with the payload arguments. (4) On success: worker acknowledges the message (deletes from queue or moves to completed). Update task status to SUCCEEDED in the result backend. (5) On failure: if retries remain, re-enqueue with exponential backoff delay. If retries exhausted, move to dead letter queue, update status to FAILED. The message lock/invisibility period (SQS: visibility timeout) must be longer than the maximum expected task duration — if it expires before completion, the task is re-delivered (duplicate execution risk).
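The lock/ack/retry mechanics above can be sketched with in-memory structures in place of a real broker, assuming an SQS-style visibility timeout (constants and task shapes are illustrative):

```python
from collections import deque

MAX_RETRIES = 3
VISIBILITY_TIMEOUT = 30.0  # must exceed the longest expected task duration

pending = deque()   # tasks visible to workers
in_flight = {}      # task_id -> (task, lock_expiry); invisible to other workers
dead_letter = []    # tasks whose retries are exhausted

def receive(now):
    """Step 2: pull a task and lock it for VISIBILITY_TIMEOUT seconds."""
    task = pending.popleft()
    in_flight[task["task_id"]] = (task, now + VISIBILITY_TIMEOUT)
    return task

def ack(task_id):
    """Step 4: on success, delete the message so it is never re-delivered."""
    del in_flight[task_id]

def nack(task_id):
    """Step 5: on failure, re-enqueue with a retry count, or dead-letter."""
    task, _ = in_flight.pop(task_id)
    task["retries"] += 1
    if task["retries"] > MAX_RETRIES:
        dead_letter.append(task)
    else:
        pending.append(task)

def redeliver_expired(now):
    """If a lock expires before completion, the task becomes visible again
    (this is the duplicate-execution risk described above)."""
    for tid, (task, expiry) in list(in_flight.items()):
        if now >= expiry:
            del in_flight[tid]
            pending.append(task)
```

`redeliver_expired` is why tasks should be idempotent: a slow worker may still finish a task after its lock expired and another worker picked it up.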

Retry Strategy

Retry with exponential backoff and jitter: delay before retry attempt k = min(base_delay * 2^(k-1) + random_jitter, max_delay). Example with a 30s base delay: attempt 1 → 30s, attempt 2 → 60s, attempt 3 → 120s, attempt 4 → 240s, attempt 5 → dead letter queue. Jitter prevents all retrying tasks from hitting the downstream service simultaneously after a mass failure. Track the retry count in the task payload or a separate store. Classify failures: retryable (network timeout, 503 from a downstream service) vs. non-retryable (invalid input, 400 validation error). Route non-retryable failures straight to the dead letter queue without retrying: the input will never become valid, so retries only waste worker capacity and delay investigation.
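The backoff-and-classify policy above can be sketched as follows; the constants, jitter range, and error categories are illustrative assumptions:

```python
import random

BASE_DELAY = 30.0   # seconds
MAX_DELAY = 600.0
MAX_RETRIES = 4

RETRYABLE = {"timeout", "http_503"}        # transient: worth retrying
NON_RETRYABLE = {"http_400", "bad_input"}  # permanent: dead-letter immediately

def backoff_delay(attempt):
    """Exponential backoff with additive jitter, capped at MAX_DELAY.
    attempt is 1-based: attempt 1 -> ~30s, 2 -> ~60s, 3 -> ~120s, ..."""
    delay = BASE_DELAY * 2 ** (attempt - 1)
    return min(delay + random.uniform(0, delay * 0.1), MAX_DELAY)

def route_failure(error_kind, attempt):
    """Decide where a failed task goes next."""
    if error_kind in NON_RETRYABLE:
        return "dead_letter"   # retrying invalid input cannot succeed
    if attempt >= MAX_RETRIES:
        return "dead_letter"   # retries exhausted
    return "retry"
```

AWS's "full jitter" variant instead draws the whole delay from `random.uniform(0, delay)`; either way, the goal is to spread out the retry thundering herd.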

Priority Queues

Not all tasks are equal: a password reset email is more urgent than a weekly newsletter batch. Implement priority with separate queues (high_priority, normal, low_priority) and configure workers to prefer higher-priority queues. Worker configuration: 70% of workers poll high_priority first, 20% poll normal, 10% poll low_priority. Or use a weighted round-robin: process 5 high-priority tasks per 1 low-priority task. Redis sorted sets support priority natively: ZADD queue {priority_score} {task_id}; workers ZPOPMAX to get the highest-priority task. RabbitMQ has native priority queue support (0-255 priority levels per message).
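The ZADD/ZPOPMAX pattern above can be sketched with Python's `heapq` standing in for the Redis sorted set (task names and priority scores are illustrative; a real worker would call `redis.zadd` / `redis.zpopmax`):

```python
import heapq

# heapq is a min-heap, so priority scores are negated to pop the
# highest priority first; a counter breaks ties in FIFO order.
_heap = []
_counter = 0

def zadd(task_id, priority):
    """Enqueue a task with a numeric priority score."""
    global _counter
    heapq.heappush(_heap, (-priority, _counter, task_id))
    _counter += 1

def zpopmax():
    """Dequeue the highest-priority task."""
    neg_priority, _, task_id = heapq.heappop(_heap)
    return task_id, -neg_priority

zadd("newsletter_batch", 1)
zadd("password_reset", 10)
zadd("report_export", 5)
```

One caveat of the single-sorted-set approach: under a constant stream of high-priority tasks, low-priority tasks can starve, which is why the weighted worker-allocation schemes above are often layered on top.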

Scheduled and Delayed Tasks

Scheduled tasks (run this at 3am daily) and delayed tasks (send this email in 2 hours) require a separate mechanism from the standard FIFO queue. Redis sorted sets: store tasks in a sorted set with the execution timestamp as the score. A scheduler process runs every second, fetching due tasks with ZRANGEBYSCORE queue 0 {current_time}, removing them from the set, and pushing them to the worker queue. Sidekiq's scheduled-job set works this way with Redis. At-most-once vs. at-least-once: if the scheduler removes a task and then crashes before pushing it to the worker queue, the task is lost; if it pushes first and crashes before removing, the task runs twice. Making the remove-and-push step atomic (ZPOPMIN, or a Lua script that pops and pushes in one round trip) rather than a non-atomic two-step ZRANGEBYSCORE + ZREM minimizes this window.
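A minimal in-process sketch of this scheduler loop, with a Python heap standing in for the Redis sorted set (task names and timestamps are illustrative):

```python
import heapq

scheduled = []     # min-heap of (run_at, task_id): the "sorted set"
worker_queue = []  # the ordinary FIFO queue that workers poll

def schedule(task_id, run_at):
    """ZADD equivalent: store the task with its execution time as the score."""
    heapq.heappush(scheduled, (run_at, task_id))

def scheduler_tick(now):
    """Runs every second: move every due task to the worker queue.
    Popping and pushing together mimics the atomic remove-and-enqueue step."""
    moved = 0
    while scheduled and scheduled[0][0] <= now:
        _, task_id = heapq.heappop(scheduled)  # remove from the schedule
        worker_queue.append(task_id)           # hand to ordinary workers
        moved += 1
    return moved

schedule("send_email", run_at=100.0)
schedule("daily_report", run_at=200.0)
```

Because the heap is ordered by timestamp, each tick only inspects the head of the structure, so a tick with no due tasks is O(1) regardless of how many tasks are scheduled.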
