Distributed Task Queue: Low-Level Design

A distributed task queue decouples work from workers: producers enqueue tasks, workers consume and execute them asynchronously. This enables horizontal scaling of compute-intensive work (image processing, email sending, ML inference), provides retry logic for transient failures, and prevents slow work from blocking the API request path. Celery, Sidekiq, and AWS SQS are common implementations.

Core Components

Broker: the message queue that stores tasks between production and consumption. Common brokers: Redis (fast, in-memory, supports pub/sub and sorted sets for delayed tasks), RabbitMQ (AMQP protocol, message acknowledgments, dead letter exchanges), Kafka (high-throughput, replay, but more complex). Producer: API server code that enqueues tasks when work should happen asynchronously (user uploads image → enqueue resize task). Worker: processes that continuously poll the broker, execute tasks, and acknowledge completion. Workers scale horizontally — add more workers to increase throughput. Result backend: optional store for task results that producers can poll (Redis, database). Required if the caller needs to know when the task completed or what it returned.
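The components above can be sketched in-process, as a minimal illustration only: `queue.Queue` stands in for the broker, a dict stands in for the result backend, and the `resize` task is a hypothetical example (a real deployment would use Redis, RabbitMQ, or SQS).

```python
import json
import queue
import uuid

# In-process stand-ins for the broker and the result backend.
broker = queue.Queue()
result_backend = {}

def enqueue(func_name, *args):
    """Producer: serialize the task payload and push it to the broker."""
    task_id = str(uuid.uuid4())
    broker.put(json.dumps({"task_id": task_id, "func": func_name, "args": args}))
    result_backend[task_id] = {"status": "PENDING", "result": None}
    return task_id

def worker_step(registry):
    """Worker: pull one task, execute it, record the result."""
    payload = json.loads(broker.get())
    result = registry[payload["func"]](*payload["args"])
    result_backend[payload["task_id"]] = {"status": "SUCCEEDED", "result": result}

# Hypothetical task: halve an image's dimensions.
registry = {"resize": lambda w, h: (w // 2, h // 2)}
tid = enqueue("resize", 800, 600)
worker_step(registry)
```

Because the payload is serialized (JSON here), the producer and worker can run in different processes or on different machines, which is the point of the broker in between.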

Task Lifecycle

(1) Producer enqueues task: serialize task payload (function name, arguments) and push to the broker queue. Record the task_id. (2) Worker polls for tasks: worker pulls task from the queue, acquires it (message is locked/invisible to other workers). (3) Worker executes: runs the task function with the payload arguments. (4) On success: worker acknowledges the message (deletes from queue or moves to completed). Update task status to SUCCEEDED in the result backend. (5) On failure: if retries remain, re-enqueue with exponential backoff delay. If retries exhausted, move to dead letter queue, update status to FAILED. The message lock/invisibility period (SQS: visibility timeout) must be longer than the maximum expected task duration — if it expires before completion, the task is re-delivered (duplicate execution risk).
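The lock/ack/retry mechanics above can be sketched with in-memory structures in place of a real broker, assuming an SQS-style visibility timeout (constants and task shapes are illustrative):

```python
from collections import deque

MAX_RETRIES = 3
VISIBILITY_TIMEOUT = 30.0  # must exceed the longest expected task duration

pending = deque()   # tasks visible to workers
in_flight = {}      # task_id -> (task, lock_expiry); invisible to other workers
dead_letter = []    # tasks whose retries are exhausted

def receive(now):
    """Step 2: pull a task and lock it for VISIBILITY_TIMEOUT seconds."""
    task = pending.popleft()
    in_flight[task["task_id"]] = (task, now + VISIBILITY_TIMEOUT)
    return task

def ack(task_id):
    """Step 4: on success, delete the message so it is never re-delivered."""
    del in_flight[task_id]

def nack(task_id):
    """Step 5: on failure, re-enqueue with a retry count, or dead-letter."""
    task, _ = in_flight.pop(task_id)
    task["retries"] += 1
    if task["retries"] > MAX_RETRIES:
        dead_letter.append(task)
    else:
        pending.append(task)

def redeliver_expired(now):
    """If a lock expires before completion, the task becomes visible again
    (this is the duplicate-execution risk described above)."""
    for tid, (task, expiry) in list(in_flight.items()):
        if now >= expiry:
            del in_flight[tid]
            pending.append(task)
```

`redeliver_expired` is why tasks should be idempotent: a slow worker may still finish a task after its lock expired and another worker picked it up.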

Retry Strategy

Retry with exponential backoff and jitter: delay before retry attempt k = min(base_delay * 2^(k-1) + random_jitter, max_delay). Example with a 30s base delay: attempt 1 → 30s, attempt 2 → 60s, attempt 3 → 120s, attempt 4 → 240s, attempt 5 → dead letter queue. Jitter prevents all retrying tasks from hitting the downstream service simultaneously after a mass failure. Track the retry count in the task payload or a separate store. Classify failures: retryable (network timeout, 503 from a downstream service) vs. non-retryable (invalid input, 400 validation error). Route non-retryable failures straight to the dead letter queue without retrying: the input will never become valid, so retries only waste worker capacity and delay investigation.
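The backoff-and-classify policy above can be sketched as follows; the constants, jitter range, and error categories are illustrative assumptions:

```python
import random

BASE_DELAY = 30.0   # seconds
MAX_DELAY = 600.0
MAX_RETRIES = 4

RETRYABLE = {"timeout", "http_503"}        # transient: worth retrying
NON_RETRYABLE = {"http_400", "bad_input"}  # permanent: dead-letter immediately

def backoff_delay(attempt):
    """Exponential backoff with additive jitter, capped at MAX_DELAY.
    attempt is 1-based: attempt 1 -> ~30s, 2 -> ~60s, 3 -> ~120s, ..."""
    delay = BASE_DELAY * 2 ** (attempt - 1)
    return min(delay + random.uniform(0, delay * 0.1), MAX_DELAY)

def route_failure(error_kind, attempt):
    """Decide where a failed task goes next."""
    if error_kind in NON_RETRYABLE:
        return "dead_letter"   # retrying invalid input cannot succeed
    if attempt >= MAX_RETRIES:
        return "dead_letter"   # retries exhausted
    return "retry"
```

AWS's "full jitter" variant instead draws the whole delay from `random.uniform(0, delay)`; either way, the goal is to spread out the retry thundering herd.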

Priority Queues

Not all tasks are equal: a password reset email is more urgent than a weekly newsletter batch. Implement priority with separate queues (high_priority, normal, low_priority) and configure workers to prefer higher-priority queues. Worker configuration: 70% of workers poll high_priority first, 20% poll normal, 10% poll low_priority. Or use a weighted round-robin: process 5 high-priority tasks per 1 low-priority task. Redis sorted sets support priority natively: ZADD queue {priority_score} {task_id}; workers ZPOPMAX to get the highest-priority task. RabbitMQ has native priority queue support (0-255 priority levels per message).
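The ZADD/ZPOPMAX pattern above can be sketched with Python's `heapq` standing in for the Redis sorted set (task names and priority scores are illustrative; a real worker would call `redis.zadd` / `redis.zpopmax`):

```python
import heapq

# heapq is a min-heap, so priority scores are negated to pop the
# highest priority first; a counter breaks ties in FIFO order.
_heap = []
_counter = 0

def zadd(task_id, priority):
    """Enqueue a task with a numeric priority score."""
    global _counter
    heapq.heappush(_heap, (-priority, _counter, task_id))
    _counter += 1

def zpopmax():
    """Dequeue the highest-priority task."""
    neg_priority, _, task_id = heapq.heappop(_heap)
    return task_id, -neg_priority

zadd("newsletter_batch", 1)
zadd("password_reset", 10)
zadd("report_export", 5)
```

One caveat of the single-sorted-set approach: under a constant stream of high-priority tasks, low-priority tasks can starve, which is why the weighted worker-allocation schemes above are often layered on top.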

Scheduled and Delayed Tasks

Scheduled tasks (run this at 3am daily) and delayed tasks (send this email in 2 hours) require a separate mechanism from the standard FIFO queue. Redis sorted sets: store tasks in a sorted set with the execution timestamp as the score. A scheduler process runs every second, fetching due tasks with ZRANGEBYSCORE queue 0 {current_time}, removing them from the set, and pushing them to the worker queue. Sidekiq's scheduled-job set works this way with Redis. At-most-once vs. at-least-once: if the scheduler removes a task and then crashes before pushing it to the worker queue, the task is lost; if it pushes first and crashes before removing, the task runs twice. Making the remove-and-push step atomic (ZPOPMIN, or a Lua script that pops and pushes in one round trip) rather than a non-atomic two-step ZRANGEBYSCORE + ZREM minimizes this window.
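A minimal in-process sketch of this scheduler loop, with a Python heap standing in for the Redis sorted set (task names and timestamps are illustrative):

```python
import heapq

scheduled = []     # min-heap of (run_at, task_id): the "sorted set"
worker_queue = []  # the ordinary FIFO queue that workers poll

def schedule(task_id, run_at):
    """ZADD equivalent: store the task with its execution time as the score."""
    heapq.heappush(scheduled, (run_at, task_id))

def scheduler_tick(now):
    """Runs every second: move every due task to the worker queue.
    Popping and pushing together mimics the atomic remove-and-enqueue step."""
    moved = 0
    while scheduled and scheduled[0][0] <= now:
        _, task_id = heapq.heappop(scheduled)  # remove from the schedule
        worker_queue.append(task_id)           # hand to ordinary workers
        moved += 1
    return moved

schedule("send_email", run_at=100.0)
schedule("daily_report", run_at=200.0)
```

Because the heap is ordered by timestamp, each tick only inspects the head of the structure, so a tick with no due tasks is O(1) regardless of how many tasks are scheduled.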
