A distributed task queue decouples work production from work execution. Instead of processing a request synchronously (blocking the HTTP response), the server enqueues a job that a worker picks up asynchronously. This enables: horizontal scaling of workers independently of API servers, retry on failure, deferral of expensive work (video encoding, email sending, report generation) to off-peak hours, and durability — jobs survive server crashes.
Requirements
Functional: Enqueue jobs from any producer. Workers pick up and execute jobs. Failed jobs retry with backoff. Job results are queryable. Support job priorities and scheduled/delayed jobs.
Non-functional: At-least-once delivery (no lost jobs). Horizontal scaling to 10K+ workers. Sub-second job pickup latency for high-priority work. Job durability: survive broker restarts.
Core Components
Producer: the API server or service that creates jobs. Serializes the job payload (function name, arguments, metadata) and publishes to the broker.
Broker: the message store (Redis, RabbitMQ, Amazon SQS, Kafka). Holds jobs between producer and workers. Provides ordering, deduplication, and persistence guarantees depending on the implementation.
Worker: pulls jobs from the broker, executes them, and acknowledges success or reports failure. Workers are stateless and horizontally scalable.
Result backend: stores job results and status (pending, started, success, failure, retried). Redis or a relational database.
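The producer side can be sketched as a minimal serializable job record. The field names here (function, args, retry_count) and the JSON wire format are illustrative assumptions, not a fixed protocol:

```python
import json
import time
import uuid
from dataclasses import dataclass, field, asdict

@dataclass
class Job:
    function: str  # dotted path to the handler, e.g. "tasks.send_email"
    args: list
    id: str = field(default_factory=lambda: uuid.uuid4().hex)
    enqueued_at: float = field(default_factory=time.time)
    retry_count: int = 0

def serialize(job: Job) -> str:
    # Plain JSON rather than pickled objects, so workers in other
    # languages can consume the same queue
    return json.dumps(asdict(job))

def deserialize(payload: str) -> Job:
    return Job(**json.loads(payload))
```

A producer then just calls something like broker.push(serialize(Job("tasks.send_email", ["a@example.com"]))); the broker sees an opaque string.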
Job Lifecycle and Visibility Timeout
The visibility timeout is the core mechanism for at-least-once delivery. When a worker picks up a job, the broker hides it from other workers for T seconds (the visibility timeout). If the worker crashes before acknowledging, the job becomes visible again after T seconds and is re-delivered. If the worker succeeds, it deletes the job from the queue. This guarantees no job is lost, at the cost of occasional duplicate execution — for example, when a worker finishes the job but crashes before deleting it.
import time

MAX_RETRIES = 5

class Worker:
    def __init__(self, queue, visibility_timeout=30):
        self.queue = queue
        self.visibility_timeout = visibility_timeout

    def run(self):
        while True:
            job = self.queue.receive(visibility_timeout=self.visibility_timeout)
            if not job:
                time.sleep(0.1)  # idle backoff to avoid hammering the broker
                continue
            try:
                result = execute(job.function, job.args)
                self.queue.delete(job)  # acknowledge success
                store_result(job.id, result)
            except Exception as e:
                # Not deleted: the job becomes visible again after the
                # visibility timeout and is retried by another worker
                log_failure(job.id, e)
                if job.retry_count >= MAX_RETRIES:
                    self.queue.send_to_dlq(job)
                    self.queue.delete(job)
Retry and Exponential Backoff
Jobs that fail should retry with exponential backoff to avoid thundering herds. The delay between retries grows geometrically: 30s, 60s, 120s, 240s, 480s. After max retries (typically 5), move to the dead-letter queue (DLQ).
import random

def retry_delay(attempt: int) -> int:
    # Base 30s, doubling per attempt, capped at 30min, plus jitter
    delay = min(30 * (2 ** attempt), 1800)
    jitter = random.randint(0, delay // 4)
    return delay + jitter
Jitter prevents all failed jobs from retrying simultaneously after a broker restart (retry storm).
Dead-Letter Queue
Jobs that exhaust retries are moved to a DLQ. The DLQ is a separate queue for manual inspection and replay. Operators analyze DLQ messages to identify systemic bugs. After fixing the root cause, DLQ messages are replayed to the main queue. Always configure DLQ retention (7-14 days) and alerting when DLQ depth grows above threshold.
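A replay tool can be sketched as a bounded drain from the DLQ back to the main queue. MemQueue is an in-memory stand-in for the broker, and the receive/delete API is an assumption for illustration:

```python
from collections import deque

class MemQueue:
    """Minimal in-memory stand-in for a broker queue (illustration only)."""
    def __init__(self):
        self.items = deque()
    def enqueue(self, job):
        self.items.append(job)
    def receive(self):
        # Peek without removing; delete() acknowledges, mirroring the
        # receive/delete split real brokers use
        return self.items[0] if self.items else None
    def delete(self, job):
        self.items.remove(job)

def replay_dlq(dlq, main_queue, limit=100):
    # Move up to `limit` jobs back to the main queue after the root
    # cause is fixed; delete from the DLQ only after re-enqueue succeeds,
    # so a crash mid-replay cannot lose a message
    moved = 0
    while moved < limit:
        job = dlq.receive()
        if job is None:
            break
        main_queue.enqueue(job)
        dlq.delete(job)
        moved += 1
    return moved
```

The `limit` parameter matters in practice: replaying a large DLQ all at once can re-trigger the overload that caused the failures.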
Priority Queues
Implement priority by maintaining separate queues per priority level. Workers poll high-priority queues first:
QUEUES = ["high_priority", "default", "low_priority"]

def get_next_job(queues=QUEUES):
    # Drain higher-priority queues first with non-blocking polls
    for queue_name in queues:
        job = dequeue(queue_name, timeout=0)  # non-blocking
        if job:
            return job
    # Nothing ready: block on the lowest-priority queue with a short
    # timeout so the loop re-checks the higher-priority queues at
    # least once per second
    return dequeue("low_priority", timeout=1)
Celery supports this via task routing to separate queues, or via RabbitMQ's per-message priority (x-max-priority). SQS does not natively support priorities — use separate SQS queues per priority.
Scheduled and Delayed Jobs
Store delayed jobs in a Redis sorted set keyed by scheduled execution time. A scheduler process (Celery Beat, Sidekiq-Cron) runs every second, queries for jobs with scheduled_at <= now, and moves them to the main queue:
def scheduler_loop():
    while True:
        now = time.time()
        # ZRANGEBYSCORE: fetch all jobs with scheduled_at <= now
        jobs = redis.zrangebyscore("scheduled_jobs", 0, now)
        for job in jobs:
            # ZREM returns 1 only for the caller that removed the member,
            # so concurrent schedulers cannot double-enqueue the same job
            if redis.zrem("scheduled_jobs", job):
                enqueue(deserialize(job))
        time.sleep(1.0)
At-Least-Once vs Exactly-Once
At-least-once delivery (the default) means jobs may execute more than once if a worker crashes after executing but before acknowledging. Design job handlers to be idempotent — safe to run multiple times with the same input. Common patterns: INSERT OR IGNORE with a unique job_id in the database, check-and-set with a Redis key, or using the job_id as an idempotency key for external API calls.
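The database-unique-constraint pattern can be sketched with SQLite's INSERT OR IGNORE. The run_once wrapper and processed_jobs table are illustrative names, not a standard API:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE processed_jobs (job_id TEXT PRIMARY KEY, result TEXT)"
)

def run_once(job_id: str, handler):
    # Claim the job_id first; the PRIMARY KEY makes a duplicate claim a
    # no-op, so a re-delivered job returns early instead of re-running
    cur = conn.execute(
        "INSERT OR IGNORE INTO processed_jobs (job_id) VALUES (?)", (job_id,)
    )
    if cur.rowcount == 0:
        return "duplicate-skipped"  # another delivery already ran this job
    result = handler()
    conn.execute(
        "UPDATE processed_jobs SET result = ? WHERE job_id = ?",
        (result, job_id),
    )
    conn.commit()
    return result
```

Note the tradeoff in this sketch: the job_id is claimed before the handler runs, so a crash mid-handler leaves the job marked as done. For stricter semantics, perform the claim and the handler's database writes inside one transaction.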
Exactly-once delivery requires distributed transactions or two-phase commit between the queue acknowledgment and the database write. Kafka supports this with its transactional producer API (beginTransaction, write to the topic, sendOffsetsToTransaction, commitTransaction). The cost is 3-5x latency overhead — use it only when duplicate execution has real consequences (billing, payment processing).
Broker Comparison
- Redis (Celery/Sidekiq broker): Sub-millisecond enqueue/dequeue. In-memory with optional AOF persistence. Not as durable as dedicated brokers. Best for jobs that can tolerate occasional redelivery.
- RabbitMQ: AMQP protocol, routing, exchange patterns, message TTL, DLX (dead-letter exchange). Best for complex routing and per-message priority. Requires careful queue configuration for durability.
- Amazon SQS: Fully managed, auto-scales, up to 14-day message retention, FIFO queues for exactly-once processing. Free tier of 1 million requests/month. Best for AWS-native applications without operational overhead.
- Kafka: Log-based, consumer groups, replay from offset, 7-day retention. Best for event streaming, fan-out to multiple consumers, and audit trails. Overkill for simple job queues.
Scaling to 10K Workers
Workers are stateless — scale horizontally. The broker is the bottleneck. Redis cluster partitions queues across nodes. SQS scales automatically. For Redis: use one broker per queue family, avoid a single hot queue. Optimize prefetch — workers should batch-fetch 10-20 jobs at once (long polling) to reduce broker RPCs. Monitor: queue depth, job age (time from enqueue to start), worker throughput, DLQ depth.
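The prefetch advice above can be sketched as a batch-fetch loop. Here queue.receive() is assumed to be a non-blocking call that returns None when the queue is empty:

```python
import time

def batch_receive(queue, max_jobs=10, max_wait=1.0):
    """Collect up to max_jobs per pass, waiting at most max_wait seconds,
    to amortize broker round trips across many jobs."""
    deadline = time.monotonic() + max_wait
    batch = []
    while len(batch) < max_jobs:
        job = queue.receive()  # assumed non-blocking, None when empty
        if job is not None:
            batch.append(job)
        elif batch or time.monotonic() >= deadline:
            # Return a partial batch immediately rather than holding
            # fetched jobs while waiting for a full batch
            break
        else:
            time.sleep(0.05)  # brief idle wait before re-polling
    return batch
```

With SQS the equivalent is ReceiveMessage with MaxNumberOfMessages and WaitTimeSeconds (long polling), which does this server-side in one API call.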
Frequently Asked Questions
What is the visibility timeout in a message queue and why is it needed?
The visibility timeout is the period during which a dequeued message is hidden from other consumers. When a worker picks up a job, the broker marks it invisible for T seconds. If the worker successfully processes and acknowledges (deletes) the job within T seconds, the job is gone. If the worker crashes or exceeds T seconds, the job becomes visible again and another worker picks it up. This mechanism achieves at-least-once delivery: a job can be delivered multiple times (if a worker crashes after starting but before acknowledging), but a job is never lost. The visibility timeout must be longer than the maximum expected job execution time — if a job normally takes 25 seconds, set the timeout to 60+ seconds to avoid false re-deliveries. Workers processing long jobs should periodically extend the visibility timeout (heartbeat) to prevent re-delivery during normal slow execution.
What is a dead-letter queue and how do you use it in production?
A dead-letter queue (DLQ) is a separate queue where messages are moved after they fail to process successfully after a maximum number of retries. When a message exhausts its retry count (typically 5 attempts with exponential backoff), it is moved to the DLQ instead of being discarded. The DLQ serves three purposes: (1) Investigation — inspect failed messages to understand why they are failing (bug in handler, malformed payload, dependency outage); (2) Alerting — set CloudWatch or Datadog alerts when DLQ depth exceeds a threshold; (3) Replay — after fixing the root cause, replay DLQ messages to the main queue for reprocessing. Always configure DLQ retention of 7-14 days to give engineers time to diagnose and fix issues before messages expire. Never discard messages from the DLQ without logging the payload and failure reason.
How do you make a task queue job handler idempotent?
At-least-once delivery means any job can be delivered multiple times. Job handlers must be idempotent — running the same job twice produces the same result as running it once. Strategies: (1) Database unique constraint: use INSERT … ON CONFLICT DO NOTHING with the job_id as the unique key. If the job already ran, the insert silently skips. (2) Check-and-set with Redis: SET job:{id}:done 1 NX EX 86400. If the key already exists, the job ran — return early. (3) Idempotency key with external APIs: Stripe, Twilio, and most payment APIs accept an Idempotency-Key header. Pass the job_id as this key — duplicate calls return the same response without charging again. (4) Upsert semantics: UPDATE … WHERE id=X instead of INSERT — safe to run multiple times. The hardest case: non-idempotent side effects like sending an email. Use a sent_at timestamp column with UPDATE WHERE sent_at IS NULL — only the first successful attempt marks it sent.
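The sent_at pattern from the answer above, sketched with SQLite (the table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE emails (id INTEGER PRIMARY KEY, body TEXT, sent_at REAL)"
)
conn.execute("INSERT INTO emails (id, body) VALUES (1, 'Welcome!')")

def send_email_job(email_id: int, send) -> bool:
    # Atomically claim the email: only one delivery can flip sent_at
    # from NULL, so duplicate deliveries return False without sending
    cur = conn.execute(
        "UPDATE emails SET sent_at = strftime('%s','now') "
        "WHERE id = ? AND sent_at IS NULL",
        (email_id,),
    )
    conn.commit()
    if cur.rowcount == 0:
        return False  # already claimed by an earlier delivery
    send(email_id)  # external side effect, runs at most once
    return True
```

The caveat: a crash between the claim and the actual send drops the email (at-most-once for the side effect). The claim narrows the duplicate window rather than eliminating it, which is usually the right tradeoff for email.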
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "What is the visibility timeout in a message queue and why is it needed?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "The visibility timeout is the period during which a dequeued message is hidden from other consumers. When a worker picks up a job, the broker marks it invisible for T seconds. If the worker successfully processes and acknowledges (deletes) the job within T seconds, the job is gone. If the worker crashes or exceeds T seconds, the job becomes visible again and another worker picks it up. This mechanism achieves at-least-once delivery: a job can be delivered multiple times (if a worker crashes after starting but before acknowledging), but a job is never lost. The visibility timeout must be longer than the maximum expected job execution time — if a job normally takes 25 seconds, set the timeout to 60+ seconds to avoid false re-deliveries. Workers processing long jobs should periodically extend the visibility timeout (heartbeat) to prevent re-delivery during normal slow execution."
      }
    },
    {
      "@type": "Question",
      "name": "What is a dead-letter queue and how do you use it in production?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "A dead-letter queue (DLQ) is a separate queue where messages are moved after they fail to process successfully after a maximum number of retries. When a message exhausts its retry count (typically 5 attempts with exponential backoff), it is moved to the DLQ instead of being discarded. The DLQ serves three purposes: (1) Investigation — inspect failed messages to understand why they are failing (bug in handler, malformed payload, dependency outage); (2) Alerting — set CloudWatch or Datadog alerts when DLQ depth exceeds a threshold; (3) Replay — after fixing the root cause, replay DLQ messages to the main queue for reprocessing. Always configure DLQ retention of 7-14 days to give engineers time to diagnose and fix issues before messages expire. Never discard messages from the DLQ without logging the payload and failure reason."
      }
    },
    {
      "@type": "Question",
      "name": "How do you make a task queue job handler idempotent?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "At-least-once delivery means any job can be delivered multiple times. Job handlers must be idempotent — running the same job twice produces the same result as running it once. Strategies: (1) Database unique constraint: use INSERT … ON CONFLICT DO NOTHING with the job_id as the unique key. If the job already ran, the insert silently skips. (2) Check-and-set with Redis: SET job:{id}:done 1 NX EX 86400. If the key already exists, the job ran — return early. (3) Idempotency key with external APIs: Stripe, Twilio, and most payment APIs accept an Idempotency-Key header. Pass the job_id as this key — duplicate calls return the same response without charging again. (4) Upsert semantics: UPDATE … WHERE id=X instead of INSERT — safe to run multiple times. The hardest case: non-idempotent side effects like sending an email. Use a sent_at timestamp column with UPDATE WHERE sent_at IS NULL — only the first successful attempt marks it sent."
      }
    }
  ]
}