Why Task Queues?
Task queues decouple work production from work execution. Instead of processing a long-running task synchronously (blocking the HTTP response for 30 seconds), you push the task to a queue and return a job ID immediately. Background workers pull from the queue and execute the task asynchronously. Common use cases: sending emails/SMS, generating PDF reports, image/video processing, ML model inference, ETL jobs, webhook delivery, and scheduled maintenance tasks. The HTTP response time drops from 30 seconds to milliseconds; the user experience improves dramatically.
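The decoupling can be sketched in-process: a minimal illustration using Python's `queue` module and a thread as stand-ins for the broker and the worker fleet (the `enqueue`/`worker` names and the job payload are invented for this sketch):

```python
import queue
import threading
import uuid

jobs = queue.Queue()          # stand-in for the broker
results = {}                  # stand-in for the result store

def enqueue(task):
    """Producer: push the task and return a job ID immediately."""
    job_id = uuid.uuid4().hex
    jobs.put((job_id, task))
    return job_id  # the HTTP handler returns this in milliseconds

def worker():
    """Background worker: pull and execute tasks asynchronously."""
    while True:
        job_id, task = jobs.get()
        results[job_id] = "done: " + task["name"]  # stand-in for real work
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

job_id = enqueue({"name": "generate_report"})
jobs.join()  # the demo waits; real callers poll the result store by job ID
assert results[job_id] == "done: generate_report"
```

The caller never blocks on the work itself; it only holds the job ID and checks the result store later.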
Core Architecture
# Components:
# 1. Producer: API server that enqueues jobs
# 2. Broker: message queue that stores jobs (Redis, SQS, RabbitMQ, Kafka)
# 3. Workers: processes that pull and execute jobs
# 4. Result store: where job results are stored (Redis, database)
# 5. Scheduler: handles delayed and recurring jobs (cron)
# Basic Celery example (Python):
from celery import Celery

app = Celery(
    "tasks",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/1",
)

@app.task(bind=True, max_retries=3)
def send_email(self, user_id: int, template: str):
    try:
        user = db.get_user(user_id)  # db and email_service defined elsewhere
        email_service.send(user.email, template)
    except EmailException as e:
        raise self.retry(exc=e, countdown=60)  # retry after 60s
# Producer (in Django view):
send_email.delay(user_id=42, template="welcome") # enqueues, returns AsyncResult
# Delayed execution:
send_email.apply_async(args=[42, "followup"], countdown=3600) # 1 hour from now
# Periodic tasks (cron):
from celery.schedules import crontab

app.conf.beat_schedule = {
    "daily-report": {
        "task": "tasks.generate_report",
        "schedule": crontab(hour=9, minute=0),  # 9 AM daily
    },
}
Delivery Semantics
The most critical design decision for any task queue is its delivery guarantee:
- At-most-once: task may be lost if a worker crashes mid-execution. The queue deletes the message before executing. Used when task loss is acceptable (analytics events, non-critical notifications). No retry on failure.
- At-least-once: task will eventually execute but may execute more than once. The queue keeps the message until the worker acknowledges success. If the worker crashes, the message becomes visible again and another worker picks it up. This is the standard for most task queues (Celery with Redis/SQS, Sidekiq). Requires idempotent task implementations.
- Exactly-once: task executes exactly once despite failures. Requires two-phase commit or distributed transactions. Very hard to achieve; Temporal’s workflow engine approximates it using durable event sourcing.
# Making tasks idempotent (required for at-least-once):
@app.task
def charge_user(payment_id: str):
    # Check if already processed (idempotency key = payment_id)
    if db.exists(f"processed_payment:{payment_id}"):
        return {"status": "already_processed"}
    with db.transaction():
        payment = db.get_payment(payment_id)
        if payment.status == "pending":
            stripe.charge(payment)
            db.update_payment(payment_id, status="charged")
            db.set(f"processed_payment:{payment_id}", True, ex=86400)
    return {"status": "charged"}
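Why this matters can be seen by simulating at-least-once delivery directly: deliver the same message twice and check that the side effect happens once. This sketch uses in-memory stand-ins (`processed`, `charges`) for the idempotency-key store and the payment provider:

```python
processed = set()  # stand-in for the idempotency-key store (Redis/db)
charges = []       # stand-in for the payment provider

def charge_user(payment_id):
    # The payment ID doubles as the idempotency key
    if payment_id in processed:
        return "already_processed"
    charges.append(payment_id)  # the side effect happens at most once
    processed.add(payment_id)
    return "charged"

# At-least-once delivery: the same message arrives twice
assert charge_user("pay_123") == "charged"
assert charge_user("pay_123") == "already_processed"
assert charges == ["pay_123"]  # exactly one charge despite two deliveries
```

Running the task twice produces the same end state as running it once, which is the definition of idempotence the at-least-once guarantee requires.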
Priority Queues
Different tasks have different urgency. Email delivery (non-urgent) should not block password reset emails (urgent). Implement priority via multiple queues with different worker allocations:
# Celery worker consuming multiple queues with priority:
celery -A tasks worker --queues=critical,high,normal,low -c 10
# Note: Celery does not strictly prioritize by queue order; a worker
# consuming several queues interleaves them. For hard priority, run
# dedicated worker pools per queue (more processes for critical queues).
# Queue-based priority in practice:
from kombu import Queue

app.conf.task_queues = (
    Queue("critical", routing_key="critical"),  # e.g. 4 dedicated workers
    Queue("high", routing_key="high"),          # 3 workers
    Queue("normal", routing_key="normal"),      # 2 workers
    Queue("low", routing_key="low"),            # 1 worker
)
# Route tasks to queues by type:
@app.task(queue="critical")
def send_password_reset(user_id: int): ...
@app.task(queue="low")
def generate_monthly_report(month: str): ...
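At its core, strict-priority dispatch is just "check the higher queues first". A minimal in-process sketch, with in-memory deques standing in for the broker's queues (a dedicated-worker-pool setup achieves the same effect across processes; note that under sustained critical load this scheme starves the low queue):

```python
from collections import deque

# One deque per priority level, highest first (stand-in for broker queues)
queues = {name: deque() for name in ("critical", "high", "normal", "low")}

def enqueue(priority, task):
    queues[priority].append(task)

def next_task():
    # Always drain higher-priority queues before lower ones
    for name in ("critical", "high", "normal", "low"):
        if queues[name]:
            return queues[name].popleft()
    return None

enqueue("low", "monthly_report")
enqueue("critical", "password_reset")
assert next_task() == "password_reset"  # critical jumps the line
assert next_task() == "monthly_report"
```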
Delay Queues and Scheduled Jobs
Delay queues hold messages until a specified future time. SQS supports delayed delivery natively, but its per-message DelaySeconds tops out at 15 minutes. For longer delays (hours/days), use a different mechanism:
# Redis sorted set as delay queue:
# Score = Unix timestamp when the job should become visible
import redis, time, json

r = redis.Redis()

def enqueue_delayed(job: dict, delay_seconds: int):
    execute_at = time.time() + delay_seconds
    r.zadd("delayed_jobs", {json.dumps(job): execute_at})

def poll_ready_jobs():
    now = time.time()
    # Get up to 100 jobs with score <= now (ready to execute)
    jobs = r.zrangebyscore("delayed_jobs", 0, now, start=0, num=100)
    if jobs:
        # Remove exactly the members we fetched. Removing the whole score
        # range would also delete ready jobs beyond the first 100 without
        # returning them. (With multiple pollers, make fetch-and-remove
        # atomic via a Lua script or ZPOPMIN.)
        r.zrem("delayed_jobs", *jobs)
    return [json.loads(j) for j in jobs]
# Poll every second, move ready jobs to the normal task queue
# Sidekiq's scheduled and retry sets work this way internally: a Redis
# sorted set polled by the poller process
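The same pattern can be modeled in-process with a min-heap, which makes the mechanics easy to test deterministically (timestamps are passed in explicitly; job names are invented for the sketch):

```python
import heapq
import time

delayed = []  # min-heap of (execute_at, job), ordered by timestamp

def enqueue_delayed(job, delay_seconds, now):
    heapq.heappush(delayed, (now + delay_seconds, job))

def poll_ready(now):
    # Pop every job whose execute_at has passed; leave the rest queued
    ready = []
    while delayed and delayed[0][0] <= now:
        ready.append(heapq.heappop(delayed)[1])
    return ready

t0 = time.time()
enqueue_delayed("send_followup", 3600, t0)
enqueue_delayed("send_welcome", 0, t0)
assert poll_ready(t0) == ["send_welcome"]          # only due jobs surface
assert poll_ready(t0 + 3600) == ["send_followup"]  # an hour later
```

The Redis sorted set plays exactly the role of this heap, with the added benefits of durability and visibility to every poller.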
Dead Letter Queue (DLQ)
When a job fails all retry attempts, it should not be silently discarded. A dead letter queue (DLQ) captures all permanently failed jobs for inspection and replay:
# SQS DLQ setup:
# Main queue: maxReceiveCount=5 → after 5 delivery attempts, move to DLQ
# DLQ: separate queue for failed messages, with long retention (14 days)
# Celery with DLQ pattern:
@app.task(bind=True, max_retries=5,
          acks_late=True,              # acknowledge only after successful execution
          reject_on_worker_lost=True)  # re-queue if worker crashes mid-task
def process_webhook(self, payload: dict):
    try:
        handle_webhook(payload)
    except TemporaryError as e:
        raise self.retry(exc=e, countdown=2 ** self.request.retries)  # exponential backoff
    except PermanentError as e:
        # Do not retry — send to DLQ
        dlq.publish({"task": "process_webhook", "payload": payload, "error": str(e)})
        return  # mark as complete (not retried)
# DLQ monitoring: alert when DLQ depth > threshold
# DLQ replay: fix the bug, then reprocess jobs from DLQ
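The retry-budget-then-park-then-replay lifecycle can be sketched with in-memory lists standing in for the main queue and the DLQ (`process`, `replay_dlq`, and the message shape are invented for this sketch):

```python
main_queue = []   # stand-in for the main queue
dlq = []          # stand-in for the dead letter queue
MAX_ATTEMPTS = 5  # analogous to SQS maxReceiveCount

def always_fails(payload):
    raise ValueError("bad payload")

def process(msg, handler):
    try:
        handler(msg["payload"])
    except Exception as exc:
        msg["attempts"] = msg.get("attempts", 0) + 1
        if msg["attempts"] >= MAX_ATTEMPTS:
            dlq.append({**msg, "error": str(exc)})  # park for inspection
        else:
            main_queue.append(msg)  # redeliver for another attempt

def replay_dlq():
    """After the bug is fixed, move DLQ messages back for reprocessing."""
    while dlq:
        msg = dlq.pop(0)
        msg["attempts"] = 0  # reset the retry budget
        main_queue.append(msg)

msg = {"payload": "evt_1", "attempts": 4}
process(msg, always_fails)
assert len(dlq) == 1 and not main_queue  # fifth failure parks it
replay_dlq()
assert len(main_queue) == 1 and not dlq  # back in rotation after the fix
```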
Temporal: Durable Workflows
Temporal is a workflow engine for long-running, stateful processes. Unlike traditional task queues where you must manually handle retries, timeouts, and state persistence, Temporal provides durable execution: the workflow code is a normal Python/Go/Java function, and Temporal automatically persists its progress. If the server crashes mid-workflow, Temporal replays the event history to resume exactly where it left off.
# Temporal workflow: order fulfillment (days-long process)
from datetime import timedelta
from temporalio import workflow
from temporalio.common import RetryPolicy

@workflow.defn
class OrderFulfillmentWorkflow:
    @workflow.run
    async def run(self, order_id: str) -> str:
        # Each step is an activity — retried automatically on failure
        # Workflow state persisted after every activity completion

        # Step 1: charge payment (with automatic retry on transient failures)
        charge_id = await workflow.execute_activity(
            charge_payment,
            order_id,
            start_to_close_timeout=timedelta(seconds=30),
            retry_policy=RetryPolicy(maximum_attempts=3),
        )
        # Step 2: wait for inventory confirmation (could take hours)
        inventory_confirmed = await workflow.execute_activity(
            reserve_inventory,
            order_id,
            schedule_to_close_timeout=timedelta(hours=24),  # wait up to 24h
        )
        # Step 3: ship (conditional); every activity needs a timeout
        if inventory_confirmed:
            tracking = await workflow.execute_activity(
                create_shipment, order_id,
                start_to_close_timeout=timedelta(minutes=5),
            )
            return tracking
        else:
            await workflow.execute_activity(
                refund_payment, charge_id,
                start_to_close_timeout=timedelta(seconds=30),
            )
            return "refunded"
# If this process runs for 3 days and the server crashes on day 2,
# Temporal replays the event history and resumes from the last completed activity.
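The replay mechanism itself can be illustrated in a few lines. This is not Temporal's API, just a minimal model of durable execution via event sourcing: each activity result is persisted before the workflow advances, so a re-run consults the history instead of re-executing completed steps:

```python
history = {}  # stand-in for the persisted event log
calls = []    # tracks real (non-replayed) activity executions

def execute_activity(name, fn):
    if name in history:    # completed before the "crash": replay the result
        return history[name]
    result = fn()          # first execution: run, then persist
    calls.append(name)
    history[name] = result
    return result

def workflow():
    charge = execute_activity("charge_payment", lambda: "ch_1")
    stock = execute_activity("reserve_inventory", lambda: "ok")
    return execute_activity("create_shipment",
                            lambda: f"track_{charge}_{stock}")

workflow()                            # first run executes all three activities
calls.clear()
assert workflow() == "track_ch_1_ok"  # "after a crash": same result...
assert calls == []                    # ...with zero activities re-executed
```

This is why Temporal workflow code must be deterministic: replay only works if the code takes the same path given the same history.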
Scaling Workers
Workers are stateless — scale horizontally by adding instances. Queue depth is the primary scaling signal: if queue depth grows above a threshold, spin up more workers. Autoscaling approaches: Kubernetes HPA with KEDA (scale on SQS queue depth or Kafka consumer lag); ECS Service Auto Scaling with CloudWatch queue metrics; Celery autoscale (workers manage their own concurrency based on queue depth).
# KEDA ScaledObject for SQS-based autoscaling:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: email-worker-scaler
spec:
  scaleTargetRef:
    name: email-worker
  minReplicaCount: 1
  maxReplicaCount: 50
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/xxx/email-queue
        queueLength: "100"  # 1 replica per 100 messages in queue
        awsRegion: "us-east-1"
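The scaling arithmetic behind a config like this is simple. A rough sketch (an approximation of the HPA-style target math KEDA drives, not KEDA's actual implementation):

```python
import math

def desired_replicas(queue_depth, per_replica=100,
                     min_replicas=1, max_replicas=50):
    """One replica per `per_replica` queued messages, clamped to the
    configured minReplicaCount/maxReplicaCount bounds."""
    want = math.ceil(queue_depth / per_replica)
    return max(min_replicas, min(max_replicas, want))

assert desired_replicas(0) == 1        # never below minReplicaCount
assert desired_replicas(250) == 3      # ceil(250 / 100)
assert desired_replicas(10_000) == 50  # capped at maxReplicaCount
```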
Interview Questions
- Design a job scheduler that runs millions of cron jobs at precise times
- How do you ensure exactly-once execution for a payment-processing job?
- A queue is backing up — jobs are being enqueued faster than workers can process them. What do you do?
- Design a system that sends scheduled emails to users at their local 9 AM
- How do you handle a job that takes 6 hours and a worker that crashes after 4 hours?
Frequently Asked Questions
What is the difference between at-least-once and exactly-once task execution?
At-least-once execution guarantees a task will eventually complete but may execute more than once. This happens because the queue keeps a message until it receives an acknowledgment from the worker. If the worker crashes after completing the task but before sending the ack, the queue re-delivers the message to another worker — the task executes twice. At-least-once is the default for most task queues (Celery, Sidekiq, SQS). To use at-least-once safely, tasks must be idempotent: running them twice produces the same result as running them once. For many tasks this is natural (sending an email with idempotency key, updating a record to a specific value, inserting with ON CONFLICT DO NOTHING). At-most-once execution deletes the message before executing — if the worker crashes, the task is lost. Used when task loss is preferable to duplication (analytics events, best-effort notifications). Exactly-once is the hardest guarantee: the task runs exactly once even under crashes and network failures. It requires atomic coordination between consuming the message and recording the result — typically via two-phase commit or distributed transactions. Temporal achieves durable exactly-once semantics through event sourcing: every activity's result is persisted before the workflow advances, so crashes cause replay rather than re-execution. The practical advice: design tasks to be idempotent and use at-least-once — it is simpler to implement, scales better, and is sufficient for 99% of use cases.
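The "inserting with ON CONFLICT DO NOTHING" pattern mentioned above makes a write naturally idempotent: the primary key acts as the idempotency key, so a duplicate delivery becomes a no-op. A small sketch using SQLite (3.24+ accepts the same ON CONFLICT clause as PostgreSQL; the table and column names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (payment_id TEXT PRIMARY KEY, status TEXT)")

def record_charge(payment_id):
    # Re-running the insert is a no-op: the primary key is the idempotency key
    conn.execute(
        "INSERT INTO payments (payment_id, status) VALUES (?, 'charged') "
        "ON CONFLICT (payment_id) DO NOTHING",
        (payment_id,),
    )

record_charge("pay_42")
record_charge("pay_42")  # duplicate delivery: harmless
count = conn.execute("SELECT COUNT(*) FROM payments").fetchone()[0]
assert count == 1
```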
How do you design a cron scheduler that handles millions of scheduled jobs reliably?
A naive cron scheduler runs a single process that evaluates all schedules every minute — this breaks at scale (millions of schedules × CPU per evaluation × network to fetch schedules). A scalable design uses a distributed architecture: (1) Schedule storage: store all cron schedules in a database (job_id, cron_expression, last_run_at, next_run_at, payload). The next_run_at field is pre-computed when a schedule is saved or after each execution. (2) Polling with sharding: multiple scheduler instances each own a shard of job IDs (consistent hashing or explicit shard assignment). Each instance polls its shard: SELECT * FROM cron_jobs WHERE next_run_at <= NOW() AND shard_id = ? LIMIT 1000. At the scheduled time, it enqueues the job into the task queue and updates next_run_at = compute_next_run(cron_expression, NOW()). (3) Leader election for singletons: some jobs must run on exactly one node (e.g., database cleanup). Use Redis SETNX or ZooKeeper for leader election — only the leader polls and enqueues singleton jobs. (4) Clock skew tolerance: different machines may disagree on the current time by 1-2 seconds. Poll with a small buffer (next_run_at <= NOW() + 1 second) to catch jobs that should have run but haven't. (5) Missed-window handling: if the scheduler is down for 10 minutes, jobs that were due during that window must be caught up. A "missed window" policy decides whether to run them now or skip them. Airflow's catchup=True mode re-runs all missed schedule intervals; most task schedulers run only the latest missed instance.
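The poll-and-advance loop at the heart of this design can be sketched in a few lines. Fixed-interval schedules stand in for full cron expressions (computing next_run_at from a real cron expression would use a library such as croniter), and a list stands in for the cron_jobs table:

```python
from datetime import datetime, timedelta

# Stand-in for the cron_jobs table; interval schedules stand in for
# cron expressions to keep the sketch dependency-free
jobs = [
    {"job_id": 1, "interval": timedelta(hours=24),
     "next_run_at": datetime(2024, 1, 1, 9, 0)},
    {"job_id": 2, "interval": timedelta(hours=1),
     "next_run_at": datetime(2024, 1, 1, 10, 0)},
]

def poll_due(now, skew=timedelta(seconds=1)):
    """One scheduler instance's poll: enqueue due jobs, advance next_run_at."""
    due = []
    for job in jobs:
        if job["next_run_at"] <= now + skew:  # tolerate small clock skew
            due.append(job["job_id"])         # real code: push to task queue
            job["next_run_at"] = now + job["interval"]  # pre-compute next run
    return due

assert poll_due(datetime(2024, 1, 1, 9, 0)) == [1]   # only job 1 is due
assert poll_due(datetime(2024, 1, 1, 10, 0)) == [2]  # job 1 already advanced
```

Because next_run_at is advanced as part of the poll, a job is never enqueued twice for the same tick by the instance that owns its shard.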
How does Temporal differ from a traditional task queue like Celery or Sidekiq?
Traditional task queues (Celery, Sidekiq, Resque, BullMQ) are simple: a producer enqueues a message, a worker dequeues and executes it, the result is optionally stored. They excel at simple, short-duration background tasks but have limitations for complex workflows: you must manually implement retry logic, timeout handling, and state persistence. If a multi-step workflow crashes after step 3 of 10, you must manually figure out what to replay. Temporal is a durable workflow engine. It solves the coordination problem: workflows in Temporal are normal code (Python, Go, Java, TypeScript) that appears to run sequentially, but Temporal persists the entire execution history to a distributed database (Cassandra or MySQL). Every activity (unit of work) and every workflow state transition is recorded. If the Temporal Worker crashes mid-execution, on restart the workflow replays its event history to resume exactly where it stopped — without the developer writing any persistence or replay logic. Key differences: (1) Workflows can be arbitrarily long-running (days, months) — Temporal handles persistence, resumability, and timer callbacks. (2) Complex control flow (if/else, loops, waiting for human approval, racing multiple activities) is expressed in regular code. (3) Temporal provides a full execution history visible in its UI — debugging failed workflows shows exactly which activity failed and with what error. Use Celery/Sidekiq for: simple background jobs, high-throughput short tasks, teams already familiar with them. Use Temporal for: multi-step workflows, workflows that span minutes to months, financial operations requiring exactly-once semantics, or any workflow requiring human-in-the-loop steps.