Why Task Queues?
Task queues decouple work production from work execution. Instead of processing a long-running task synchronously (blocking the HTTP response for 30 seconds), you push the task to a queue and return a job ID immediately. Background workers pull from the queue and execute the task asynchronously. Common use cases: sending emails/SMS, generating PDF reports, image/video processing, ML model inference, ETL jobs, webhook delivery, and scheduled maintenance tasks. The HTTP response time drops from 30 seconds to milliseconds; the user experience improves dramatically.
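The decoupling can be sketched in-process: a minimal illustration using Python's `queue` module and a thread as stand-ins for the broker and the worker fleet (the `enqueue`/`worker` names and the job payload are invented for this sketch):

```python
import queue
import threading
import uuid

jobs = queue.Queue()          # stand-in for the broker
results = {}                  # stand-in for the result store

def enqueue(task):
    """Producer: push the task and return a job ID immediately."""
    job_id = uuid.uuid4().hex
    jobs.put((job_id, task))
    return job_id  # the HTTP handler returns this in milliseconds

def worker():
    """Background worker: pull and execute tasks asynchronously."""
    while True:
        job_id, task = jobs.get()
        results[job_id] = "done: " + task["name"]  # stand-in for real work
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()

job_id = enqueue({"name": "generate_report"})
jobs.join()  # the demo waits; real callers poll the result store by job ID
assert results[job_id] == "done: generate_report"
```

The caller never blocks on the work itself; it only holds the job ID and checks the result store later.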
Core Architecture
# Components:
# 1. Producer: API server that enqueues jobs
# 2. Broker: message queue that stores jobs (Redis, SQS, RabbitMQ, Kafka)
# 3. Workers: processes that pull and execute jobs
# 4. Result store: where job results are stored (Redis, database)
# 5. Scheduler: handles delayed and recurring jobs (cron)
# Basic Celery example (Python):
from celery import Celery

app = Celery(
    "tasks",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/1",
)

@app.task(bind=True, max_retries=3)
def send_email(self, user_id: int, template: str):
    try:
        user = db.get_user(user_id)  # db and email_service defined elsewhere
        email_service.send(user.email, template)
    except EmailException as e:
        raise self.retry(exc=e, countdown=60)  # retry after 60s
# Producer (in Django view):
send_email.delay(user_id=42, template="welcome") # enqueues, returns AsyncResult
# Delayed execution:
send_email.apply_async(args=[42, "followup"], countdown=3600) # 1 hour from now
# Periodic tasks (cron):
from celery.schedules import crontab

app.conf.beat_schedule = {
    "daily-report": {
        "task": "tasks.generate_report",
        "schedule": crontab(hour=9, minute=0),  # 9 AM daily
    },
}
Delivery Semantics
The most critical design decision for any task queue is its delivery guarantee:
- At-most-once: task may be lost if a worker crashes mid-execution. The queue deletes the message before executing. Used when task loss is acceptable (analytics events, non-critical notifications). No retry on failure.
- At-least-once: task will eventually execute but may execute more than once. The queue keeps the message until the worker acknowledges success. If the worker crashes, the message becomes visible again and another worker picks it up. This is the standard for most task queues (Celery with Redis/SQS, Sidekiq). Requires idempotent task implementations.
- Exactly-once: task executes exactly once despite failures. Requires two-phase commit or distributed transactions. Very hard to achieve; Temporal’s workflow engine approximates it using durable event sourcing.
# Making tasks idempotent (required for at-least-once):
@app.task
def charge_user(payment_id: str):
    # Check if already processed (idempotency key = payment_id)
    if db.exists(f"processed_payment:{payment_id}"):
        return {"status": "already_processed"}
    with db.transaction():
        payment = db.get_payment(payment_id)
        if payment.status == "pending":
            stripe.charge(payment)
            db.update_payment(payment_id, status="charged")
            db.set(f"processed_payment:{payment_id}", True, ex=86400)
    return {"status": "charged"}
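Why this matters can be seen by simulating at-least-once delivery directly: deliver the same message twice and check that the side effect happens once. This sketch uses in-memory stand-ins (`processed`, `charges`) for the idempotency-key store and the payment provider:

```python
processed = set()  # stand-in for the idempotency-key store (Redis/db)
charges = []       # stand-in for the payment provider

def charge_user(payment_id):
    # The payment ID doubles as the idempotency key
    if payment_id in processed:
        return "already_processed"
    charges.append(payment_id)  # the side effect happens at most once
    processed.add(payment_id)
    return "charged"

# At-least-once delivery: the same message arrives twice
assert charge_user("pay_123") == "charged"
assert charge_user("pay_123") == "already_processed"
assert charges == ["pay_123"]  # exactly one charge despite two deliveries
```

Running the task twice produces the same end state as running it once, which is the definition of idempotence the at-least-once guarantee requires.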
Priority Queues
Different tasks have different urgency. Email delivery (non-urgent) should not block password reset emails (urgent). Implement priority via multiple queues with different worker allocations:
# Celery worker consuming multiple queues with priority:
celery -A tasks worker --queues=critical,high,normal,low -c 10
# Note: Celery does not strictly prioritize by queue order; a worker
# consuming several queues interleaves them. For hard priority, run
# dedicated worker pools per queue (more processes for critical queues).
# Queue-based priority in practice:
from kombu import Queue

app.conf.task_queues = (
    Queue("critical", routing_key="critical"),  # e.g. 4 dedicated workers
    Queue("high", routing_key="high"),          # 3 workers
    Queue("normal", routing_key="normal"),      # 2 workers
    Queue("low", routing_key="low"),            # 1 worker
)
# Route tasks to queues by type:
@app.task(queue="critical")
def send_password_reset(user_id: int): ...
@app.task(queue="low")
def generate_monthly_report(month: str): ...
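At its core, strict-priority dispatch is just "check the higher queues first". A minimal in-process sketch, with in-memory deques standing in for the broker's queues (a dedicated-worker-pool setup achieves the same effect across processes; note that under sustained critical load this scheme starves the low queue):

```python
from collections import deque

# One deque per priority level, highest first (stand-in for broker queues)
queues = {name: deque() for name in ("critical", "high", "normal", "low")}

def enqueue(priority, task):
    queues[priority].append(task)

def next_task():
    # Always drain higher-priority queues before lower ones
    for name in ("critical", "high", "normal", "low"):
        if queues[name]:
            return queues[name].popleft()
    return None

enqueue("low", "monthly_report")
enqueue("critical", "password_reset")
assert next_task() == "password_reset"  # critical jumps the line
assert next_task() == "monthly_report"
```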
Delay Queues and Scheduled Jobs
Delay queues hold messages until a specified future time. SQS supports delayed delivery natively, but its per-message DelaySeconds tops out at 15 minutes. For longer delays (hours/days), use a different mechanism:
# Redis sorted set as delay queue:
# Score = Unix timestamp when the job should become visible
import redis, time, json

r = redis.Redis()

def enqueue_delayed(job: dict, delay_seconds: int):
    execute_at = time.time() + delay_seconds
    r.zadd("delayed_jobs", {json.dumps(job): execute_at})

def poll_ready_jobs():
    now = time.time()
    # Get up to 100 jobs with score <= now (ready to execute)
    jobs = r.zrangebyscore("delayed_jobs", 0, now, start=0, num=100)
    if jobs:
        # Remove exactly the members we fetched. Removing the whole score
        # range would also delete ready jobs beyond the first 100 without
        # returning them. (With multiple pollers, make fetch-and-remove
        # atomic via a Lua script or ZPOPMIN.)
        r.zrem("delayed_jobs", *jobs)
    return [json.loads(j) for j in jobs]
# Poll every second, move ready jobs to the normal task queue
# Sidekiq's scheduled and retry sets work this way internally: a Redis
# sorted set polled by the poller process
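The same pattern can be modeled in-process with a min-heap, which makes the mechanics easy to test deterministically (timestamps are passed in explicitly; job names are invented for the sketch):

```python
import heapq
import time

delayed = []  # min-heap of (execute_at, job), ordered by timestamp

def enqueue_delayed(job, delay_seconds, now):
    heapq.heappush(delayed, (now + delay_seconds, job))

def poll_ready(now):
    # Pop every job whose execute_at has passed; leave the rest queued
    ready = []
    while delayed and delayed[0][0] <= now:
        ready.append(heapq.heappop(delayed)[1])
    return ready

t0 = time.time()
enqueue_delayed("send_followup", 3600, t0)
enqueue_delayed("send_welcome", 0, t0)
assert poll_ready(t0) == ["send_welcome"]          # only due jobs surface
assert poll_ready(t0 + 3600) == ["send_followup"]  # an hour later
```

The Redis sorted set plays exactly the role of this heap, with the added benefits of durability and visibility to every poller.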
Dead Letter Queue (DLQ)
When a job fails all retry attempts, it should not be silently discarded. A dead letter queue (DLQ) captures all permanently failed jobs for inspection and replay:
# SQS DLQ setup:
# Main queue: maxReceiveCount=5 → after 5 delivery attempts, move to DLQ
# DLQ: separate queue for failed messages, with long retention (14 days)
# Celery with DLQ pattern:
@app.task(bind=True, max_retries=5,
          acks_late=True,              # acknowledge only after successful execution
          reject_on_worker_lost=True)  # re-queue if worker crashes mid-task
def process_webhook(self, payload: dict):
    try:
        handle_webhook(payload)
    except TemporaryError as e:
        raise self.retry(exc=e, countdown=2 ** self.request.retries)  # exponential backoff
    except PermanentError as e:
        # Do not retry — send to DLQ
        dlq.publish({"task": "process_webhook", "payload": payload, "error": str(e)})
        return  # mark as complete (not retried)
# DLQ monitoring: alert when DLQ depth > threshold
# DLQ replay: fix the bug, then reprocess jobs from DLQ
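The retry-budget-then-park-then-replay lifecycle can be sketched with in-memory lists standing in for the main queue and the DLQ (`process`, `replay_dlq`, and the message shape are invented for this sketch):

```python
main_queue = []   # stand-in for the main queue
dlq = []          # stand-in for the dead letter queue
MAX_ATTEMPTS = 5  # analogous to SQS maxReceiveCount

def always_fails(payload):
    raise ValueError("bad payload")

def process(msg, handler):
    try:
        handler(msg["payload"])
    except Exception as exc:
        msg["attempts"] = msg.get("attempts", 0) + 1
        if msg["attempts"] >= MAX_ATTEMPTS:
            dlq.append({**msg, "error": str(exc)})  # park for inspection
        else:
            main_queue.append(msg)  # redeliver for another attempt

def replay_dlq():
    """After the bug is fixed, move DLQ messages back for reprocessing."""
    while dlq:
        msg = dlq.pop(0)
        msg["attempts"] = 0  # reset the retry budget
        main_queue.append(msg)

msg = {"payload": "evt_1", "attempts": 4}
process(msg, always_fails)
assert len(dlq) == 1 and not main_queue  # fifth failure parks it
replay_dlq()
assert len(main_queue) == 1 and not dlq  # back in rotation after the fix
```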
Temporal: Durable Workflows
Temporal is a workflow engine for long-running, stateful processes. Unlike traditional task queues where you must manually handle retries, timeouts, and state persistence, Temporal provides durable execution: the workflow code is a normal Python/Go/Java function, and Temporal automatically persists its progress. If the server crashes mid-workflow, Temporal replays the event history to resume exactly where it left off.
# Temporal workflow: order fulfillment (days-long process)
from datetime import timedelta
from temporalio import workflow
from temporalio.common import RetryPolicy

@workflow.defn
class OrderFulfillmentWorkflow:
    @workflow.run
    async def run(self, order_id: str) -> str:
        # Each step is an activity — retried automatically on failure
        # Workflow state persisted after every activity completion

        # Step 1: charge payment (with automatic retry on transient failures)
        charge_id = await workflow.execute_activity(
            charge_payment,
            order_id,
            start_to_close_timeout=timedelta(seconds=30),
            retry_policy=RetryPolicy(maximum_attempts=3),
        )
        # Step 2: wait for inventory confirmation (could take hours)
        inventory_confirmed = await workflow.execute_activity(
            reserve_inventory,
            order_id,
            schedule_to_close_timeout=timedelta(hours=24),  # wait up to 24h
        )
        # Step 3: ship (conditional); every activity needs a timeout
        if inventory_confirmed:
            tracking = await workflow.execute_activity(
                create_shipment, order_id,
                start_to_close_timeout=timedelta(minutes=5),
            )
            return tracking
        else:
            await workflow.execute_activity(
                refund_payment, charge_id,
                start_to_close_timeout=timedelta(seconds=30),
            )
            return "refunded"
# If this process runs for 3 days and the server crashes on day 2,
# Temporal replays the event history and resumes from the last completed activity.
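The replay mechanism itself can be illustrated in a few lines. This is not Temporal's API, just a minimal model of durable execution via event sourcing: each activity result is persisted before the workflow advances, so a re-run consults the history instead of re-executing completed steps:

```python
history = {}  # stand-in for the persisted event log
calls = []    # tracks real (non-replayed) activity executions

def execute_activity(name, fn):
    if name in history:    # completed before the "crash": replay the result
        return history[name]
    result = fn()          # first execution: run, then persist
    calls.append(name)
    history[name] = result
    return result

def workflow():
    charge = execute_activity("charge_payment", lambda: "ch_1")
    stock = execute_activity("reserve_inventory", lambda: "ok")
    return execute_activity("create_shipment",
                            lambda: f"track_{charge}_{stock}")

workflow()                            # first run executes all three activities
calls.clear()
assert workflow() == "track_ch_1_ok"  # "after a crash": same result...
assert calls == []                    # ...with zero activities re-executed
```

This is why Temporal workflow code must be deterministic: replay only works if the code takes the same path given the same history.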
Scaling Workers
Workers are stateless — scale horizontally by adding instances. Queue depth is the primary scaling signal: if queue depth grows above a threshold, spin up more workers. Autoscaling approaches: Kubernetes HPA with KEDA (scale on SQS queue depth or Kafka consumer lag); ECS Service Auto Scaling with CloudWatch queue metrics; Celery autoscale (workers manage their own concurrency based on queue depth).
# KEDA ScaledObject for SQS-based autoscaling:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: email-worker-scaler
spec:
  scaleTargetRef:
    name: email-worker
  minReplicaCount: 1
  maxReplicaCount: 50
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/xxx/email-queue
        queueLength: "100"  # 1 replica per 100 messages in queue
        awsRegion: "us-east-1"
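The scaling arithmetic behind a config like this is simple. A rough sketch (an approximation of the HPA-style target math KEDA drives, not KEDA's actual implementation):

```python
import math

def desired_replicas(queue_depth, per_replica=100,
                     min_replicas=1, max_replicas=50):
    """One replica per `per_replica` queued messages, clamped to the
    configured minReplicaCount/maxReplicaCount bounds."""
    want = math.ceil(queue_depth / per_replica)
    return max(min_replicas, min(max_replicas, want))

assert desired_replicas(0) == 1        # never below minReplicaCount
assert desired_replicas(250) == 3      # ceil(250 / 100)
assert desired_replicas(10_000) == 50  # capped at maxReplicaCount
```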
Interview Questions
- Design a job scheduler that runs millions of cron jobs at precise times
- How do you ensure exactly-once execution for a payment-processing job?
- A queue is backing up — jobs are being enqueued faster than workers can process them. What do you do?
- Design a system that sends scheduled emails to users at their local 9 AM
- How do you handle a job that takes 6 hours and a worker that crashes after 4 hours?
Frequently Asked Questions
What is the difference between at-least-once and exactly-once task execution?
At-least-once execution guarantees a task will eventually complete but may execute more than once. This happens because the queue keeps a message until it receives an acknowledgment from the worker. If the worker crashes after completing the task but before sending the ack, the queue re-delivers the message to another worker — the task executes twice. At-least-once is the default for most task queues (Celery, Sidekiq, SQS). To use at-least-once safely, tasks must be idempotent: running them twice produces the same result as running them once. For many tasks this is natural (sending an email with idempotency key, updating a record to a specific value, inserting with ON CONFLICT DO NOTHING). At-most-once execution deletes the message before executing — if the worker crashes, the task is lost. Used when task loss is preferable to duplication (analytics events, best-effort notifications). Exactly-once is the hardest guarantee: the task runs exactly once even under crashes and network failures. It requires atomic coordination between consuming the message and recording the result — typically via two-phase commit or distributed transactions. Temporal achieves durable exactly-once semantics through event sourcing: every activity's result is persisted before the workflow advances, so crashes cause replay rather than re-execution. The practical advice: design tasks to be idempotent and use at-least-once — it is simpler to implement, scales better, and is sufficient for 99% of use cases.
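The "inserting with ON CONFLICT DO NOTHING" pattern mentioned above makes a write naturally idempotent: the primary key acts as the idempotency key, so a duplicate delivery becomes a no-op. A small sketch using SQLite (3.24+ accepts the same ON CONFLICT clause as PostgreSQL; the table and column names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (payment_id TEXT PRIMARY KEY, status TEXT)")

def record_charge(payment_id):
    # Re-running the insert is a no-op: the primary key is the idempotency key
    conn.execute(
        "INSERT INTO payments (payment_id, status) VALUES (?, 'charged') "
        "ON CONFLICT (payment_id) DO NOTHING",
        (payment_id,),
    )

record_charge("pay_42")
record_charge("pay_42")  # duplicate delivery: harmless
count = conn.execute("SELECT COUNT(*) FROM payments").fetchone()[0]
assert count == 1
```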
How do you design a cron scheduler that handles millions of scheduled jobs reliably?
A naive cron scheduler runs a single process that evaluates all schedules every minute — this breaks at scale (millions of schedules × CPU per evaluation × network to fetch schedules). A scalable design uses a distributed architecture: (1) Schedule storage: store all cron schedules in a database (job_id, cron_expression, last_run_at, next_run_at, payload). The next_run_at field is pre-computed when a schedule is saved or after each execution. (2) Polling with sharding: multiple scheduler instances each own a shard of job IDs (consistent hashing or explicit shard assignment). Each instance polls its shard: SELECT * FROM cron_jobs WHERE next_run_at <= NOW() AND shard_id = ? LIMIT 1000. At the scheduled time, it enqueues the job into the task queue and updates next_run_at = compute_next_run(cron_expression, NOW()). (3) Leader election for singletons: some jobs must run on exactly one node (e.g., database cleanup). Use Redis SETNX or ZooKeeper for leader election — only the leader polls and enqueues singleton jobs. (4) Clock skew tolerance: different machines may disagree on the current time by 1-2 seconds. Poll with a small buffer (next_run_at <= NOW() + 1 second) to catch jobs that should have run but haven't. (5) Missed-window handling: if the scheduler is down for 10 minutes, jobs that were due during that window must be caught up. A "missed window" policy decides whether to run them now or skip them. Airflow's catchup=True mode re-runs all missed schedule intervals; most task schedulers run only the latest missed instance.
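The poll-and-advance loop at the heart of this design can be sketched in a few lines. Fixed-interval schedules stand in for full cron expressions (computing next_run_at from a real cron expression would use a library such as croniter), and a list stands in for the cron_jobs table:

```python
from datetime import datetime, timedelta

# Stand-in for the cron_jobs table; interval schedules stand in for
# cron expressions to keep the sketch dependency-free
jobs = [
    {"job_id": 1, "interval": timedelta(hours=24),
     "next_run_at": datetime(2024, 1, 1, 9, 0)},
    {"job_id": 2, "interval": timedelta(hours=1),
     "next_run_at": datetime(2024, 1, 1, 10, 0)},
]

def poll_due(now, skew=timedelta(seconds=1)):
    """One scheduler instance's poll: enqueue due jobs, advance next_run_at."""
    due = []
    for job in jobs:
        if job["next_run_at"] <= now + skew:  # tolerate small clock skew
            due.append(job["job_id"])         # real code: push to task queue
            job["next_run_at"] = now + job["interval"]  # pre-compute next run
    return due

assert poll_due(datetime(2024, 1, 1, 9, 0)) == [1]   # only job 1 is due
assert poll_due(datetime(2024, 1, 1, 10, 0)) == [2]  # job 1 already advanced
```

Because next_run_at is advanced as part of the poll, a job is never enqueued twice for the same tick by the instance that owns its shard.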
How does Temporal differ from a traditional task queue like Celery or Sidekiq?
Traditional task queues (Celery, Sidekiq, Resque, BullMQ) are simple: a producer enqueues a message, a worker dequeues and executes it, the result is optionally stored. They excel at simple, short-duration background tasks but have limitations for complex workflows: you must manually implement retry logic, timeout handling, and state persistence. If a multi-step workflow crashes after step 3 of 10, you must manually figure out what to replay. Temporal is a durable workflow engine. It solves the coordination problem: workflows in Temporal are normal code (Python, Go, Java, TypeScript) that appears to run sequentially, but Temporal persists the entire execution history to a distributed database (Cassandra or MySQL). Every activity (unit of work) and every workflow state transition is recorded. If the Temporal Worker crashes mid-execution, on restart the workflow replays its event history to resume exactly where it stopped — without the developer writing any persistence or replay logic. Key differences: (1) Workflows can be arbitrarily long-running (days, months) — Temporal handles persistence, resumability, and timer callbacks. (2) Complex control flow (if/else, loops, waiting for human approval, racing multiple activities) is expressed in regular code. (3) Temporal provides a full execution history visible in its UI — debugging failed workflows shows exactly which activity failed and with what error. Use Celery/Sidekiq for: simple background jobs, high-throughput short tasks, teams already familiar with them. Use Temporal for: multi-step workflows, workflows that span minutes to months, financial operations requiring exactly-once semantics, or any workflow requiring human-in-the-loop steps.