System Design: Distributed Task Scheduler — Cron Jobs, Delayed Execution, Priority Queue, Exactly-Once, Celery, Temporal

A distributed task scheduler executes jobs at specified times or intervals across a fleet of workers. It powers cron jobs, delayed notifications, scheduled reports, retry mechanisms, and workflow orchestration. Designing a reliable task scheduler tests your understanding of distributed coordination, exactly-once execution, and failure recovery. This guide covers the architecture from task creation to execution — essential for system design interviews.

Requirements and Task Types

Task types: (1) Immediate — execute as soon as possible (background job processing: resize an image, send an email). (2) Delayed — execute after a specified delay (send a reminder email in 24 hours, retry a failed payment in 5 minutes). (3) Scheduled — execute at a specific time (generate a daily report at 2 AM, expire a coupon at midnight). (4) Recurring — execute on a cron schedule (every hour, every Monday at 9 AM). Non-functional requirements: exactly-once execution (a task must not be executed twice or skipped), fault tolerance (worker crashes must not lose tasks), scalability (handle millions of pending tasks), low latency for immediate tasks (execute within seconds of creation), and observability (track task status, execution time, failures). Scale estimation: a large platform might have 100 million scheduled tasks per day. At peak: 10,000 task executions per second. Each task has: task_id, type, payload (serialized arguments), scheduled_time, status, retry_count, max_retries, created_at.
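The task record listed above can be sketched as a small data model. A minimal Python sketch, assuming a dataclass and an illustrative TaskStatus enum; the field defaults (max_retries of 5, UTC created_at) are assumptions for illustration, not from a specific system:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class TaskStatus(Enum):
    PENDING = "PENDING"
    IN_PROGRESS = "IN_PROGRESS"
    COMPLETED = "COMPLETED"
    RETRY = "RETRY"
    FAILED = "FAILED"


@dataclass
class Task:
    task_id: str
    type: str                # "immediate", "delayed", "scheduled", "recurring"
    payload: str             # serialized arguments, e.g. a JSON string
    scheduled_time: datetime
    status: TaskStatus = TaskStatus.PENDING
    retry_count: int = 0
    max_retries: int = 5
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
```

A new task starts as PENDING with zero retries; every later section of the design (claiming, retrying, dead-lettering) is a transition on the status field.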

Architecture

Components: (1) Task API — accepts task creation requests (create a task with payload X, execute at time T). Validates input, assigns task_id, stores in the task database, and enqueues for scheduling. (2) Task database — persistent storage for all tasks (PostgreSQL). Stores task definitions, status, execution history, and results. The source of truth for task state. (3) Scheduler — reads tasks from the database where scheduled_time <= now AND status = PENDING, and enqueues them for execution. Runs continuously, polling every second. For delayed tasks: a timer-based mechanism (Redis sorted set with scheduled_time as score, or a database query with index on scheduled_time). (4) Task queue — a message queue (Redis, SQS, Kafka) that buffers tasks ready for execution. Workers pull from the queue. The queue provides load leveling and decouples scheduling from execution. (5) Workers — stateless processes that pull tasks from the queue, execute them, and update the task status. Horizontally scalable: add more workers for higher throughput. (6) Dead letter queue — tasks that fail after max_retries are moved here for investigation.
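The worker half of this architecture is a small loop: pull a task, execute it, record the outcome. A minimal in-process sketch using Python's standard queue module in place of a real broker; the claim, lock, and heartbeat logic from later sections is deliberately omitted here:

```python
import queue
import threading


def worker_loop(task_queue: queue.Queue, results: dict, stop: threading.Event) -> None:
    """Minimal worker sketch: pull a task, execute it, record its status.
    A real worker would also claim the task in the database and heartbeat."""
    while not stop.is_set():
        try:
            task_id, fn = task_queue.get(timeout=0.1)
        except queue.Empty:
            continue
        try:
            fn()
            results[task_id] = "COMPLETED"
        except Exception:
            results[task_id] = "FAILED"
        finally:
            task_queue.task_done()


# Usage: two workers drain a shared queue; adding workers raises throughput.
q: queue.Queue = queue.Queue()
results: dict = {}
stop = threading.Event()
workers = [threading.Thread(target=worker_loop, args=(q, results, stop)) for _ in range(2)]
for w in workers:
    w.start()
q.put(("T1", lambda: None))   # succeeds
q.put(("T2", lambda: 1 / 0))  # raises, recorded as FAILED
q.join()
stop.set()
for w in workers:
    w.join()
print(results)  # both tasks recorded: T1 COMPLETED, T2 FAILED
```

The workers are stateless: all durable state lives in the results store (the task database, in the real design), which is what makes horizontal scaling trivial.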

Exactly-Once Execution

The hardest requirement: ensure each task executes exactly once despite worker crashes, network failures, and scheduler restarts. Problem scenario: worker A picks up task T, starts executing, then crashes. The queue redelivers task T to worker B. Worker A recovers and also completes task T. The task executed twice. Prevention: (1) Distributed lock per task — before executing, the worker acquires a lock (Redis SET task_lock:{task_id} worker_id NX EX 300). Only one worker can hold the lock. The lock TTL (300 seconds) ensures the lock is released if the worker crashes. (2) Idempotent tasks — design tasks to produce the same result when executed multiple times. An email send task checks a sent_emails table before sending. A payment task uses an idempotency key. (3) Status check before execution — the worker reads the task status from the database. If status is already COMPLETED or IN_PROGRESS (by another worker), skip. Use optimistic locking: UPDATE tasks SET status = IN_PROGRESS, worker_id = me WHERE task_id = T AND status = PENDING. If 0 rows updated, another worker claimed it. (4) Heartbeat and lease renewal — while executing, the worker periodically extends the lock TTL. If the worker crashes, the lock expires and another worker can claim the task.
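The database-level claim in (3) can be demonstrated with any SQL database. A minimal runnable sketch using SQLite in place of PostgreSQL, with the table reduced to only the columns the claim touches:

```python
import sqlite3


def claim_task(conn: sqlite3.Connection, task_id: str, worker_id: str) -> bool:
    """Atomically claim a PENDING task via optimistic locking.
    Returns True only for the single worker whose UPDATE matched a row."""
    cur = conn.execute(
        "UPDATE tasks SET status = 'IN_PROGRESS', worker_id = ? "
        "WHERE task_id = ? AND status = 'PENDING'",
        (worker_id, task_id),
    )
    conn.commit()
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (task_id TEXT PRIMARY KEY, status TEXT, worker_id TEXT)")
conn.execute("INSERT INTO tasks VALUES ('T1', 'PENDING', NULL)")

print(claim_task(conn, "T1", "worker-A"))  # True: first claim wins
print(claim_task(conn, "T1", "worker-B"))  # False: already IN_PROGRESS
```

The race is resolved inside the database: both workers issue the same UPDATE, but the WHERE status = 'PENDING' predicate matches for exactly one of them, so rowcount tells each worker whether it won.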

Delayed and Scheduled Task Execution

Delayed tasks must fire at a specific future time. Implementation options: (1) Database polling — a scheduler queries every second: SELECT * FROM tasks WHERE scheduled_time <= NOW() AND status = PENDING ORDER BY scheduled_time LIMIT 100. An index on (status, scheduled_time) makes this efficient. Simple, but the constant polling adds load to the database. (2) Redis sorted set — ZADD scheduled_tasks {unix_timestamp} {task_id}. The scheduler runs ZRANGEBYSCORE scheduled_tasks 0 {now} LIMIT 0 100 every second to find tasks due for execution. Redis sorted set operations are O(log N) and extremely fast. Remove tasks after enqueuing with ZREM. (3) Time-wheel algorithm — divide time into slots (buckets). Each slot holds tasks scheduled for that second. A pointer advances every second, executing tasks in the current slot. Efficient for large numbers of tasks with short delays; used internally by Kafka (delayed operations) and Netty (I/O timeouts). For recurring tasks (cron): store the cron expression. After each execution, compute the next fire time with a cron parser library and reschedule.
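The time-wheel idea in (3) fits in a few lines. A hypothetical single-level wheel with one-second slots; real implementations such as Netty's HashedWheelTimer add hierarchical wheels and tighter bookkeeping:

```python
class TimeWheel:
    """Single-level timing wheel sketch: one slot per second.
    Tasks with delays longer than one rotation carry a rounds counter."""

    def __init__(self, slots: int = 60):
        self.slots: list[list] = [[] for _ in range(slots)]
        self.current = 0

    def schedule(self, delay_seconds: int, task_id: str) -> None:
        # rounds = full rotations to wait; offset = slots past the pointer
        rounds, offset = divmod(delay_seconds, len(self.slots))
        slot = (self.current + offset) % len(self.slots)
        self.slots[slot].append((rounds, task_id))

    def tick(self) -> list:
        """Advance the pointer one second; return task ids now due."""
        self.current = (self.current + 1) % len(self.slots)
        bucket = self.slots[self.current]
        due = [tid for rounds, tid in bucket if rounds == 0]
        self.slots[self.current] = [(r - 1, t) for r, t in bucket if r > 0]
        return due


# Usage: a task scheduled 3 seconds out fires on the third tick.
wheel = TimeWheel(slots=8)
wheel.schedule(3, "send-reminder")
print([wheel.tick() for _ in range(3)])  # [[], [], ['send-reminder']]
```

Each tick is O(tasks in the current slot) rather than O(log N) per task, which is why the structure suits millions of short-delay timers.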

Failure Handling and Retries

Tasks fail: networks time out, external APIs return errors, databases become temporarily unavailable. Retry strategy: (1) Exponential backoff — retry after 1s, 2s, 4s, 8s, 16s. Prevents overwhelming a recovering service. (2) Maximum retries — after N failures (e.g., 5), move to the dead letter queue. Do not retry forever. (3) Jitter — add a random delay so that many failed tasks do not all retry against the same service simultaneously. Implementation: on task failure, set status = RETRY, increment retry_count, compute next_retry_time = now + backoff_delay, and update scheduled_time. The scheduler will pick it up at the next retry time. Stuck task detection: a watchdog process queries for tasks with status = IN_PROGRESS and last_heartbeat older than the lease timeout. These tasks are presumed failed (the worker crashed without updating status). Reset them to PENDING for re-execution. Poison tasks: tasks that always fail (buggy code, impossible input). After max_retries, move them to the DLQ and alert the engineering team. The DLQ allows inspection and manual replay after the bug is fixed.
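The backoff-plus-jitter computation is a one-liner. A sketch using the "full jitter" variant (an illustrative choice here; the text only requires exponential growth with some randomness), with a cap so delays stop doubling:

```python
import random


def backoff_delay(retry_count: int, base: float = 1.0, cap: float = 300.0) -> float:
    """Exponential backoff with full jitter: the delay is drawn uniformly
    from [0, min(cap, base * 2**retry_count)], so retries spread out
    instead of stampeding a recovering service in lockstep."""
    return random.uniform(0, min(cap, base * (2 ** retry_count)))


# On failure the scheduler would set:
#   next_retry_time = now + backoff_delay(task.retry_count)
for attempt in range(5):
    print(f"retry {attempt}: sleep up to {min(300.0, 2.0 ** attempt):.0f}s")
```

Without the jitter term, every task that failed against the same downed dependency retries at the same instant; the uniform draw is what breaks that synchronization.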

Production Task Schedulers

Celery (Python): distributed task queue using Redis or RabbitMQ as the broker. Supports immediate, delayed, and periodic (celery beat) tasks. Workers pull tasks from the queue. Simple to set up, widely used in Python applications. Limitation: celery beat (the scheduler for periodic tasks) is a single process and therefore a single point of failure; use celery-redbeat for distributed scheduling. Temporal (built by the creators of Uber's Cadence): a durable execution platform. Tasks are modeled as workflows with activities. Temporal handles retries, timeouts, and crash recovery automatically. Workflow state is durably persisted: if a worker crashes mid-workflow, another worker resumes from the exact step. Best for: complex, multi-step workflows (order processing, user onboarding, long-running data pipelines). Apache Airflow: DAG-based workflow scheduler for data pipelines. Each DAG defines task dependencies, and the scheduler triggers tasks when their dependencies are met. Best for: batch data processing and ETL pipelines. Not designed for: low-latency real-time task execution. In system design interviews, name the scheduler choice and justify it: "We use Redis sorted sets for delayed tasks and a worker pool for execution, with distributed locks for exactly-once guarantees."
