System Design Interview: Design a Distributed Job Scheduler
A distributed job scheduler runs tasks reliably across a cluster of machines. It powers cron jobs, ETL pipelines, ML training jobs, and background task processing. Systems like Apache Airflow, Celery, AWS Batch, and Kubernetes Jobs solve this problem. This guide covers the key design decisions for a system design interview.
Requirements
Functional: schedule one-time and recurring jobs (cron expressions), execute jobs on a pool of workers, track job status (pending/running/succeeded/failed), retry failed jobs, support job dependencies (DAGs), provide an API and UI for monitoring.
Non-functional: at-least-once execution, job starts within 1 minute of scheduled time, handles 1M jobs/day, workers can be added/removed without downtime.
Core Components
1. Job Store (Persistent Queue)
Jobs are stored in a database with their schedule and state:
CREATE TABLE jobs (
id UUID PRIMARY KEY,
name VARCHAR(255),
cron_expr VARCHAR(100), -- "0 * * * *" = every hour
next_run_at TIMESTAMP, -- pre-computed next execution time
status ENUM('pending','running','succeeded','failed'),
payload JSONB, -- job parameters
max_retries INT DEFAULT 3,
retry_count INT DEFAULT 0,
created_at TIMESTAMP,
updated_at TIMESTAMP
);
CREATE INDEX idx_next_run ON jobs(status, next_run_at); -- equality column first, range column second, to match the poll query
2. Scheduler Service
The scheduler polls for jobs due to run and enqueues them:
-- Every 10 seconds, inside one transaction:
SELECT id FROM jobs
WHERE next_run_at <= NOW() AND status = 'pending'
LIMIT 1000
FOR UPDATE SKIP LOCKED; -- distributed locking: each scheduler claims different rows

-- For each selected job, still inside the transaction:
UPDATE jobs SET status = 'running', updated_at = NOW() WHERE id = ?;
-- Enqueue to the task queue (Kafka / SQS / Redis Queue)
-- For recurring jobs: next_run_at = compute_next(cron_expr, NOW())
FOR UPDATE SKIP LOCKED: a row-locking mechanism (PostgreSQL, MySQL 8+) that doubles as distributed locking. Multiple scheduler instances can run concurrently; each atomically claims a different batch of rows, because rows locked by one transaction are skipped by the others. No separate distributed lock (ZooKeeper/Redis) is needed. Note the clause comes after LIMIT in PostgreSQL syntax.
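The claim step can be mimicked in-process to see why concurrent schedulers never grab the same job. Below is a minimal in-memory analogue, not real database code: a mutex stands in for the row locks, and names like JobTable and claim_due_jobs are illustrative.

```python
import threading
import time

class JobTable:
    """In-memory stand-in for the jobs table; the mutex plays the role
    of the database's row locks, so each claim is atomic."""
    def __init__(self, jobs):
        self._lock = threading.Lock()
        self._jobs = jobs  # job_id -> {"next_run_at": float, "status": str}

    def claim_due_jobs(self, now, limit):
        # Analogue of: SELECT id ... WHERE next_run_at <= now AND status='pending'
        #              LIMIT :limit FOR UPDATE SKIP LOCKED; then UPDATE to 'running'.
        with self._lock:
            due = [jid for jid, j in self._jobs.items()
                   if j["next_run_at"] <= now and j["status"] == "pending"][:limit]
            for jid in due:
                self._jobs[jid]["status"] = "running"
            return due

table = JobTable({i: {"next_run_at": 0.0, "status": "pending"} for i in range(100)})
claimed = []

def scheduler():
    # Each "scheduler instance" loops, claiming batches until nothing is due.
    while True:
        batch = table.claim_due_jobs(now=time.time(), limit=10)
        if not batch:
            break
        claimed.append(batch)

threads = [threading.Thread(target=scheduler) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()

all_ids = [jid for batch in claimed for jid in batch]
print(len(all_ids), len(set(all_ids)))  # 100 100 -> every job claimed exactly once
```

Four concurrent "schedulers" drain 100 due jobs with no job ever claimed twice, which is exactly the guarantee SKIP LOCKED provides across processes.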
3. Task Queue
The scheduler enqueues jobs to a distributed queue. Workers consume from this queue:
- Kafka: good for high-throughput, ordered processing within a partition, durable replay. Use for event-driven workflows.
- SQS/RabbitMQ: operationally simpler; standard SQS and RabbitMQ deliver at-least-once, while SQS FIFO queues offer exactly-once processing via a deduplication ID. Good for background task processing.
- Redis Queue (Celery default): simple, low-latency, but Redis is not durable by default — risk of job loss on restart. Use AOF persistence.
4. Worker Pool
Workers are stateless processes that pull jobs from the queue and execute them:
while True:
    job = queue.dequeue(timeout=30)  # long-poll; returns None on timeout
    if job is None:
        continue
    try:
        result = execute(job)
        mark_succeeded(job.id, result)
    except Exception as e:
        if job.retry_count < job.max_retries:
            job.retry_count += 1
            reschedule_with_backoff(job)  # exponential backoff: 1min, 5min, 30min
        else:
            mark_failed(job.id, str(e))   # retries exhausted — permanently failed
Workers are horizontally scalable — add more workers to handle more jobs. Each worker should have a heartbeat timeout: if a worker dies mid-job, a watchdog service detects the missed heartbeat and re-enqueues the job.
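The watchdog side of the heartbeat pattern is easy to sketch. A minimal version, where find_dead_jobs, the data shape, and the 120-second threshold are illustrative choices:

```python
from dataclasses import dataclass

HEARTBEAT_TIMEOUT = 120  # seconds without a heartbeat before a worker is presumed dead

@dataclass
class RunningJob:
    job_id: str
    last_heartbeat: float  # unix timestamp, updated periodically by the worker

def find_dead_jobs(running_jobs, now):
    """Return jobs whose worker stopped heartbeating; the caller resets
    these to 'pending' and re-enqueues them (at-least-once semantics)."""
    return [j.job_id for j in running_jobs
            if now - j.last_heartbeat > HEARTBEAT_TIMEOUT]

jobs = [RunningJob("a", last_heartbeat=1000.0),
        RunningJob("b", last_heartbeat=1119.0)]
print(find_dead_jobs(jobs, now=1200.0))  # ['a']  (b heartbeated 81s ago)
```

In production this scan runs as a periodic query against the jobs table (status = 'running' AND heartbeat older than the timeout) rather than over an in-memory list.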
Exactly-Once vs. At-Least-Once
True exactly-once execution is impossible in a distributed system: a worker can always crash between finishing the work and recording that it finished. In practice:
- At-least-once: job may run more than once if the worker crashes after execution but before acknowledging. Make jobs idempotent (use a deduplication key, check-before-insert).
- Transactional outbox pattern: write the job state update and the outgoing queue message in the same database transaction; a separate relay process reads the outbox table and publishes to the queue. This guarantees the job is never lost, but the relay itself may publish duplicates (it can crash after publishing but before marking the row), so downstream idempotency is still required.
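The outbox pattern fits in a few lines of SQL. A sketch using SQLite for brevity; the table layout and relay function are illustrative, and a real deployment would run against the production database and publish to Kafka/SQS:

```python
import sqlite3
import json

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE jobs (id TEXT PRIMARY KEY, status TEXT);
    CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                         payload TEXT, published INTEGER DEFAULT 0);
    INSERT INTO jobs VALUES ('job-1', 'pending');
""")

# One transaction: the state change and the queue message commit together or not at all.
with db:
    db.execute("UPDATE jobs SET status = 'running' WHERE id = 'job-1'")
    db.execute("INSERT INTO outbox (payload) VALUES (?)",
               (json.dumps({"job_id": "job-1"}),))

def relay(publish):
    """Poll unpublished outbox rows and push them to the real queue.
    A crash after publish() but before the UPDATE means the row is sent
    again on the next pass -- at-least-once, so consumers must dedupe."""
    rows = db.execute("SELECT id, payload FROM outbox WHERE published = 0").fetchall()
    for row_id, payload in rows:
        publish(payload)
        db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    db.commit()

sent = []
relay(sent.append)
print(sent)  # ['{"job_id": "job-1"}']
```

Running the relay a second time sends nothing, since the row is marked published; that mark is exactly the step whose failure window makes the relay at-least-once rather than exactly-once.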
DAG Dependencies (Apache Airflow Model)
Some workflows require tasks to run in order: task B only starts after task A succeeds. Model this as a Directed Acyclic Graph (DAG):
CREATE TABLE task_dependencies (
upstream_task_id UUID,
downstream_task_id UUID,
PRIMARY KEY (upstream_task_id, downstream_task_id)
);
# When task A succeeds:
# 1. Check if all upstream dependencies of B are succeeded
# 2. If yes, enqueue B
# Topological sort ensures no cycles
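The two steps above can be sketched directly. A minimal version with an illustrative diamond-shaped DAG; validate_acyclic (Kahn's algorithm) and on_success are hypothetical helper names, not from any framework:

```python
from collections import defaultdict

def validate_acyclic(deps):
    """Kahn's algorithm: the graph is a valid DAG iff every task can be
    ordered; if some task is never reachable, there is a cycle."""
    nodes = set(deps) | {u for ups in deps.values() for u in ups}
    children = defaultdict(set)          # upstream -> tasks waiting on it
    for task, ups in deps.items():
        for up in ups:
            children[up].add(task)
    indegree = {n: len(deps.get(n, ())) for n in nodes}
    ready = [n for n in nodes if indegree[n] == 0]
    seen = 0
    while ready:
        node = ready.pop()
        seen += 1
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    return seen == len(nodes)

def on_success(task, deps, succeeded):
    """Called when `task` succeeds: return downstream tasks whose upstream
    dependencies are now all satisfied (these get enqueued)."""
    succeeded.add(task)
    return [t for t, ups in deps.items()
            if task in ups and ups <= succeeded]

deps = {"B": {"A"}, "C": {"A"}, "D": {"B", "C"}}  # diamond: A -> (B, C) -> D
assert validate_acyclic(deps)  # checked once, at DAG submission time
done = set()
print(on_success("A", deps, done))  # ['B', 'C']
print(on_success("B", deps, done))  # []  -- D still waits on C
print(on_success("C", deps, done))  # ['D']
```

Note that B and C become ready together and can run in parallel; D is only released once both have succeeded.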
Handling Missed Jobs
If the scheduler is down for 2 hours, every hourly job misses two executions. Strategies:
- Catch-up: run all missed executions in sequence when the scheduler recovers (Airflow default)
- Skip: only run the most recent missed execution (suitable for reporting jobs)
- Ignore: skip all missed, resume from next scheduled time (suitable for health checks)
Configurable per job — the scheduler sets the policy when computing next_run_at after recovery.
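The three policies can be expressed as one function over the missed run times. A sketch simplified to a fixed interval instead of full cron parsing; runs_after_recovery and the policy names are illustrative:

```python
HOUR = 3600  # seconds; stand-in for "0 * * * *"

def runs_after_recovery(last_run, now, policy, interval=HOUR):
    """Return the run timestamps to execute when the scheduler comes back up."""
    missed = []
    t = last_run + interval
    while t <= now:
        missed.append(t)
        t += interval
    if policy == "catch_up":   # Airflow-style backfill: run everything missed, in order
        return missed
    if policy == "skip":       # run only the most recent missed execution
        return missed[-1:]
    if policy == "ignore":     # drop them all; resume at the next scheduled time
        return []
    raise ValueError(f"unknown policy: {policy}")

# Scheduler was down from t=0 to t=7200 (two hours); the job last ran at t=0.
print(runs_after_recovery(0, 7200, "catch_up"))  # [3600, 7200]
print(runs_after_recovery(0, 7200, "skip"))      # [7200]
print(runs_after_recovery(0, 7200, "ignore"))    # []
```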
Monitoring and Alerting
- Alert if a job has not completed within 2x its average duration (runaway job)
- Alert if a scheduled job has not started within 5 minutes of its scheduled time (scheduler lag)
- Dashboard: pending/running/failed counts, p99 job duration, failure rate per job type
- Dead letter queue (DLQ): after max retries, move job to DLQ for human investigation
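The runaway-job alert from the first bullet reduces to a simple threshold check. A sketch where the function name and data shapes are illustrative:

```python
def runaway_jobs(running, avg_duration, now, factor=2.0):
    """Flag jobs that have been running longer than `factor` times the
    historical average duration for their job type."""
    return [job_id for job_id, (started_at, job_type) in running.items()
            if now - started_at > factor * avg_duration[job_type]]

avg_duration = {"etl": 600.0, "report": 60.0}    # seconds, per job type
running = {
    "j1": (0.0, "etl"),       # running 1500s vs 2 x 600 = 1200s -> runaway
    "j2": (1200.0, "etl"),    # running 300s -> fine
    "j3": (1350.0, "report"), # running 150s vs 120s -> runaway
}
print(runaway_jobs(running, avg_duration, now=1500.0))  # ['j1', 'j3']
```

In practice the averages come from historical run records, and the check runs as a periodic query over jobs with status = 'running'.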
Interview Tips
- Lead with FOR UPDATE SKIP LOCKED — it is the elegant distributed locking solution many candidates miss
- Address the at-least-once vs. exactly-once question explicitly — say jobs should be idempotent
- Mention the heartbeat/watchdog pattern for dead worker detection
- If asked about Airflow-like DAGs, describe topological dependency resolution
- For scale: "the scheduler itself is not the bottleneck; the workers are. Scale out by adding workers that pull from the queue."