The Distributed Scheduler Problem
A scheduler that runs on a single node is a single point of failure. A scheduler that runs on multiple nodes without coordination will execute each job multiple times. The design goal is exactly-once execution of every scheduled job across a cluster of scheduler nodes, with automatic recovery when any node fails.
Partitioning Jobs Across Nodes
Jobs are partitioned by hashing their job_id into N buckets: partition = job_id % N. Each scheduler node owns a disjoint subset of partitions. A cluster of 4 nodes with N=16 partitions means each node scans 4 partitions. Increasing N allows finer-grained rebalancing without changing the hash function.
Leader Election Per Partition
Each partition has exactly one leader responsible for scanning and claiming jobs. Leadership is acquired by one of two approaches:
- ZooKeeper ephemeral node: First node to create
/schedulers/partition-{n}/leaderwins. Node death deletes the ephemeral node, triggering re-election. - Database advisory lock:
SELECT pg_try_advisory_lock(partition_id). Lock released automatically when the DB connection drops.
Leaders heartbeat every 10 seconds. Followers monitor the heartbeat and attempt to acquire leadership if the leader goes silent.
Job Schema
jobs(
job_id UUID PRIMARY KEY,
schedule_expr TEXT, -- cron or RRULE
timezone TEXT, -- e.g. America/New_York
handler_type TEXT,
payload JSONB,
status TEXT, -- PENDING | CLAIMED | RUNNING | COMPLETED | FAILED
next_run_at TIMESTAMPTZ,
last_run_at TIMESTAMPTZ,
claimed_by TEXT,
claimed_at TIMESTAMPTZ,
attempts INT DEFAULT 0
)
Job Claim: Exactly-Once Execution
The partition leader issues an atomic claim query every 5 seconds:
UPDATE jobs
SET status = 'CLAIMED',
claimed_by = :node_id,
claimed_at = now()
WHERE status = 'PENDING'
AND next_run_at <= now()
AND job_id % :N = :partition
LIMIT 100
RETURNING job_id
Because the UPDATE is atomic at the database level, two leaders competing for the same job (e.g., during a split-brain moment) will each claim a disjoint set — one will update 0 rows. No job runs twice.
Execution and Rescheduling
After claim, the leader dispatches each job to a worker pool. The worker:
- Executes the job handler with an idempotency key derived from
job_id + next_run_at - On success: sets
status=COMPLETED,last_run_at=now(), computes and writes the nextnext_run_atfrom the schedule expression, resetsstatus=PENDING - On failure: increments
attempts, setsstatus=FAILEDifattempts > max_retries, else resets toPENDINGwith a backoff delay added tonext_run_at
Missed Job Recovery
If a node crashes after claiming a job but before completing it, the job stays CLAIMED forever. A recovery query runs every minute:
UPDATE jobs
SET status = 'PENDING', claimed_by = NULL, claimed_at = NULL
WHERE status = 'CLAIMED'
AND claimed_at < now() - INTERVAL '5 minutes'
The job re-enters the pending pool and will be claimed by any active leader on the next tick.
Timezone-Aware Scheduling
Schedule expressions are stored with a timezone. next_run_at is always stored in UTC. When computing the next run time:
next_run_at = next_occurrence(schedule_expr, timezone=user_tz).astimezone(UTC)
This handles DST transitions correctly. A job scheduled for “9:00 AM America/New_York” fires at the correct wall-clock time even when clocks change.
Catch-Up Window
If the entire scheduler cluster was down for 2 hours and 50 jobs are past-due, blindly executing all of them might cause unintended side effects. The catch-up window limits execution to jobs due within the last 30 minutes. Jobs older than that have their next_run_at advanced to the next scheduled occurrence and skipped. The behavior (catch up vs skip) is configurable per job.
Exactly-Once with Idempotency Keys
For jobs whose handlers have external side effects (send email, charge card), the idempotency key job_id + next_run_at is passed to the downstream system. Even if the job is accidentally executed twice (e.g., during a recovery race), the downstream deduplicates on the idempotency key and the effect is applied once.
See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering
See also: Uber Interview Guide 2026: Dispatch Systems, Geospatial Algorithms, and Marketplace Engineering