The Distributed Scheduler Problem
A scheduler that runs on a single node is a single point of failure. A scheduler that runs on multiple nodes without coordination will execute each job multiple times. The design goal is exactly-once execution of every scheduled job across a cluster of scheduler nodes, with automatic recovery when any node fails.
Partitioning Jobs Across Nodes
Jobs are partitioned by hashing their job_id into N buckets: partition = hash(job_id) % N (job_id is a UUID, so it is hashed to an integer before the modulo). Each scheduler node owns a disjoint subset of partitions. A cluster of 4 nodes with N=16 partitions means each node scans 4 partitions. Increasing N allows finer-grained rebalancing without changing the hash function.
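A minimal sketch of the bucketing scheme in Python. The helper names (partition_for, partitions_owned) and the round-robin node assignment are illustrative assumptions, not part of the design above; the UUID's integer value is used as the hash because Python's built-in hash() is salted per process and would not be stable across nodes.

```python
import uuid

N_PARTITIONS = 16  # fixed bucket count; rebalancing moves buckets, not jobs

def partition_for(job_id: uuid.UUID, n: int = N_PARTITIONS) -> int:
    """Stable partition assignment: map the 128-bit UUID into n buckets."""
    return job_id.int % n

def partitions_owned(node_index: int, node_count: int,
                     n: int = N_PARTITIONS) -> list[int]:
    """Disjoint round-robin split of partitions across nodes (assumed policy)."""
    return [p for p in range(n) if p % node_count == node_index]

# 4 nodes x 16 partitions: each node scans 4 partitions, no overlap
owned = partitions_owned(0, 4)
```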
Leader Election Per Partition
Each partition has exactly one leader responsible for scanning and claiming jobs. Leadership is acquired by one of two approaches:
- ZooKeeper ephemeral node: the first node to create /schedulers/partition-{n}/leader wins. Node death deletes the ephemeral node, triggering re-election.
- Database advisory lock: SELECT pg_try_advisory_lock(partition_id). The lock is released automatically when the DB connection drops.
Leaders heartbeat every 10 seconds. Followers monitor the heartbeat and attempt to acquire leadership if the leader goes silent.
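The follower side of the heartbeat protocol can be sketched as a simple timeout check. The 30-second timeout (three missed beats) is an assumed policy; the design above only fixes the 10-second heartbeat interval.

```python
import time

HEARTBEAT_INTERVAL = 10.0   # leader writes a heartbeat every 10 s (from the text)
HEARTBEAT_TIMEOUT = 30.0    # assumption: three missed beats before re-election

def should_attempt_election(last_heartbeat: float, now=None,
                            timeout: float = HEARTBEAT_TIMEOUT) -> bool:
    """A follower tries to take leadership once the leader has been
    silent for longer than the timeout."""
    now = time.monotonic() if now is None else now
    return (now - last_heartbeat) > timeout
```

On a positive result the follower would race to create the ephemeral node or grab the advisory lock; losing that race simply means another follower won the election first.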
Job Schema
jobs(
job_id UUID PRIMARY KEY,
schedule_expr TEXT, -- cron or RRULE
timezone TEXT, -- e.g. America/New_York
handler_type TEXT,
payload JSONB,
status TEXT, -- PENDING | CLAIMED | RUNNING | COMPLETED | FAILED
next_run_at TIMESTAMPTZ,
last_run_at TIMESTAMPTZ,
claimed_by TEXT,
claimed_at TIMESTAMPTZ,
attempts INT DEFAULT 0
)
Job Claim: Exactly-Once Execution
The partition leader issues an atomic claim query every 5 seconds:
UPDATE jobs
SET status = 'CLAIMED',
    claimed_by = :node_id,
    claimed_at = now()
WHERE job_id IN (
    SELECT job_id FROM jobs
    WHERE status = 'PENDING'
      AND next_run_at <= now()
      AND abs(hashtext(job_id::text)) % :N = :partition
    ORDER BY next_run_at
    LIMIT 100
    FOR UPDATE SKIP LOCKED  -- don't block on rows another claimer holds
)
RETURNING job_id
-- PostgreSQL does not allow LIMIT directly on UPDATE, hence the subquery.
-- job_id is a UUID, so it is hashed to an integer before the modulo.
Because the UPDATE is atomic at the database level, two leaders competing for the same jobs (e.g., during a split-brain moment) claim disjoint sets: once one leader flips a row to CLAIMED, that row no longer matches the other's status = 'PENDING' predicate, so the second claim touches 0 of those rows. No job is claimed twice.
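The disjoint-claim property can be demonstrated with an in-memory stand-in for the jobs table. The threading lock below plays the role of the database's row-level atomicity; the store and helper are illustrative, not part of the schema above.

```python
import threading

# In-memory stand-in for the jobs table: job_id -> row.
jobs = {f"job-{i}": {"status": "PENDING", "claimed_by": None} for i in range(100)}
_atomic = threading.Lock()  # stands in for the database's atomic UPDATE

def claim_batch(node_id: str, limit: int) -> list[str]:
    """Compare-and-swap claim: only rows still PENDING flip to CLAIMED,
    so two competing claimers always end up with disjoint sets."""
    claimed = []
    with _atomic:
        for job_id, row in jobs.items():
            if row["status"] == "PENDING":
                row["status"] = "CLAIMED"
                row["claimed_by"] = node_id
                claimed.append(job_id)
                if len(claimed) >= limit:
                    break
    return claimed

a = claim_batch("node-a", 60)
b = claim_batch("node-b", 60)  # only gets what node-a left behind
```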
Execution and Rescheduling
After claim, the leader dispatches each job to a worker pool. The worker:
- Executes the job handler with an idempotency key derived from job_id + next_run_at.
- On success: sets status=COMPLETED and last_run_at=now(), computes and writes the next next_run_at from the schedule expression, then resets status=PENDING.
- On failure: increments attempts; sets status=FAILED if attempts > max_retries, else resets to PENDING with a backoff delay added to next_run_at.
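The success/failure transition above can be sketched as a single function. The exponential backoff (2^attempts minutes) and MAX_RETRIES value are assumed policies; the text only specifies "a backoff delay" and "max_retries".

```python
from datetime import datetime, timedelta, timezone

MAX_RETRIES = 5  # assumed configuration value

def after_run(job: dict, succeeded: bool, next_scheduled) -> dict:
    """Apply the post-execution state transition to a job row."""
    now = datetime.now(timezone.utc)
    if succeeded:
        # COMPLETED is transient: the row immediately re-enters the
        # pending pool at its next scheduled occurrence.
        job.update(status="PENDING", last_run_at=now,
                   next_run_at=next_scheduled, attempts=0)
    else:
        job["attempts"] += 1
        if job["attempts"] > MAX_RETRIES:
            job["status"] = "FAILED"
        else:
            # assumed backoff policy: 2^attempts minutes
            delay = timedelta(minutes=2 ** job["attempts"])
            job.update(status="PENDING", next_run_at=now + delay)
    return job
```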
Missed Job Recovery
If a node crashes after claiming a job but before completing it, the job stays CLAIMED forever. A recovery query runs every minute:
UPDATE jobs
SET status = 'PENDING', claimed_by = NULL, claimed_at = NULL
WHERE status = 'CLAIMED'
AND claimed_at < now() - INTERVAL '5 minutes'
The job re-enters the pending pool and will be claimed by any active leader on the next tick.
Timezone-Aware Scheduling
Schedule expressions are stored with a timezone. next_run_at is always stored in UTC. When computing the next run time:
next_run_at = next_occurrence(schedule_expr, timezone=user_tz).astimezone(UTC)
This handles DST transitions correctly. A job scheduled for “9:00 AM America/New_York” fires at the correct wall-clock time even when clocks change.
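For a simple daily schedule, the computation can be shown with the standard-library zoneinfo module. This sketch handles only a fixed daily time rather than full cron/RRULE parsing; the function name is illustrative. Note how the UTC instant shifts by an hour across the US spring-forward transition while the local wall-clock time stays at 9:00.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

def next_daily_occurrence(after_utc: datetime, hour: int, minute: int,
                          tz_name: str) -> datetime:
    """Next wall-clock hour:minute in tz_name strictly after after_utc,
    returned as a UTC instant."""
    tz = ZoneInfo(tz_name)
    local = after_utc.astimezone(tz)
    candidate = local.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= local:
        # aware-datetime arithmetic is wall-clock arithmetic: the UTC
        # offset is recomputed on conversion, so DST is handled for us
        candidate += timedelta(days=1)
    return candidate.astimezone(timezone.utc)

# US DST began 2024-03-10: 9 AM New York is 14:00 UTC before, 13:00 UTC after
before = next_daily_occurrence(
    datetime(2024, 3, 9, 10, 0, tzinfo=timezone.utc), 9, 0, "America/New_York")
after = next_daily_occurrence(
    datetime(2024, 3, 9, 15, 0, tzinfo=timezone.utc), 9, 0, "America/New_York")
```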
Catch-Up Window
If the entire scheduler cluster was down for 2 hours and 50 jobs are past-due, blindly executing all of them might cause unintended side effects. The catch-up window limits execution to jobs due within the last 30 minutes. Jobs older than that have their next_run_at advanced to the next scheduled occurrence and skipped. The behavior (catch up vs skip) is configurable per job.
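The decision logic reduces to a small pure function. The policy names ("catch_up" / "skip") and the return values are illustrative; the 30-minute window comes from the text.

```python
from datetime import datetime, timedelta, timezone

CATCH_UP_WINDOW = timedelta(minutes=30)

def catch_up_action(next_run_at: datetime, now: datetime,
                    policy: str = "catch_up") -> str:
    """Decide what to do with a job after downtime: 'run' it,
    'skip' to the next occurrence, or 'wait' if not yet due."""
    if next_run_at > now:
        return "wait"
    if policy == "skip" or now - next_run_at > CATCH_UP_WINDOW:
        return "skip"   # advance next_run_at without executing
    return "run"
```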
Exactly-Once with Idempotency Keys
For jobs whose handlers have external side effects (send email, charge card), the idempotency key job_id + next_run_at is passed to the downstream system. Even if the job is accidentally executed twice (e.g., during a recovery race), the downstream deduplicates on the idempotency key and the effect is applied once.
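A sketch of the downstream side of this contract, assuming the downstream keeps a durable set of seen keys (the class and key format below are illustrative). A duplicate delivery of the same (job_id, next_run_at) pair is a no-op; the same job's next scheduled run carries a different key and goes through.

```python
class DedupingDownstream:
    """Downstream service that applies each idempotency key at most once."""

    def __init__(self):
        self.seen: set[str] = set()
        self.applied = 0

    def handle(self, job_id: str, next_run_at: str, payload: dict) -> bool:
        key = f"{job_id}:{next_run_at}"  # key format assumed from the text
        if key in self.seen:
            return False  # duplicate delivery: no effect
        self.seen.add(key)
        self.applied += 1  # side effect happens exactly once per key
        return True

svc = DedupingDownstream()
svc.handle("j1", "2026-01-01T09:00Z", {})
svc.handle("j1", "2026-01-01T09:00Z", {})  # recovery-race duplicate: deduped
```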