How does the distributed scheduler ensure exactly-once job execution?

Exactly-once execution is achieved by using a distributed lock (e.g., via Redis SET NX or a database compare-and-swap) that a node must acquire before firing a job trigger; the lock TTL is set slightly longer than the expected execution time. The job record is updated to 'running' state with the acquiring node's ID atomically, so competing nodes skip the job on conflict.

How are missed jobs recovered after a scheduler node failure?

A watchdog process or peer node periodically scans for jobs whose next_run_at is in the past and whose lock has expired, then re-enqueues them with a 'missed' flag so the executor can apply catch-up logic. Policies such as fire-once, fire-all, or discard are stored per-job to control how many missed triggers are replayed.

How is the scheduler partitioned across multiple nodes?

Jobs are assigned to partitions (shards) by hashing their job ID modulo the number of scheduler nodes, with each node owning a subset of partitions using consistent hashing to minimize reshuffling on membership changes. A coordination layer such as ZooKeeper or etcd maintains the partition-to-node mapping and triggers rebalancing when a node joins or leaves.

How are timezone-aware schedules stored and computed?

Cron expressions are stored alongside an IANA timezone identifier (e.g., 'America/New_York') rather than an offset, and next-fire times are computed by converting the current instant to the job's local time, advancing by the cron expression, then converting back to UTC. This correctly handles DST transitions, including the duplicate or skipped hours that occur twice a year.

Distributed Scheduler Low-Level Design: Clock-Based Triggering, Exactly-Once Execution, and Partition Tolerance

⏱ 4 min read

The Distributed Scheduler Problem

A scheduler that runs on a single node is a single point of failure. A scheduler that runs on multiple nodes without coordination will execute each job multiple times. The design goal is exactly-once execution of every scheduled job across a cluster of scheduler nodes, with automatic recovery when any node fails.

Partitioning Jobs Across Nodes

Jobs are partitioned by hashing their job_id into N buckets: partition = job_id % N. Each scheduler node owns a disjoint subset of partitions. A cluster of 4 nodes with N=16 partitions means each node scans 4 partitions. Increasing N allows finer-grained rebalancing without changing the hash function.

Leader Election Per Partition

Each partition has exactly one leader responsible for scanning and claiming jobs. Leadership is acquired by one of two approaches:

ZooKeeper ephemeral node: First node to create /schedulers/partition-{n}/leader wins. Node death deletes the ephemeral node, triggering re-election.
Database advisory lock: SELECT pg_try_advisory_lock(partition_id). Lock released automatically when the DB connection drops.

Leaders heartbeat every 10 seconds. Followers monitor the heartbeat and attempt to acquire leadership if the leader goes silent.

Job Schema

jobs(
  job_id        UUID PRIMARY KEY,
  schedule_expr TEXT,        -- cron or RRULE
  timezone      TEXT,        -- e.g. America/New_York
  handler_type  TEXT,
  payload       JSONB,
  status        TEXT,        -- PENDING | CLAIMED | RUNNING | COMPLETED | FAILED
  next_run_at   TIMESTAMPTZ,
  last_run_at   TIMESTAMPTZ,
  claimed_by    TEXT,
  claimed_at    TIMESTAMPTZ,
  attempts      INT DEFAULT 0
)

Job Claim: Exactly-Once Execution

The partition leader issues an atomic claim query every 5 seconds:

UPDATE jobs
SET status = 'CLAIMED',
    claimed_by = :node_id,
    claimed_at = now()
WHERE status = 'PENDING'
  AND next_run_at <= now()
  AND job_id % :N = :partition
LIMIT 100
RETURNING job_id

Because the UPDATE is atomic at the database level, two leaders competing for the same job (e.g., during a split-brain moment) will each claim a disjoint set — one will update 0 rows. No job runs twice.

Execution and Rescheduling

After claim, the leader dispatches each job to a worker pool. The worker:

Executes the job handler with an idempotency key derived from job_id + next_run_at
On success: sets status=COMPLETED, last_run_at=now(), computes and writes the next next_run_at from the schedule expression, resets status=PENDING
On failure: increments attempts, sets status=FAILED if attempts > max_retries, else resets to PENDING with a backoff delay added to next_run_at

Missed Job Recovery

If a node crashes after claiming a job but before completing it, the job stays CLAIMED forever. A recovery query runs every minute:

UPDATE jobs
SET status = 'PENDING', claimed_by = NULL, claimed_at = NULL
WHERE status = 'CLAIMED'
  AND claimed_at < now() - INTERVAL '5 minutes'

The job re-enters the pending pool and will be claimed by any active leader on the next tick.

Timezone-Aware Scheduling

Schedule expressions are stored with a timezone. next_run_at is always stored in UTC. When computing the next run time:

next_run_at = next_occurrence(schedule_expr, timezone=user_tz).astimezone(UTC)

This handles DST transitions correctly. A job scheduled for “9:00 AM America/New_York” fires at the correct wall-clock time even when clocks change.

Catch-Up Window

If the entire scheduler cluster was down for 2 hours and 50 jobs are past-due, blindly executing all of them might cause unintended side effects. The catch-up window limits execution to jobs due within the last 30 minutes. Jobs older than that have their next_run_at advanced to the next scheduled occurrence and skipped. The behavior (catch up vs skip) is configurable per job.

Exactly-Once with Idempotency Keys

For jobs whose handlers have external side effects (send email, charge card), the idempotency key job_id + next_run_at is passed to the downstream system. Even if the job is accidentally executed twice (e.g., during a recovery race), the downstream deduplicates on the idempotency key and the effect is applied once.