Cron Service Low-Level Design: Distributed Scheduling, Exactly-Once Execution, and Missed Job Recovery

Requirements and Constraints

A distributed cron service triggers jobs on time-based schedules (standard cron expressions) across a fleet of nodes with exactly-once execution semantics. Functional requirements: register named cron jobs with a cron expression and a handler reference, fire each job at the correct time regardless of which node is leader, guarantee that each scheduled firing executes exactly once even during node failures or restarts, and recover missed firings that were skipped due to downtime.

Key constraints: the system must handle thousands of registered cron jobs, clock skew across nodes must not cause duplicate or missed firings, and the recovery window for missed jobs must be configurable per job type. Execution must be decoupled from scheduling: the cron service only enqueues work, and a separate job scheduler executes it.

Core Data Model

Cron Jobs Table

  • cron_job_id (UUID) — primary key
  • name (varchar, unique) — human-readable identifier
  • schedule (varchar) — standard cron expression, e.g., 0 */6 * * *
  • timezone (varchar) — IANA timezone for schedule evaluation
  • job_type (varchar) — handler registered in the job scheduler
  • payload_template (jsonb) — static parameters for the handler
  • max_missed_firings (int) — how many past-due firings to recover on startup
  • enabled (boolean)
  • last_scheduled_at (timestamptz) — the last firing time that was successfully enqueued
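The schedule column only becomes useful once it is turned into concrete firing times. The following is a minimal sketch of that step, not a full cron parser: the names FIELD_RANGES, expand_field, and next_firing are hypothetical, it does no range validation, it ignores the timezone column (it works on whatever datetime it is given), and it requires day-of-month and day-of-week to both match, which is simpler than standard cron behavior. A production service would more likely rely on an existing cron library.

```python
from datetime import datetime, timedelta, timezone

# (min, max) for minute, hour, day-of-month, month, day-of-week (0 = Sunday)
FIELD_RANGES = [(0, 59), (0, 23), (1, 31), (1, 12), (0, 6)]


def expand_field(expr, lo, hi):
    """Expand one cron field; supports '*', 'a', 'a-b', comma lists, and '/n' steps."""
    values = set()
    for part in expr.split(","):
        base, _, step = part.partition("/")
        if base == "*":
            start, end = lo, hi
        elif "-" in base:
            start, end = (int(x) for x in base.split("-"))
        else:
            start = end = int(base)
        values.update(range(start, end + 1, int(step) if step else 1))
    return values


def next_firing(schedule, after):
    """Return the first whole minute strictly after `after` that matches `schedule`."""
    fields = schedule.split()
    if len(fields) != 5:
        raise ValueError("expected 5 cron fields")
    minutes, hours, days, months, weekdays = (
        expand_field(f, lo, hi) for f, (lo, hi) in zip(fields, FIELD_RANGES)
    )
    candidate = after.replace(second=0, microsecond=0) + timedelta(minutes=1)
    for _ in range(366 * 24 * 60):  # brute-force scan, bounded to about one year
        if (candidate.minute in minutes and candidate.hour in hours
                and candidate.day in days and candidate.month in months
                and candidate.isoweekday() % 7 in weekdays):
            return candidate
        candidate += timedelta(minutes=1)
    raise ValueError("no firing within one year")


start = datetime(2024, 1, 1, 6, 0, tzinfo=timezone.utc)
print(next_firing("0 */6 * * *", start))
```

The minute-by-minute scan is easy to verify but slow for sparse schedules; it is chosen here for clarity only.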

Firing Log Table

  • firing_id (UUID) — primary key
  • cron_job_id (UUID)
  • scheduled_at (timestamptz) — the nominal firing time per the cron schedule
  • enqueued_at (timestamptz) — actual time the job was enqueued
  • job_id (UUID) — FK to the job scheduler's jobs table
  • status (enum) — ENQUEUED, SUCCEEDED, FAILED

A unique index on (cron_job_id, scheduled_at) is the exactly-once enforcement mechanism — inserting a duplicate firing for the same nominal time fails with a unique constraint violation.
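As an illustration, the two tables and the unique index can be sketched with Python's built-in sqlite3 module. This is an assumption made for the sake of a runnable example: the column types above (UUID, jsonb, timestamptz) suggest a PostgreSQL-style database, so text columns stand in for them here, and the helper name record_firing and index name firing_once are hypothetical.

```python
import sqlite3
import uuid

SCHEMA = """
CREATE TABLE cron_jobs (
    cron_job_id        TEXT PRIMARY KEY,
    name               TEXT NOT NULL UNIQUE,
    schedule           TEXT NOT NULL,
    timezone           TEXT NOT NULL,
    job_type           TEXT NOT NULL,
    payload_template   TEXT NOT NULL DEFAULT '{}',
    max_missed_firings INTEGER NOT NULL DEFAULT 1,
    enabled            INTEGER NOT NULL DEFAULT 1,
    last_scheduled_at  TEXT
);
CREATE TABLE firing_log (
    firing_id    TEXT PRIMARY KEY,
    cron_job_id  TEXT NOT NULL REFERENCES cron_jobs(cron_job_id),
    scheduled_at TEXT NOT NULL,
    enqueued_at  TEXT NOT NULL,
    job_id       TEXT,
    status       TEXT NOT NULL CHECK (status IN ('ENQUEUED', 'SUCCEEDED', 'FAILED'))
);
CREATE UNIQUE INDEX firing_once ON firing_log (cron_job_id, scheduled_at);
"""


def record_firing(conn, cron_job_id, scheduled_at, enqueued_at):
    """Return True if this call claimed the firing, False if it was already claimed."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO firing_log "
                "(firing_id, cron_job_id, scheduled_at, enqueued_at, status) "
                "VALUES (?, ?, ?, ?, 'ENQUEUED')",
                (str(uuid.uuid4()), cron_job_id, scheduled_at, enqueued_at),
            )
        return True
    except sqlite3.IntegrityError:
        # The unique index rejected a second row for the same nominal time.
        return False


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
job_id = str(uuid.uuid4())
conn.execute(
    "INSERT INTO cron_jobs (cron_job_id, name, schedule, timezone, job_type) "
    "VALUES (?, 'report', '0 */6 * * *', 'UTC', 'build_report')",
    (job_id,),
)
first = record_firing(conn, job_id, "2024-01-01T06:00:00+00:00", "2024-01-01T05:59:57+00:00")
second = record_firing(conn, job_id, "2024-01-01T06:00:00+00:00", "2024-01-01T05:59:58+00:00")
print(first, second)
```

The second call models another node attempting the same nominal firing; the constraint violation is treated as "already handled" rather than as an error.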

Key Algorithms and Logic

Leader-Based Scheduling

The cron service uses leader election (via a distributed lock or a Raft-based service) so that only one node evaluates schedules at any time. The leader runs a tick loop once per second:

  • For each enabled cron job, compute the next firing time after last_scheduled_at using the cron expression parser.
  • If next_firing_time <= NOW() + clock_skew_buffer (e.g., 5 seconds), attempt to insert a row into the firing log with scheduled_at = next_firing_time.
  • On successful insert (no unique conflict), enqueue a job into the job scheduler and update last_scheduled_at.
  • On unique conflict (another node already fired this interval), skip silently.
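The four steps above can be sketched as a single function. This is an in-memory illustration under stated assumptions: a Python set stands in for the unique index, a plain list stands in for the job scheduler's queue, and every_six_hours is a hypothetical stand-in for a real cron expression parser. The names tick, claimed, and enqueue are likewise illustrative.

```python
from datetime import datetime, timedelta, timezone

CLOCK_SKEW_BUFFER = timedelta(seconds=5)


def tick(jobs, claimed, enqueue, next_firing, now):
    """Run one pass of the leader's tick loop and return how many jobs were enqueued."""
    enqueued = 0
    for job in jobs:
        if not job["enabled"]:
            continue
        due = next_firing(job["schedule"], job["last_scheduled_at"])
        if due > now + CLOCK_SKEW_BUFFER:
            continue  # not due yet, even allowing for the skew buffer
        key = (job["id"], due)
        if key in claimed:
            # Unique conflict: another node already recorded this firing. In the
            # real tables that node also advanced last_scheduled_at, so the next
            # tick would read the newer value.
            continue
        claimed.add(key)
        enqueue(job["id"], due)  # run_after is the nominal firing time
        job["last_scheduled_at"] = due
        enqueued += 1
    return enqueued


def every_six_hours(schedule, after):
    """Stand-in for a cron parser: next 00:00/06:00/12:00/18:00 strictly after `after`."""
    midnight = after.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight + timedelta(hours=(after.hour // 6 + 1) * 6)


utc = timezone.utc
jobs = [{"id": "report", "schedule": "0 */6 * * *", "enabled": True,
         "last_scheduled_at": datetime(2024, 1, 1, 0, 0, tzinfo=utc)}]
claimed, queue = set(), []


def enqueue(job_id, run_after):
    queue.append((job_id, run_after))


# Too early: 06:00 is more than 5 seconds away.
tick(jobs, claimed, enqueue, every_six_hours, datetime(2024, 1, 1, 5, 59, 50, tzinfo=utc))
# Within the buffer: the 06:00 firing is enqueued.
tick(jobs, claimed, enqueue, every_six_hours, datetime(2024, 1, 1, 5, 59, 57, tzinfo=utc))
# The next firing is now 12:00, so nothing happens.
tick(jobs, claimed, enqueue, every_six_hours, datetime(2024, 1, 1, 5, 59, 58, tzinfo=utc))
print(queue)
```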

Clock Skew Handling

The leader enqueues jobs up to clock_skew_buffer seconds early to absorb clock drift across nodes. The job scheduler's run_after field is set to the exact scheduled_at time, so execution does not start early even when the job was enqueued early. This separates scheduling (when to enqueue) from execution timing (when to run).
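The execution side of this split can be illustrated with a small delayed queue that releases a job only once its run_after time has arrived. The class DelayedQueue and its methods are hypothetical and only model the behavior described; the real job scheduler's interface is not specified in this design.

```python
import heapq
from datetime import datetime, timezone


class DelayedQueue:
    """Toy job queue that holds each job until its run_after time."""

    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker so payloads are never compared

    def enqueue(self, run_after, payload):
        heapq.heappush(self._heap, (run_after, self._seq, payload))
        self._seq += 1

    def pop_ready(self, now):
        """Remove and return every payload whose run_after is at or before `now`."""
        ready = []
        while self._heap and self._heap[0][0] <= now:
            ready.append(heapq.heappop(self._heap)[2])
        return ready


utc = timezone.utc
q = DelayedQueue()
# Enqueued a few seconds early by the leader, but not runnable until 06:00:00.
q.enqueue(datetime(2024, 1, 1, 6, 0, tzinfo=utc), "report")
early = q.pop_ready(datetime(2024, 1, 1, 5, 59, 58, tzinfo=utc))
on_time = q.pop_ready(datetime(2024, 1, 1, 6, 0, 0, tzinfo=utc))
print(early, on_time)
```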

Missed Job Recovery

On leader election or service restart, the recovery process:

  • For each enabled cron job, compute all firing times between last_scheduled_at and NOW() using the cron expression.
  • Limit recovery to the most recent max_missed_firings intervals (older missed firings are skipped to avoid overwhelming the job scheduler after a long outage).
  • Attempt to insert each missed firing into the firing log; conflicts are ignored (already processed by another node before the outage).
  • Successfully inserted firings are enqueued into the job scheduler with run_after = NOW() (immediate execution, since the scheduled time has passed).
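The first two recovery steps (enumerate missed firings, then keep only the most recent ones) can be sketched as follows. The function name missed_firings is hypothetical, and next_firing is assumed to be any callable that returns the next nominal firing strictly after a given time; an hourly stand-in is used here in place of a real cron parser.

```python
from collections import deque
from datetime import datetime, timedelta, timezone


def missed_firings(last_scheduled_at, now, max_missed_firings, next_firing):
    """Return (firings to recover, number of older firings skipped)."""
    # A bounded deque keeps only the newest entries, so memory stays small
    # even after a long outage of a frequently firing job.
    recent = deque(maxlen=max_missed_firings)
    total = 0
    cursor = next_firing(last_scheduled_at)
    while cursor <= now:
        recent.append(cursor)
        total += 1
        cursor = next_firing(cursor)
    return list(recent), total - len(recent)


def hourly(after):
    """Stand-in for a cron parser: the next top of the hour strictly after `after`."""
    return after.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)


utc = timezone.utc
to_recover, skipped = missed_firings(
    last_scheduled_at=datetime(2024, 1, 1, 0, 0, tzinfo=utc),
    now=datetime(2024, 1, 1, 5, 30, tzinfo=utc),
    max_missed_firings=3,
    next_firing=hourly,
)
print([t.hour for t in to_recover], skipped)
```

Each returned firing would then go through the same insert-and-enqueue path as a normal tick, with run_after set to the current time. The skipped count is worth recording, since it is what an alert on missed firings exceeding the recovery limit would be based on.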

Exactly-Once Guarantee

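The unique index on (cron_job_id, scheduled_at) in the firing log ensures that even if two nodes simultaneously attempt to schedule the same firing (e.g., during a brief split-brain), only one insert succeeds. The firing log insert, the job enqueue, and the last_scheduled_at update are wrapped in a single database transaction, so a partial failure (insert succeeds, enqueue fails) rolls back the insert and the next tick retries. This relies on the job scheduler's jobs table living in the same database as the firing log, which the job_id foreign key already implies; if the queue were a separate system, a single transaction could not cover both writes.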
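A minimal sketch of that transaction, using sqlite3 so it runs anywhere: the table layouts are trimmed to the columns involved, and fire_once, open_db, and the fail_enqueue flag (which simulates an enqueue failure) are hypothetical names introduced for illustration.

```python
import sqlite3
import uuid


def open_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE cron_jobs (cron_job_id TEXT PRIMARY KEY, last_scheduled_at TEXT);
        CREATE TABLE jobs (job_id TEXT PRIMARY KEY, run_after TEXT);
        CREATE TABLE firing_log (
            firing_id TEXT PRIMARY KEY, cron_job_id TEXT, scheduled_at TEXT, job_id TEXT,
            UNIQUE (cron_job_id, scheduled_at)
        );
        INSERT INTO cron_jobs VALUES ('report', NULL);
    """)
    return conn


def fire_once(conn, cron_job_id, scheduled_at, fail_enqueue=False):
    """Claim a firing and enqueue its job atomically; return True if claimed."""
    job_id = str(uuid.uuid4())
    try:
        with conn:  # one transaction: commit if the body succeeds, else roll back
            conn.execute(
                "INSERT INTO firing_log VALUES (?, ?, ?, ?)",
                (str(uuid.uuid4()), cron_job_id, scheduled_at, job_id),
            )
            if fail_enqueue:
                raise RuntimeError("simulated enqueue failure")
            conn.execute("INSERT INTO jobs VALUES (?, ?)", (job_id, scheduled_at))
            conn.execute(
                "UPDATE cron_jobs SET last_scheduled_at = ? WHERE cron_job_id = ?",
                (scheduled_at, cron_job_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # another writer already claimed this firing


conn = open_db()
when = "2024-01-01T06:00:00+00:00"
try:
    fire_once(conn, "report", when, fail_enqueue=True)
except RuntimeError:
    pass  # the firing_log insert was rolled back with the failed enqueue
rows_after_failure = conn.execute("SELECT COUNT(*) FROM firing_log").fetchone()[0]
retry = fire_once(conn, "report", when)      # the next tick succeeds
duplicate = fire_once(conn, "report", when)  # a second attempt is rejected
print(rows_after_failure, retry, duplicate)
```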

API Design

  • POST /cron-jobs — register a new cron job; body: { name, schedule, timezone, job_type, payload_template, max_missed_firings }.
  • PUT /cron-jobs/{name} — update schedule or payload template; changes take effect at the next tick.
  • DELETE /cron-jobs/{name} — disable and remove a cron job.
  • GET /cron-jobs/{name}/firings?from=&to= — paginated firing history with status of each execution.
  • POST /cron-jobs/{name}/trigger — manually trigger an immediate firing outside the normal schedule (does not insert a firing log row).
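For the registration endpoint, the request body can be checked before anything is written. The sketch below is framework-agnostic and makes several assumptions not stated above: the function name validate_cron_job, the specific rules, and the treatment of payload_template and max_missed_firings as optional are all illustrative. It only checks the shape of the schedule (five fields) and does not verify that the timezone is a real IANA name.

```python
REQUIRED_STRINGS = ("name", "schedule", "timezone", "job_type")


def validate_cron_job(body):
    """Return a list of problems with a POST /cron-jobs body; empty means acceptable."""
    errors = []
    for field in REQUIRED_STRINGS:
        value = body.get(field)
        if not isinstance(value, str) or not value.strip():
            errors.append(f"{field} must be a non-empty string")
    schedule = body.get("schedule")
    if isinstance(schedule, str) and schedule.strip() and len(schedule.split()) != 5:
        errors.append("schedule must have 5 space-separated fields")
    if not isinstance(body.get("payload_template", {}), dict):
        errors.append("payload_template must be a JSON object")
    missed = body.get("max_missed_firings", 0)
    # bool is a subclass of int in Python, so exclude it explicitly
    if not isinstance(missed, int) or isinstance(missed, bool) or missed < 0:
        errors.append("max_missed_firings must be a non-negative integer")
    return errors


good = {
    "name": "nightly-report",
    "schedule": "0 */6 * * *",
    "timezone": "UTC",
    "job_type": "build_report",
    "payload_template": {"format": "pdf"},
    "max_missed_firings": 3,
}
bad = {"name": "x", "schedule": "* * *", "timezone": "UTC",
       "job_type": "t", "max_missed_firings": -1}
print(validate_cron_job(good))
print(validate_cron_job(bad))
```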

Scalability Considerations

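  • Thousands of cron jobs: the tick loop iterates over all enabled jobs. To avoid evaluating every cron expression every second, persist each job's precomputed next firing time (for example in an indexed next_firing_at column, which would be an addition to the data model above) and query only the jobs due within the next 60 seconds. An index on last_scheduled_at alone cannot answer that query, because the next firing also depends on the schedule.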
  • High-frequency schedules: for jobs that fire every minute (the finest granularity of a standard five-field cron expression) or nearly that often, verify that execution time is reliably shorter than the schedule interval, or use a dedicated worker pool with its own rate limit.
  • Multi-region: run the cron service leader in one region; other regions act as standby. Use a global distributed lock (e.g., across a 3-region Redis Sentinel cluster) for leader election. Failover completes within the election timeout.
  • Firing log size: archive firing log rows older than 90 days to cold storage; retain recent rows for recovery and audit.
  • Observability: alert on firing lag (difference between scheduled_at and enqueued_at exceeding 30 seconds), on missed firings exceeding the recovery limit, and on cron jobs with consistently failing executions.
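The firing-lag alert reduces to a comparison of two columns in the firing log. A minimal sketch, assuming firing rows have already been loaded as dictionaries; the function name lagging_firings is hypothetical, and how the result is routed to an alerting system is left open.

```python
from datetime import datetime, timedelta, timezone

LAG_THRESHOLD = timedelta(seconds=30)


def lagging_firings(firings, threshold=LAG_THRESHOLD):
    """Return (cron_job_id, lag) for firings enqueued too long after scheduled_at."""
    late = []
    for firing in firings:
        lag = firing["enqueued_at"] - firing["scheduled_at"]
        # Early enqueues (within the clock skew buffer) give a negative lag
        # and are never flagged.
        if lag > threshold:
            late.append((firing["cron_job_id"], lag))
    return late


utc = timezone.utc
firings = [
    {"cron_job_id": "report",
     "scheduled_at": datetime(2024, 1, 1, 6, 0, tzinfo=utc),
     "enqueued_at": datetime(2024, 1, 1, 5, 59, 57, tzinfo=utc)},   # 3 s early
    {"cron_job_id": "cleanup",
     "scheduled_at": datetime(2024, 1, 1, 6, 0, tzinfo=utc),
     "enqueued_at": datetime(2024, 1, 1, 6, 0, 45, tzinfo=utc)},    # 45 s late
]
late = lagging_firings(firings)
print(late)
```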
