Low Level Design: Distributed Job Scheduler

A distributed job scheduler executes tasks at specified times or intervals across a cluster of workers. It must handle at-least-once execution, failure recovery, backpressure, and fair scheduling across tenants. The core components are a job store, a scheduler process that triggers jobs, and a worker pool that executes them.

Job Store

Jobs are persisted in a database: job_id, type, payload (JSON), schedule (cron expression or next_run_at timestamp), status (pending, running, completed, failed), max_retries, retry_count, created_at, updated_at, last_run_at. Index on (status, next_run_at) for efficient polling. Use a relational database for small-to-medium scale; for high throughput, partition by next_run_at or use a purpose-built job store (Sidekiq backed by Redis, Temporal, Apache Airflow).
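The schema above can be sketched as DDL. This is an illustrative SQLite rendering (column types and the index name are assumptions, not from the original):

```python
import sqlite3

# Hypothetical DDL for the job table described above; column names follow
# the text, types are illustrative (SQLite used for a self-contained demo).
DDL = """
CREATE TABLE jobs (
    job_id      TEXT PRIMARY KEY,
    type        TEXT NOT NULL,
    payload     TEXT NOT NULL,            -- JSON blob
    schedule    TEXT,                     -- cron expression, if recurring
    next_run_at TEXT NOT NULL,            -- UTC ISO-8601 timestamp
    status      TEXT NOT NULL DEFAULT 'pending'
                CHECK (status IN ('pending','running','completed','failed')),
    max_retries INTEGER NOT NULL DEFAULT 3,
    retry_count INTEGER NOT NULL DEFAULT 0,
    created_at  TEXT NOT NULL,
    updated_at  TEXT NOT NULL,
    last_run_at TEXT
);
-- Composite index backing the scheduler's polling query
-- (WHERE status = 'pending' AND next_run_at <= now).
CREATE INDEX idx_jobs_status_next_run ON jobs (status, next_run_at);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
```

The (status, next_run_at) column order matters: the poller filters on an exact status value first, then range-scans next_run_at within it.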

Leader-Based Scheduling

A scheduler process polls the job store for jobs due to run (next_run_at <= now AND status = pending). To avoid duplicate scheduling in a multi-instance deployment, use leader election: only the leader instance polls and schedules. Leader election can be implemented with a distributed lock (Redis SET NX with a TTL, etcd, or ZooKeeper). If the leader fails, a follower acquires the lock and takes over within one TTL period (typically 10-30 seconds).
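The failover behavior can be illustrated with a minimal sketch. The `TTLLock` class below is an in-memory stand-in for the Redis `SET key value NX EX ttl` semantics (in production this state lives in Redis/etcd/ZooKeeper, not in-process); instance names and the 15-second TTL are hypothetical:

```python
class TTLLock:
    """In-memory stand-in for a Redis SET NX EX lease; for illustration only."""

    def __init__(self) -> None:
        self._holder: str | None = None
        self._expires_at = 0.0

    def try_acquire(self, instance_id: str, ttl: float, now: float) -> bool:
        # Acquire only if the lock is unheld or its lease expired (NX + TTL).
        if self._holder is None or now >= self._expires_at:
            self._holder, self._expires_at = instance_id, now + ttl
            return True
        # The current holder may refresh its own lease before expiry.
        if self._holder == instance_id:
            self._expires_at = now + ttl
            return True
        return False

lock = TTLLock()
assert lock.try_acquire("scheduler-1", ttl=15, now=0)       # leader elected
assert not lock.try_acquire("scheduler-2", ttl=15, now=5)   # follower blocked
# Leader crashes and stops refreshing; follower takes over after one TTL.
assert lock.try_acquire("scheduler-2", ttl=15, now=16)
```

The follower's takeover delay is bounded by the TTL, which is why the TTL choice trades failover speed against the risk of a slow-but-alive leader losing its lease.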

Optimistic Locking for Job Acquisition

Workers claim jobs atomically using optimistic locking or database-level locking: UPDATE jobs SET status='running', worker_id=X, started_at=now WHERE job_id=Y AND status='pending'. The rowcount check confirms exclusive acquisition (rowcount=1 means this worker won the claim). Alternatively, use SELECT FOR UPDATE SKIP LOCKED (PostgreSQL) to efficiently claim the next unclaimed job without lock contention.
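The optimistic-locking claim can be demonstrated end to end (SQLite shown for self-containment; SKIP LOCKED is PostgreSQL-specific and not covered here, and job/worker names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (job_id TEXT PRIMARY KEY, status TEXT, worker_id TEXT)")
conn.execute("INSERT INTO jobs VALUES ('job-1', 'pending', NULL)")

def claim(conn: sqlite3.Connection, job_id: str, worker_id: str) -> bool:
    # The WHERE status='pending' predicate makes the claim atomic:
    # of two concurrent UPDATEs, only one can match the row and flip it.
    cur = conn.execute(
        "UPDATE jobs SET status='running', worker_id=? "
        "WHERE job_id=? AND status='pending'",
        (worker_id, job_id),
    )
    return cur.rowcount == 1  # rowcount=1 means this worker won the claim

assert claim(conn, "job-1", "worker-A")      # first claim wins
assert not claim(conn, "job-1", "worker-B")  # second claim sees rowcount=0
```

No explicit lock is taken; the conditional UPDATE itself is the synchronization point, which is why this pattern works on any database with atomic single-row updates.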

Heartbeat and Failure Detection

Running workers periodically update a heartbeat timestamp on the job row. A watchdog process (or the scheduler) monitors for jobs with status=running but heartbeat older than a threshold (e.g., 5 minutes). These are stuck jobs: the worker crashed without completing. Reset their status to pending for retry. Heartbeat interval should be much shorter than the stuck threshold to avoid false positives.
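A watchdog pass over the job table can be sketched as a single conditional UPDATE (the 5-minute threshold matches the text; table layout and timestamps are illustrative):

```python
import sqlite3

STUCK_THRESHOLD = 300  # seconds; per the text, much longer than the heartbeat interval

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (job_id TEXT, status TEXT, heartbeat_at REAL)")
conn.executemany("INSERT INTO jobs VALUES (?,?,?)", [
    ("healthy", "running", 990.0),  # heartbeat 10s ago: worker alive
    ("stuck",   "running", 100.0),  # heartbeat 900s ago: worker crashed
])

def reclaim_stuck(conn: sqlite3.Connection, now: float) -> int:
    # Watchdog pass: flip stale running jobs back to pending for retry.
    cur = conn.execute(
        "UPDATE jobs SET status='pending' "
        "WHERE status='running' AND heartbeat_at < ?",
        (now - STUCK_THRESHOLD,),
    )
    return cur.rowcount

assert reclaim_stuck(conn, now=1000.0) == 1  # only the stale job is reset
```

Because the reset job re-enters the pending pool, any worker can pick it up via the normal claim path; this is also where the at-least-once guarantee comes from.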

Cron Expression Parsing

Recurring jobs use cron expressions (e.g., 0 2 * * * for daily at 2am). After a job completes, compute the next scheduled time by evaluating the cron expression against the current time. Libraries like croniter (Python) or cron-parser (Node.js) handle daylight saving transitions, month-end edge cases, and complex expressions (0 */4 * * * for every 4 hours). Store next_run_at as a UTC timestamp.
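For the simple daily case (`0 2 * * *`), the next-run computation reduces to stdlib datetime arithmetic; the helper below is a hypothetical sketch of that one case, and real schedulers delegate to croniter/cron-parser for full expressions:

```python
from datetime import datetime, timedelta, timezone

def next_daily_run(now: datetime, hour: int, minute: int = 0) -> datetime:
    """Next occurrence of a fixed daily time in UTC (the `0 2 * * *` case)."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:          # today's slot already passed: roll to tomorrow
        candidate += timedelta(days=1)
    return candidate

now = datetime(2026, 3, 1, 14, 30, tzinfo=timezone.utc)
# 2am today is already past, so the next run lands on March 2.
assert next_daily_run(now, 2) == datetime(2026, 3, 2, 2, 0, tzinfo=timezone.utc)
```

Working purely in UTC, as the text recommends, sidesteps the daylight-saving ambiguities that the cron libraries otherwise have to resolve.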

Priority Queues

Jobs are categorized into priority tiers: critical (payment processing), high (email sending), normal (report generation), low (cleanup jobs). Workers pull from higher-priority queues first. In Redis-backed queues (Sidekiq), each priority is a separate list. In database-backed schedulers, add a priority column and ORDER BY priority DESC, next_run_at ASC. Starvation prevention: promote low-priority jobs to higher priority if they wait beyond a threshold.
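The aging-based starvation prevention can be sketched with a heap ordered by effective priority. The numeric tier mapping and the promotion interval below are assumptions, not from the original:

```python
import heapq

# Assumed mapping of the text's tiers to numbers: higher = more urgent.
PRIORITY = {"critical": 3, "high": 2, "normal": 1, "low": 0}
AGING_INTERVAL = 600  # illustrative: one-tier promotion per 600s waited

def effective_priority(tier: str, waited: float) -> int:
    # Starvation prevention: waiting jobs climb one tier per AGING_INTERVAL.
    return PRIORITY[tier] + int(waited // AGING_INTERVAL)

def order_jobs(jobs: list[tuple[str, str, float]]) -> list[str]:
    """jobs: (job_id, tier, waited_seconds). Highest effective priority
    first; ties broken by longest wait. heapq is a min-heap, so negate."""
    heap = [(-effective_priority(t, w), -w, jid) for jid, t, w in jobs]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]

jobs = [("cleanup", "low", 1900), ("email", "high", 30), ("report", "normal", 5)]
# The long-waiting low-priority job has aged past the high tier.
assert order_jobs(jobs) == ["cleanup", "email", "report"]
```

The same effective-priority expression works in a database-backed scheduler as a computed ORDER BY term, so either queue implementation can reuse it.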

Idempotency and Exactly-Once Semantics

At-least-once execution is the standard guarantee: a job may run more than once due to worker crash and retry. Job handlers must be idempotent: running the same job twice produces the same result. Use an idempotency key (job_id) to deduplicate: check whether the job's effect has already been applied before executing. For effectively exactly-once semantics within the database, use the transactional outbox pattern: commit the job's result and its status update in the same database transaction, so either both are applied or neither is.
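Both ideas compose in one place: the sketch below (SQLite for self-containment; table and job names are illustrative) commits the job's effect and its status flip in a single transaction, with the primary key on the results table acting as the idempotency key:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE jobs    (job_id TEXT PRIMARY KEY, status TEXT);
CREATE TABLE results (job_id TEXT PRIMARY KEY, output TEXT);
INSERT INTO jobs VALUES ('job-1', 'running');
""")

def complete_job(conn: sqlite3.Connection, job_id: str, output: str) -> bool:
    """Apply the job's effect and mark it completed in ONE transaction.
    The PRIMARY KEY on results.job_id doubles as the idempotency key:
    a duplicate run fails the insert and the transaction rolls back."""
    try:
        with conn:  # sqlite3 context manager: commit on success, rollback on error
            conn.execute("INSERT INTO results VALUES (?, ?)", (job_id, output))
            conn.execute("UPDATE jobs SET status='completed' WHERE job_id=?",
                         (job_id,))
        return True
    except sqlite3.IntegrityError:
        return False  # effect already applied by an earlier run

assert complete_job(conn, "job-1", "ok")      # first run applies the effect
assert not complete_job(conn, "job-1", "ok")  # crash-and-retry is deduplicated
```

This gives exactly-once *effects* inside the database even though *execution* remains at-least-once; side effects outside the transaction (HTTP calls, emails) still need their own idempotency handling.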

Workflow Orchestration

Complex jobs with dependencies (DAGs) require workflow orchestration beyond simple scheduling. Tools like Apache Airflow, Temporal, and Prefect model dependencies as directed acyclic graphs: task B runs only after task A completes successfully. The orchestrator tracks DAG state, handles partial failures (retry only failed branches), and provides a DAG visualization and monitoring UI. Together, a job scheduler and a workflow engine cover the full spectrum from simple cron jobs to complex pipelines.
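The core dependency-ordering logic is topological sorting, which the Python standard library exposes directly; the pipeline below is a hypothetical example, not one from any of the tools named above:

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: extract -> {clean, enrich} -> report.
dag = {
    "clean":  {"extract"},
    "enrich": {"extract"},
    "report": {"clean", "enrich"},
}

ts = TopologicalSorter(dag)
ts.prepare()
order = []
while ts.is_active():
    ready = ts.get_ready()       # tasks whose dependencies have all completed
    order.extend(sorted(ready))  # a real engine dispatches these in parallel
    for task in ready:
        ts.done(task)            # mark success; a failure would halt this branch

assert order == ["extract", "clean", "enrich", "report"]
```

The get_ready/done loop is exactly where an orchestrator hooks in its retry and partial-failure policy: a task that fails is never marked done, so its downstream branch stays blocked while independent branches proceed.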
