Job Coordinator Low-Level Design: DAG Execution, Dependency Tracking, and Failure Recovery

What a Job Coordinator Solves

A job coordinator manages the execution of complex multi-step workflows where tasks have dependencies on each other. Without coordination, teams build ad-hoc chains of cron jobs that cannot handle failures gracefully, have no visibility into partial progress, and cannot efficiently re-run only the failed portion of a long pipeline. A job coordinator gives you dependency-aware scheduling, fault-tolerant execution, and observability into every task's state.

DAG Representation

Jobs are modeled as Directed Acyclic Graphs (DAGs). Nodes represent tasks; directed edges represent dependencies. An edge from task A to task B means B cannot start until A has successfully completed. The coordinator validates on job submission that the graph is acyclic — a cycle would cause deadlock (A waits for B, B waits for A).
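The acyclicity check at submission time is typically a Kahn-style topological sort: if every node can be visited by repeatedly removing zero-in-degree nodes, the graph has no cycle. A minimal sketch, assuming the submitted DAG arrives as a mapping of task_id to its dependency list (function and variable names are illustrative):

```python
from collections import deque

def validate_acyclic(deps: dict[str, list[str]]) -> bool:
    """Return True if the dependency graph has no cycles (Kahn's algorithm)."""
    # in_degree[t] = number of not-yet-visited dependencies of task t
    in_degree = {t: len(d) for t, d in deps.items()}
    # dependents[a] = tasks that list a as a dependency
    dependents: dict[str, list[str]] = {t: [] for t in deps}
    for task, ds in deps.items():
        for d in ds:
            dependents[d].append(task)
    ready = deque(t for t, n in in_degree.items() if n == 0)
    visited = 0
    while ready:
        task = ready.popleft()
        visited += 1
        for dep in dependents[task]:
            in_degree[dep] -= 1
            if in_degree[dep] == 0:
                ready.append(dep)
    # Members of a cycle never reach in-degree zero, so they go unvisited
    return visited == len(deps)
```

Rejecting cyclic submissions up front is cheaper than detecting the resulting deadlock at runtime.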

DAG storage schema:

tasks (
  job_id       UUID,
  task_id      VARCHAR,
  deps         VARCHAR[],   -- task_ids that must succeed first
  status       ENUM,
  worker_id    VARCHAR,
  lease_expires TIMESTAMPTZ,
  attempts     INT,
  created_at   TIMESTAMPTZ
)

The deps array stores the task IDs that must be in SUCCEEDED state before this task can be enqueued. Storing deps denormalized per-task avoids joins during the scheduling hot path.

Task State Machine

Each task transitions through a defined set of states:

  • PENDING: Created, waiting for dependencies to complete
  • QUEUED: All dependencies succeeded; placed in the work queue
  • RUNNING: Claimed by a worker, actively executing
  • SUCCEEDED: Completed successfully; downstream tasks may now be unblocked
  • RETRYING: Failed attempt, within retry budget, waiting for backoff delay
  • FAILED: Exceeded retry limit; human intervention required

Transitions are recorded with timestamps. The full state history enables latency analysis (why did this task sit in QUEUED for 10 minutes?) and retry auditing.
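The legal transitions can be encoded as a small table that the coordinator consults before persisting any state change, so an out-of-order event (e.g., a stale worker reporting success on a task already marked FAILED) is rejected rather than silently recorded. An illustrative sketch; the helper name is an assumption:

```python
# Allowed transitions for the task state machine described above.
ALLOWED = {
    "PENDING":   {"QUEUED"},
    "QUEUED":    {"RUNNING"},
    "RUNNING":   {"SUCCEEDED", "RETRYING", "FAILED"},
    "RETRYING":  {"QUEUED"},      # re-enqueued once the backoff delay elapses
    "SUCCEEDED": set(),           # terminal
    "FAILED":    set(),           # terminal until a human re-runs the job
}

def transition(current: str, target: str) -> str:
    """Validate a state change; raise instead of corrupting the state history."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```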

Scheduler Loop

The scheduler runs continuously. On each tick it queries for tasks where: (1) status is PENDING and (2) all tasks listed in deps have status SUCCEEDED. Qualifying tasks are moved to QUEUED and written to the work queue. The scheduler must handle concurrent writes safely — use optimistic locking or a SELECT FOR UPDATE SKIP LOCKED pattern to avoid double-enqueuing tasks in a multi-scheduler setup.

Priority queues allow higher-priority jobs to jump ahead of lower-priority ones within the work queue. Per-job max parallelism limits how many of a single job's tasks can run concurrently, preventing one large job from starving all others.
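A single scheduler tick can be sketched in memory — assuming each task row carries `status` and `deps` as in the schema above; locking, priorities, and the real work queue are elided:

```python
def tick(tasks: dict[str, dict]) -> list[str]:
    """One scheduler pass: move PENDING tasks whose deps all SUCCEEDED to QUEUED.

    tasks maps task_id -> {"status": ..., "deps": [...]}; returns enqueued ids.
    In a multi-scheduler deployment this read-check-write must be guarded with
    optimistic locking or SELECT ... FOR UPDATE SKIP LOCKED to avoid
    double-enqueuing the same task.
    """
    enqueued = []
    for task_id, row in tasks.items():
        if row["status"] != "PENDING":
            continue
        if all(tasks[d]["status"] == "SUCCEEDED" for d in row["deps"]):
            row["status"] = "QUEUED"   # also written to the work queue
            enqueued.append(task_id)
    return enqueued
```

Note that `report` below stays PENDING until a later tick, after `parse` succeeds — readiness is re-evaluated continuously, not computed once at submission.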

Worker Lease and Heartbeat

Workers claim tasks using a lease mechanism. When a worker picks a task, it writes its worker_id and a lease_expires timestamp (e.g., now + 60 seconds) to the task record. The worker must heartbeat — extend the lease — every 30 seconds while the task is running. If the worker crashes, heartbeats stop, and the lease expires. The scheduler detects expired leases and reassigns the task to a new worker. This provides automatic recovery from worker failures without manual intervention.

Because heartbeats extend the lease, the TTL does not need to cover the whole task runtime — it needs to comfortably exceed the heartbeat interval (2–3× is a common margin) so that a single missed or delayed heartbeat does not trigger a false reassignment. For long-running tasks, pair a generous lease TTL (5–10 minutes) with frequent heartbeats (every 30 seconds): a crashed worker is still detected within minutes, while a slow but healthy task is never incorrectly reassigned.
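The lease lifecycle — claim, heartbeat, expiry detection — reduces to comparisons against a timestamp on the task record. A minimal sketch; field names mirror the schema, and `now` is passed in explicitly to keep the logic testable:

```python
LEASE_TTL = 60.0  # seconds; comfortably above a 30 s heartbeat interval

def claim(task: dict, worker_id: str, now: float) -> bool:
    """Claim a task if it is unowned or its previous lease has expired."""
    if task.get("worker_id") is None or now >= task["lease_expires"]:
        task["worker_id"] = worker_id
        task["lease_expires"] = now + LEASE_TTL
        return True
    return False

def heartbeat(task: dict, worker_id: str, now: float) -> None:
    """Extend the lease; only the current owner may do so."""
    assert task["worker_id"] == worker_id, "lost the lease"
    task["lease_expires"] = now + LEASE_TTL

def is_expired(task: dict, now: float) -> bool:
    """Scheduler-side check: an expired lease means the worker likely crashed."""
    return task.get("worker_id") is not None and now >= task["lease_expires"]
```

In production the claim must be a conditional write (compare-and-set on the task row), so two workers racing for the same task cannot both succeed.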

Task Output Passing

Tasks in a DAG often need to pass data to downstream tasks. The pattern: each task writes its output to shared storage (S3, GCS, or a database table) keyed by task_id. Downstream tasks read their dependencies' outputs by querying the storage for each upstream task_id. This avoids passing large payloads through the task queue and allows outputs to be reused across multiple downstream tasks without duplication.
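The pattern can be sketched with an in-memory dict standing in for the object store; the key scheme and helper names are illustrative assumptions:

```python
import json

# Stand-in for S3/GCS: object key -> serialized payload
store: dict[str, bytes] = {}

def write_output(job_id: str, task_id: str, payload: dict) -> str:
    """Upstream task persists its output keyed by (job, task); returns the key."""
    key = f"{job_id}/{task_id}/output.json"
    store[key] = json.dumps(payload).encode()
    return key

def read_dep_outputs(job_id: str, dep_ids: list[str]) -> dict[str, dict]:
    """Downstream task fetches the output of every upstream task it depends on."""
    return {
        dep: json.loads(store[f"{job_id}/{dep}/output.json"])
        for dep in dep_ids
    }
```

Keying by (job_id, task_id) also makes re-runs safe: an idempotent task that executes twice simply overwrites the same object.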

Retry Policy and Non-Retryable Errors

Transient failures (network timeouts, temporary resource exhaustion) should be retried; permanent failures (invalid input, schema violations) should not. The retry policy specifies a maximum attempt count and backoff strategy (exponential with jitter: delay = min(2^attempts * base_delay + random_jitter, max_delay)). Workers classify errors as retryable or non-retryable and record the classification in the task's failure metadata. Non-retryable errors immediately transition to FAILED without consuming remaining retry budget.
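The backoff formula above translates directly into code — a sketch with illustrative parameter values:

```python
import random

BASE_DELAY = 1.0    # seconds
MAX_DELAY = 300.0   # cap so backoff never exceeds five minutes
MAX_JITTER = 1.0    # uniform jitter decorrelates simultaneous retries

def retry_delay(attempts: int) -> float:
    """delay = min(2^attempts * base_delay + random_jitter, max_delay)."""
    jitter = random.uniform(0, MAX_JITTER)
    return min(2 ** attempts * BASE_DELAY + jitter, MAX_DELAY)
```

The jitter matters when many tasks fail at once (e.g., a downstream service outage): without it, all retries land in the same instant and re-create the overload.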

Checkpoint and Partial Re-execution

When a long-running job fails partway through, re-running the entire DAG from scratch wastes compute and time. Checkpointing enables partial re-execution: on re-run, tasks that already SUCCEEDED are skipped. The scheduler only enqueues tasks that are PENDING or FAILED, treating already-succeeded tasks as already-completed dependencies. This requires that tasks be idempotent — running a task twice with the same input produces the same output and has no additional side effects. Idempotency should be a design requirement for any task in a coordinator-managed DAG.

Partial re-execution is especially valuable for data pipeline jobs where early-stage tasks (data ingestion, parsing) succeed but a late-stage transformation fails. Re-running only the failed subgraph takes seconds; re-running from scratch might take hours.
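Preparing a re-run reduces to resetting only the failed tasks; the scheduler's normal readiness check then re-enqueues exactly the failed subgraph, since SUCCEEDED upstream tasks already satisfy their dependents. A sketch, with statuses following the state machine above:

```python
def prepare_rerun(tasks: dict[str, dict]) -> list[str]:
    """Reset FAILED tasks to PENDING so the scheduler re-enqueues them.

    SUCCEEDED tasks are left untouched and satisfy downstream deps as-is,
    which is only safe because every task is required to be idempotent.
    Returns the task ids that will re-execute.
    """
    rerun = []
    for task_id, row in tasks.items():
        if row["status"] == "FAILED":
            row["status"] = "PENDING"
            row["attempts"] = 0   # fresh retry budget for the new run
            rerun.append(task_id)
    return rerun
```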

