What an Orchestrator Does
A data pipeline orchestrator schedules and executes directed acyclic graphs (DAGs) of computational tasks. Each DAG represents a workflow: extract data from a source, transform it, load it to a destination. The orchestrator handles dependency resolution, retry logic, backfill of historical data, and alerting when pipelines miss their SLAs. Apache Airflow is the canonical reference implementation.
DAG Definition
Pipelines are defined as code (Python) or config (YAML), specifying:
- Tasks: units of work (SQL query, Python function, Spark job, HTTP call)
- Dependencies: directed edges — task C runs only after A and B succeed
- Schedule: cron expression defining when the DAG runs (`0 6 * * *` = daily at 6 AM)
- Start date: the earliest logical date for which a run should be created
DAG code is parsed by the scheduler process periodically. Changes take effect on the next parse cycle without restart.
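The structure above can be sketched as a plain-Python data model, independent of any specific orchestrator's API. All names here (`daily_events_etl`, the task names) are illustrative, and the acyclicity check is a standard Kahn's-algorithm topological traversal:

```python
# A toy DAG definition: tasks plus directed edges, as an orchestrator
# would parse them from a pipeline file. All names are illustrative.
dag = {
    "dag_id": "daily_events_etl",
    "schedule": "0 6 * * *",          # cron: daily at 6 AM
    "start_date": "2024-01-01",       # earliest logical date
    "tasks": ["extract", "transform", "load"],
    # edges: (upstream, downstream); downstream runs only after upstream succeeds
    "edges": [("extract", "transform"), ("transform", "load")],
}

def validate_acyclic(dag):
    """Return True if the task graph has no cycles (i.e., is a valid DAG)."""
    downstream = {t: [] for t in dag["tasks"]}
    indegree = {t: 0 for t in dag["tasks"]}
    for up, down in dag["edges"]:
        downstream[up].append(down)
        indegree[down] += 1
    # Kahn's algorithm: repeatedly remove tasks with no unvisited upstreams.
    ready = [t for t in dag["tasks"] if indegree[t] == 0]
    visited = 0
    while ready:
        task = ready.pop()
        visited += 1
        for d in downstream[task]:
            indegree[d] -= 1
            if indegree[d] == 0:
                ready.append(d)
    return visited == len(dag["tasks"])  # a cycle leaves tasks unvisited

print(validate_acyclic(dag))  # → True
```

A real orchestrator runs the same kind of check at parse time and rejects a pipeline file whose edges form a cycle.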
Time Partitioning and Logical Date
Each DAG run is associated with a logical date (also called execution date or data interval start). A daily DAG scheduled at `0 6 * * *` with logical date 2024-01-15 processes data for January 15th. The logical date is available to tasks as a template variable `{{ ds }}`:

```sql
SELECT * FROM events
WHERE date = '{{ ds }}'
```
This parameterization makes every DAG run idempotent — the same logical date always processes the same data partition.
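At its core this is simple template substitution; `render` below is a hypothetical helper, not a real orchestrator API (Airflow uses Jinja templating for this):

```python
def render(sql_template: str, logical_date: str) -> str:
    """Substitute the {{ ds }} template variable with the run's logical date."""
    return sql_template.replace("{{ ds }}", logical_date)

query = "SELECT * FROM events WHERE date = '{{ ds }}'"

# Re-running the same logical date always yields the identical query,
# so a run can be safely repeated against its data partition.
print(render(query, "2024-01-15"))
# → SELECT * FROM events WHERE date = '2024-01-15'
```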
Scheduler Loop
The scheduler runs a tight loop:
- Find DAGs with `next_run_time <= now` and no active run blocking them
- Create a `DagRun` record for the new logical date
- For each task in the DAG, check if all upstream dependencies are in `success` state
- Queue eligible tasks to the executor
- Update task states as the executor reports completion
The scheduler is single-process (in Airflow's classic design) to avoid concurrent DagRun creation conflicts. HA schedulers use DB-level locking per DAG to allow multiple scheduler processes.
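The per-task eligibility check in the loop above can be sketched as: a task is queued once every one of its upstreams has succeeded. This is a toy model, not actual scheduler code:

```python
def find_ready_tasks(edges, states):
    """Return tasks whose upstream dependencies have all succeeded.

    edges:  list of (upstream, downstream) pairs
    states: dict task -> state ('none', 'queued', 'running', 'success', 'failed')
    """
    upstreams = {}
    for up, down in edges:
        upstreams.setdefault(down, []).append(up)
    ready = []
    for task, state in states.items():
        if state != "none":
            continue  # already scheduled, running, or finished
        if all(states[u] == "success" for u in upstreams.get(task, [])):
            ready.append(task)
    return ready

edges = [("extract", "transform"), ("transform", "load")]
states = {"extract": "success", "transform": "none", "load": "none"}
print(find_ready_tasks(edges, states))  # → ['transform']
```

The scheduler re-runs this evaluation each loop iteration, so `load` becomes eligible on the pass after `transform` reports success.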
Task State Machine
none → scheduled → queued → running → success
↘ failed
↘ upstream_failed
upstream_failed propagates automatically: if task A fails, all tasks that depend on A are marked upstream_failed without being executed. This prevents partial pipeline runs from silently producing incomplete output.
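The propagation is a forward graph walk from each failed task. A minimal sketch, assuming the same (upstream, downstream) edge representation as above:

```python
def propagate_upstream_failed(edges, states):
    """Mark every transitive downstream of a failed task as upstream_failed."""
    downstream = {}
    for up, down in edges:
        downstream.setdefault(up, []).append(down)
    # Start from directly failed tasks and walk the graph forward.
    stack = [t for t, s in states.items() if s == "failed"]
    while stack:
        task = stack.pop()
        for d in downstream.get(task, []):
            if states[d] not in ("failed", "upstream_failed"):
                states[d] = "upstream_failed"
                stack.append(d)
    return states

edges = [("extract", "transform"), ("transform", "load")]
states = {"extract": "failed", "transform": "none", "load": "none"}
print(propagate_upstream_failed(edges, states))
# → {'extract': 'failed', 'transform': 'upstream_failed', 'load': 'upstream_failed'}
```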
Backfill
When a new DAG is added or a pipeline is fixed after a bug, historical data must be reprocessed. Backfill creates DagRun instances for each missing logical date between start_date and today:
- Runs are created for every missing interval (daily DAG missing 30 days → 30 runs)
- `max_active_runs` limits concurrent execution (e.g., 3 at a time) to avoid overloading the database or downstream systems
- Runs are processed in order — oldest first — to maintain data dependencies
Because tasks use logical date parameterization, backfill is safe to re-run multiple times.
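Enumerating the missing intervals can be sketched as follows, assuming a daily schedule; `missing_logical_dates` is a hypothetical helper, not an orchestrator API:

```python
from datetime import date, timedelta

def missing_logical_dates(start, end, existing):
    """List logical dates in [start, end] with no DagRun yet, oldest first."""
    out = []
    d = start
    while d <= end:
        if d not in existing:
            out.append(d)
        d += timedelta(days=1)
    return out

existing = {date(2024, 1, 2)}  # one run already succeeded
todo = missing_logical_dates(date(2024, 1, 1), date(2024, 1, 4), existing)
print(todo)  # oldest first: Jan 1, 3, 4

# max_active_runs caps how many of these execute concurrently;
# the rest stay queued until a slot frees up.
max_active_runs = 3
first_batch = todo[:max_active_runs]
```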
Dependency Types
- Task-level: task C runs after tasks A and B in the same DAGRun
- DAG-level: `daily_summary` DAG waits for `hourly_etl` DAG to complete for the same data interval using `ExternalTaskSensor`
- Dataset-based: DAG is triggered when a dataset (logical data asset) is updated by another DAG, decoupling schedule from data availability
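The dataset-based model amounts to a registry mapping each dataset to its consumer DAGs; when a producer reports an update, every consumer is triggered. A toy sketch with hypothetical names:

```python
# Toy registry: dataset URI -> DAGs that consume it. Names are illustrative.
consumers = {"s3://warehouse/events": ["daily_summary", "anomaly_check"]}

def on_dataset_updated(dataset, trigger):
    """When a producer DAG updates a dataset, trigger every consumer DAG."""
    triggered = []
    for dag_id in consumers.get(dataset, []):
        trigger(dag_id)
        triggered.append(dag_id)
    return triggered

runs = []
print(on_dataset_updated("s3://warehouse/events", runs.append))
# → ['daily_summary', 'anomaly_check']
```

The consumers never needed a cron schedule: they run exactly when fresh data exists.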
Failure Recovery
Options when a task fails:
- Automatic retry: Configure `retries=3` with `retry_delay=timedelta(minutes=5)`. Exponential backoff optional.
- Mark success: Manually mark a failed task as succeeded to unblock downstream tasks. Use when the failure is known benign or the output was produced by other means.
- Clear and re-run: Reset task state to `none` to re-run from that point forward. Downstream tasks are also cleared and re-run.
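The retry schedule implied by those settings can be worked out as a quick sketch (`retry_delays` is a hypothetical helper, shown only to illustrate the arithmetic):

```python
def retry_delays(retries, base_delay_s=300, exponential=True):
    """Seconds to wait before each retry attempt; base 300 s = 5 minutes."""
    return [base_delay_s * (2 ** i if exponential else 1) for i in range(retries)]

# retries=3 with a 5-minute retry_delay:
print(retry_delays(3))                     # → [300, 600, 1200]
print(retry_delays(3, exponential=False))  # → [300, 300, 300]
```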
SLA Monitoring
Each task can define an SLA: expected completion time from the DAG's logical date. If daily_report should complete by 8 AM but is still running at 8:05 AM, the SLA miss triggers:
- Email alert to the pipeline owner
- Slack notification via callback
- SLA miss recorded in the metadata DB for reporting
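The check itself is a deadline comparison against the logical date. A minimal sketch (function name and signature are hypothetical):

```python
from datetime import datetime, timedelta

def sla_missed(logical_date, sla, now, finished_at=None):
    """True if the task did not finish within `sla` of its logical date."""
    deadline = logical_date + sla
    if finished_at is not None:
        return finished_at > deadline  # finished, but late
    return now > deadline              # still running past the deadline

logical = datetime(2024, 1, 15, 6, 0)  # the 6 AM run
sla = timedelta(hours=2)               # daily_report must finish by 8 AM
print(sla_missed(logical, sla, now=datetime(2024, 1, 15, 8, 5)))  # → True
```

An SLA checker process evaluates this periodically for every task with an SLA defined, firing the configured callbacks on the first miss.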
Execution Pools and XCom
Pools limit concurrent resource-intensive tasks: a db_pool with size 5 ensures at most 5 tasks touch the production database simultaneously, regardless of how many runs are active.
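A pool behaves like a counting semaphore over task slots. A toy sketch (real orchestrators persist the slot count in the metadata DB rather than in memory):

```python
class Pool:
    """Counting-semaphore sketch of an execution pool (e.g. db_pool, size 5)."""
    def __init__(self, size):
        self.size = size
        self.in_use = 0

    def acquire(self):
        if self.in_use >= self.size:
            return False  # task stays queued until a slot frees up
        self.in_use += 1
        return True

    def release(self):
        self.in_use -= 1

db_pool = Pool(size=5)
granted = [db_pool.acquire() for _ in range(7)]
print(granted)  # first 5 succeed, the last 2 must wait
```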
XCom (cross-communication) allows tasks to pass small values to downstream tasks via the metadata DB. Task A pushes a row count, task B pulls it for a data quality check. XCom is not for large data — use shared storage (S3, GCS) for that.
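Conceptually XCom is a small key-value table keyed by DAG, task, logical date, and key. A toy in-memory sketch (`xcom_push`/`xcom_pull` mirror Airflow's method names, but this dict stands in for the metadata DB):

```python
# Toy XCom store: small values keyed by (dag_id, task_id, logical date, key).
xcom = {}

def xcom_push(dag_id, task_id, ds, key, value):
    xcom[(dag_id, task_id, ds, key)] = value

def xcom_pull(dag_id, task_id, ds, key):
    return xcom[(dag_id, task_id, ds, key)]

# Task A pushes a row count; task B pulls it for a data-quality check.
xcom_push("etl", "load", "2024-01-15", "row_count", 10432)
assert xcom_pull("etl", "load", "2024-01-15", "row_count") > 0
```

Keying by logical date keeps values from different runs isolated, which preserves the idempotency of backfills.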
Executor Types
- LocalExecutor: Tasks run as subprocesses on the scheduler host. Simple, no external dependencies, limited scale.
- CeleryExecutor: Tasks queued to Celery workers (Redis or RabbitMQ broker). Horizontally scalable worker fleet.
- KubernetesExecutor: Each task runs in its own Kubernetes pod. Complete resource isolation, auto-scaling, no idle workers.
See also: Databricks Interview Guide 2026: Spark Internals, Delta Lake, and Lakehouse Architecture
See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering