What an Orchestrator Does
A data pipeline orchestrator schedules and executes directed acyclic graphs (DAGs) of computational tasks. Each DAG represents a workflow: extract data from a source, transform it, load it to a destination. The orchestrator handles dependency resolution, retry logic, backfill of historical data, and alerting when pipelines miss their SLAs. Apache Airflow is the canonical reference implementation.
DAG Definition
Pipelines are defined as code (Python) or config (YAML), specifying:
- Tasks: units of work (SQL query, Python function, Spark job, HTTP call)
- Dependencies: directed edges — task C runs only after A and B succeed
- Schedule: cron expression defining when the DAG runs (`0 6 * * *` = daily at 6 AM)
- Start date: the earliest logical date for which a run should be created
DAG code is parsed by the scheduler process periodically. Changes take effect on the next parse cycle without restart.
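The structure above can be sketched as a plain-Python data model, independent of any specific orchestrator's API. All names here (`daily_events_etl`, the task names) are illustrative, and the acyclicity check is a standard Kahn's-algorithm topological traversal:

```python
# A toy DAG definition: tasks plus directed edges, as an orchestrator
# would parse them from a pipeline file. All names are illustrative.
dag = {
    "dag_id": "daily_events_etl",
    "schedule": "0 6 * * *",          # cron: daily at 6 AM
    "start_date": "2024-01-01",       # earliest logical date
    "tasks": ["extract", "transform", "load"],
    # edges: (upstream, downstream); downstream runs only after upstream succeeds
    "edges": [("extract", "transform"), ("transform", "load")],
}

def validate_acyclic(dag):
    """Return True if the task graph has no cycles (i.e., is a valid DAG)."""
    downstream = {t: [] for t in dag["tasks"]}
    indegree = {t: 0 for t in dag["tasks"]}
    for up, down in dag["edges"]:
        downstream[up].append(down)
        indegree[down] += 1
    # Kahn's algorithm: repeatedly remove tasks with no unvisited upstreams.
    ready = [t for t in dag["tasks"] if indegree[t] == 0]
    visited = 0
    while ready:
        task = ready.pop()
        visited += 1
        for d in downstream[task]:
            indegree[d] -= 1
            if indegree[d] == 0:
                ready.append(d)
    return visited == len(dag["tasks"])  # a cycle leaves tasks unvisited

print(validate_acyclic(dag))  # → True
```

A real orchestrator runs the same kind of check at parse time and rejects a pipeline file whose edges form a cycle.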
Time Partitioning and Logical Date
Each DAG run is associated with a logical date (also called execution date or data interval start). A daily DAG scheduled at `0 6 * * *` with logical date 2024-01-15 processes data for January 15th. The logical date is available to tasks as a template variable `{{ ds }}`:

```sql
SELECT * FROM events
WHERE date = '{{ ds }}'
```
This parameterization makes every DAG run idempotent — the same logical date always processes the same data partition.
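At its core this is simple template substitution; `render` below is a hypothetical helper, not a real orchestrator API (Airflow uses Jinja templating for this):

```python
def render(sql_template: str, logical_date: str) -> str:
    """Substitute the {{ ds }} template variable with the run's logical date."""
    return sql_template.replace("{{ ds }}", logical_date)

query = "SELECT * FROM events WHERE date = '{{ ds }}'"

# Re-running the same logical date always yields the identical query,
# so a run can be safely repeated against its data partition.
print(render(query, "2024-01-15"))
# → SELECT * FROM events WHERE date = '2024-01-15'
```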
Scheduler Loop
The scheduler runs a tight loop:
- Find DAGs with `next_run_time <= now` and no active run blocking them
- Create a `DagRun` record for the new logical date
- For each task in the DAG, check if all upstream dependencies are in `success` state
- Queue eligible tasks to the executor
- Update task states as the executor reports completion
The scheduler is single-process (in Airflow's classic design) to avoid concurrent DagRun creation conflicts. HA schedulers use DB-level locking per DAG to allow multiple scheduler processes.
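The per-task eligibility check in the loop above can be sketched as: a task is queued once every one of its upstreams has succeeded. This is a toy model, not actual scheduler code:

```python
def find_ready_tasks(edges, states):
    """Return tasks whose upstream dependencies have all succeeded.

    edges:  list of (upstream, downstream) pairs
    states: dict task -> state ('none', 'queued', 'running', 'success', 'failed')
    """
    upstreams = {}
    for up, down in edges:
        upstreams.setdefault(down, []).append(up)
    ready = []
    for task, state in states.items():
        if state != "none":
            continue  # already scheduled, running, or finished
        if all(states[u] == "success" for u in upstreams.get(task, [])):
            ready.append(task)
    return ready

edges = [("extract", "transform"), ("transform", "load")]
states = {"extract": "success", "transform": "none", "load": "none"}
print(find_ready_tasks(edges, states))  # → ['transform']
```

The scheduler re-runs this evaluation each loop iteration, so `load` becomes eligible on the pass after `transform` reports success.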
Task State Machine
none → scheduled → queued → running → success
↘ failed
↘ upstream_failed
upstream_failed propagates automatically: if task A fails, all tasks that depend on A are marked upstream_failed without being executed. This prevents partial pipeline runs from silently producing incomplete output.
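The propagation is a forward graph walk from each failed task. A minimal sketch, assuming the same (upstream, downstream) edge representation as above:

```python
def propagate_upstream_failed(edges, states):
    """Mark every transitive downstream of a failed task as upstream_failed."""
    downstream = {}
    for up, down in edges:
        downstream.setdefault(up, []).append(down)
    # Start from directly failed tasks and walk the graph forward.
    stack = [t for t, s in states.items() if s == "failed"]
    while stack:
        task = stack.pop()
        for d in downstream.get(task, []):
            if states[d] not in ("failed", "upstream_failed"):
                states[d] = "upstream_failed"
                stack.append(d)
    return states

edges = [("extract", "transform"), ("transform", "load")]
states = {"extract": "failed", "transform": "none", "load": "none"}
print(propagate_upstream_failed(edges, states))
# → {'extract': 'failed', 'transform': 'upstream_failed', 'load': 'upstream_failed'}
```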
Backfill
When a new DAG is added or a pipeline is fixed after a bug, historical data must be reprocessed. Backfill creates DagRun instances for each missing logical date between start_date and today:
- Runs are created for every missing interval (daily DAG missing 30 days → 30 runs)
- `max_active_runs` limits concurrent execution (e.g., 3 at a time) to avoid overloading the database or downstream systems
- Runs are processed in order — oldest first — to maintain data dependencies
Because tasks use logical date parameterization, backfill is safe to re-run multiple times.
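Enumerating the missing intervals can be sketched as follows, assuming a daily schedule; `missing_logical_dates` is a hypothetical helper, not an orchestrator API:

```python
from datetime import date, timedelta

def missing_logical_dates(start, end, existing):
    """List logical dates in [start, end] with no DagRun yet, oldest first."""
    out = []
    d = start
    while d <= end:
        if d not in existing:
            out.append(d)
        d += timedelta(days=1)
    return out

existing = {date(2024, 1, 2)}  # one run already succeeded
todo = missing_logical_dates(date(2024, 1, 1), date(2024, 1, 4), existing)
print(todo)  # oldest first: Jan 1, 3, 4

# max_active_runs caps how many of these execute concurrently;
# the rest stay queued until a slot frees up.
max_active_runs = 3
first_batch = todo[:max_active_runs]
```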
Dependency Types
- Task-level: task C runs after tasks A and B in the same DAGRun
- DAG-level: `daily_summary` DAG waits for `hourly_etl` DAG to complete for the same data interval using `ExternalTaskSensor`
- Dataset-based: DAG is triggered when a dataset (logical data asset) is updated by another DAG, decoupling schedule from data availability
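The dataset-based model amounts to a registry mapping each dataset to its consumer DAGs; when a producer reports an update, every consumer is triggered. A toy sketch with hypothetical names:

```python
# Toy registry: dataset URI -> DAGs that consume it. Names are illustrative.
consumers = {"s3://warehouse/events": ["daily_summary", "anomaly_check"]}

def on_dataset_updated(dataset, trigger):
    """When a producer DAG updates a dataset, trigger every consumer DAG."""
    triggered = []
    for dag_id in consumers.get(dataset, []):
        trigger(dag_id)
        triggered.append(dag_id)
    return triggered

runs = []
print(on_dataset_updated("s3://warehouse/events", runs.append))
# → ['daily_summary', 'anomaly_check']
```

The consumers never needed a cron schedule: they run exactly when fresh data exists.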
Failure Recovery
Options when a task fails:
- Automatic retry: Configure `retries=3` with `retry_delay=timedelta(minutes=5)`. Exponential backoff optional.
- Mark success: Manually mark a failed task as succeeded to unblock downstream tasks. Use when the failure is known benign or the output was produced by other means.
- Clear and re-run: Reset task state to `none` to re-run from that point forward. Downstream tasks are also cleared and re-run.
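The retry schedule implied by those settings can be worked out as a quick sketch (`retry_delays` is a hypothetical helper, shown only to illustrate the arithmetic):

```python
def retry_delays(retries, base_delay_s=300, exponential=True):
    """Seconds to wait before each retry attempt; base 300 s = 5 minutes."""
    return [base_delay_s * (2 ** i if exponential else 1) for i in range(retries)]

# retries=3 with a 5-minute retry_delay:
print(retry_delays(3))                     # → [300, 600, 1200]
print(retry_delays(3, exponential=False))  # → [300, 300, 300]
```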
SLA Monitoring
Each task can define an SLA: expected completion time from the DAG's logical date. If daily_report should complete by 8 AM but is still running at 8:05 AM, the SLA miss triggers:
- Email alert to the pipeline owner
- Slack notification via callback
- SLA miss recorded in the metadata DB for reporting
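The check itself is a deadline comparison against the logical date. A minimal sketch (function name and signature are hypothetical):

```python
from datetime import datetime, timedelta

def sla_missed(logical_date, sla, now, finished_at=None):
    """True if the task did not finish within `sla` of its logical date."""
    deadline = logical_date + sla
    if finished_at is not None:
        return finished_at > deadline  # finished, but late
    return now > deadline              # still running past the deadline

logical = datetime(2024, 1, 15, 6, 0)  # the 6 AM run
sla = timedelta(hours=2)               # daily_report must finish by 8 AM
print(sla_missed(logical, sla, now=datetime(2024, 1, 15, 8, 5)))  # → True
```

An SLA checker process evaluates this periodically for every task with an SLA defined, firing the configured callbacks on the first miss.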
Execution Pools and XCom
Pools limit concurrent resource-intensive tasks: a db_pool with size 5 ensures at most 5 tasks touch the production database simultaneously, regardless of how many runs are active.
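A pool behaves like a counting semaphore over task slots. A toy sketch (real orchestrators persist the slot count in the metadata DB rather than in memory):

```python
class Pool:
    """Counting-semaphore sketch of an execution pool (e.g. db_pool, size 5)."""
    def __init__(self, size):
        self.size = size
        self.in_use = 0

    def acquire(self):
        if self.in_use >= self.size:
            return False  # task stays queued until a slot frees up
        self.in_use += 1
        return True

    def release(self):
        self.in_use -= 1

db_pool = Pool(size=5)
granted = [db_pool.acquire() for _ in range(7)]
print(granted)  # first 5 succeed, the last 2 must wait
```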
XCom (cross-communication) allows tasks to pass small values to downstream tasks via the metadata DB. Task A pushes a row count, task B pulls it for a data quality check. XCom is not for large data — use shared storage (S3, GCS) for that.
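Conceptually XCom is a small key-value table keyed by DAG, task, logical date, and key. A toy in-memory sketch (`xcom_push`/`xcom_pull` mirror Airflow's method names, but this dict stands in for the metadata DB):

```python
# Toy XCom store: small values keyed by (dag_id, task_id, logical date, key).
xcom = {}

def xcom_push(dag_id, task_id, ds, key, value):
    xcom[(dag_id, task_id, ds, key)] = value

def xcom_pull(dag_id, task_id, ds, key):
    return xcom[(dag_id, task_id, ds, key)]

# Task A pushes a row count; task B pulls it for a data-quality check.
xcom_push("etl", "load", "2024-01-15", "row_count", 10432)
assert xcom_pull("etl", "load", "2024-01-15", "row_count") > 0
```

Keying by logical date keeps values from different runs isolated, which preserves the idempotency of backfills.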
Executor Types
- LocalExecutor: Tasks run as subprocesses on the scheduler host. Simple, no external dependencies, limited scale.
- CeleryExecutor: Tasks queued to Celery workers (Redis or RabbitMQ broker). Horizontally scalable worker fleet.
- KubernetesExecutor: Each task runs in its own Kubernetes pod. Complete resource isolation, auto-scaling, no idle workers.
See also: Databricks Interview Guide 2026: Spark Internals, Delta Lake, and Lakehouse Architecture
See also: Scale AI Interview Guide 2026: Data Infrastructure, RLHF Pipelines, and ML Engineering