What Is a Data Pipeline?
A data pipeline moves and transforms data from source systems (databases, APIs, event streams) to destination systems (data warehouses, data lakes, ML feature stores). ETL (Extract, Transform, Load) and ELT (Extract, Load, Transform) are the two dominant patterns. ETL transforms data before loading — used when transformation is complex or the destination has limited compute. ELT loads raw data first and transforms in the warehouse — preferred in modern cloud architectures where compute (BigQuery, Snowflake) is cheap and raw data is valuable for re-processing. This question appears at data-heavy companies: Databricks, Snowflake, Palantir, Airbnb, and any company with significant analytics or ML infrastructure.
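The ETL/ELT distinction can be made concrete with a small sketch. This uses sqlite3 as a stand-in warehouse; the table names and sample rows are illustrative, not from any real system:

```python
import sqlite3

# Raw source rows: (order_id, amount_cents) — hypothetical sample data.
source_rows = [(1, 1000), (2, 2500), (3, 400)]

# --- ETL: transform in the pipeline, then load only the finished result ---
etl_db = sqlite3.connect(":memory:")
etl_db.execute("CREATE TABLE revenue (order_id INTEGER, amount_dollars REAL)")
transformed = [(oid, cents / 100.0) for oid, cents in source_rows]  # transform first
etl_db.executemany("INSERT INTO revenue VALUES (?, ?)", transformed)  # then load

# --- ELT: load raw data first, transform with SQL inside the warehouse ---
elt_db = sqlite3.connect(":memory:")
elt_db.execute("CREATE TABLE raw_orders (order_id INTEGER, amount_cents INTEGER)")
elt_db.executemany("INSERT INTO raw_orders VALUES (?, ?)", source_rows)  # load raw
elt_db.execute(
    "CREATE TABLE revenue AS "
    "SELECT order_id, amount_cents / 100.0 AS amount_dollars FROM raw_orders"
)  # transform in-warehouse; raw_orders stays available for re-processing

etl_total = etl_db.execute("SELECT SUM(amount_dollars) FROM revenue").fetchone()[0]
elt_total = elt_db.execute("SELECT SUM(amount_dollars) FROM revenue").fetchone()[0]
```

Both paths produce the same `revenue` table; the ELT path additionally retains `raw_orders`, which is why it is favored when warehouse compute is cheap.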
Batch vs Streaming
Batch pipelines process bounded datasets on a schedule — daily revenue aggregations, weekly user cohort analysis. Advantages: simple to reason about, easy to reprocess historical data, mature tooling (Spark, Hive). Latency: minutes to hours. Streaming pipelines process unbounded data continuously — real-time fraud detection, live dashboard metrics. Advantages: low latency (seconds to milliseconds), enables real-time products. Complexity: harder to handle late-arriving data, exactly-once semantics, and backpressure. Most production systems use both: a batch layer for historical accuracy and a streaming layer for real-time freshness (Lambda architecture), or the Kappa architecture (streaming-only, with replayable Kafka topics as the historical record).
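The bounded-vs-unbounded distinction can be sketched in a few lines of plain Python — a batch job computes one answer after the window closes, while a streaming job emits a running answer per event (event shapes and names here are illustrative):

```python
# Events are (timestamp, amount) tuples — hypothetical sample data.
events = [(1, 10.0), (2, 5.0), (3, 7.5), (4, 2.5)]

# Batch: wait until the window closes, then process the bounded dataset once.
def batch_total(bounded_events):
    return sum(amount for _, amount in bounded_events)

# Streaming: update state incrementally as each unbounded event arrives.
def stream_totals(event_iter):
    running = 0.0
    for _, amount in event_iter:
        running += amount
        yield running  # low-latency partial result after every event

batch_result = batch_total(events)            # one answer, after the window
stream_results = list(stream_totals(events))  # one answer per event
```

The streaming version's per-event state is exactly what makes real systems (Flink, Kafka Streams) harder: that state must survive restarts and tolerate out-of-order events.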
Apache Airflow: DAG Scheduling
Airflow orchestrates batch pipelines as DAGs (Directed Acyclic Graphs) of tasks. A DAG defines the dependency order — Task B runs only after Task A completes. The Airflow scheduler assigns tasks to workers based on dependencies, retries failed tasks with configurable backoff, and tracks task state (queued, running, success, failed, skipped). Key concepts: (1) Operators: pre-built task types — BashOperator (run a shell script), PythonOperator (call a Python function), SparkSubmitOperator (submit a Spark job), S3ToRedshiftOperator (load S3 data). (2) Sensors: wait for an external condition — S3KeySensor waits for a file to arrive. (3) XCom: lightweight key-value store for passing small data between tasks. Large data should be passed via S3, not XCom. (4) Connections and variables: credentials and config stored in Airflow, injected into tasks at runtime.
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
from datetime import datetime, timedelta

with DAG(
    dag_id="daily_revenue_pipeline",
    schedule_interval="0 3 * * *",  # 3 AM daily
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    extract = PythonOperator(
        task_id="extract_from_postgres",
        python_callable=extract_revenue_data,  # user-defined extraction function (not shown)
    )
    transform = SparkSubmitOperator(
        task_id="transform_with_spark",
        application="s3://bucket/jobs/revenue_transform.py",
        conn_id="spark_default",
    )
    load = PythonOperator(
        task_id="load_to_snowflake",
        python_callable=load_to_warehouse,  # user-defined load function (not shown)
    )

    extract >> transform >> load  # dependency chain
Fault Tolerance and Idempotency
Pipeline failures are inevitable — network timeouts, OOM in Spark, source database downtime. Design for idempotent tasks: running a task twice produces the same result as running it once. Idempotency patterns: (1) Partition by date: write output to a date-partitioned path (s3://bucket/revenue/date=2024-01-15/). On retry, overwrite the same partition — no duplicates. (2) MERGE / UPSERT: in the destination database, use MERGE INTO that updates existing rows and inserts new ones based on a business key. (3) Watermarking: track the last processed record ID or timestamp. On retry, start from the watermark — skip already-processed records. Checkpointing: Spark Structured Streaming writes checkpoints to S3 after each micro-batch, enabling recovery without reprocessing already-committed data.
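Two of these patterns — partition overwrite and watermarking — can be sketched with an in-memory stand-in for the warehouse (the `warehouse` dict, dates, and record shapes are illustrative):

```python
warehouse = {}            # partition key (date) -> list of rows
watermark = {"last_id": 0}

def run_daily_job(date, rows):
    """Partition overwrite: a retry replaces the partition, never appends to it."""
    warehouse[date] = rows  # overwrite, so re-running cannot create duplicates

def run_incremental_job(source_rows):
    """Watermark: only process records newer than the last committed ID."""
    new_rows = [r for r in source_rows if r["id"] > watermark["last_id"]]
    for r in new_rows:
        warehouse.setdefault("incremental", []).append(r)
    if new_rows:
        watermark["last_id"] = max(r["id"] for r in new_rows)
    return len(new_rows)

run_daily_job("2024-01-15", [{"id": 1}, {"id": 2}])
run_daily_job("2024-01-15", [{"id": 1}, {"id": 2}])  # retry: same result, no dupes

first = run_incremental_job([{"id": 1}, {"id": 2}])   # processes both records
second = run_incremental_job([{"id": 1}, {"id": 2}])  # retry: watermark skips both
```

Either way, running the task twice leaves the warehouse in the same state as running it once — the definition of idempotency.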
Data Quality
Data quality checks should run as pipeline tasks, not afterthoughts. A Great Expectations or dbt-test check after each transformation verifies: row count (today has at least 90% of yesterday's rows), null rate (critical fields have less than 1% nulls), referential integrity (all foreign keys exist in the dimension table), value ranges (revenue is non-negative), uniqueness (no duplicate transaction IDs). If a check fails, the task fails and downstream tasks do not execute — preventing bad data from reaching the warehouse. Anomaly detection: compare today's metrics to a rolling 7-day average; flag deviations beyond 3 standard deviations. Quality metrics are tracked over time to identify trends (slowly worsening data quality from a source system).
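The checks above can be sketched as a plain-Python pipeline task — this is not the Great Expectations API, and the thresholds and field names (`revenue`, `txn_id`) are illustrative:

```python
def run_quality_checks(today_rows, yesterday_count):
    failures = []
    # Row count: today should have at least 90% of yesterday's rows.
    if len(today_rows) < 0.9 * yesterday_count:
        failures.append("row_count")
    # Null rate: the critical field must be <1% null.
    nulls = sum(1 for r in today_rows if r.get("revenue") is None)
    if today_rows and nulls / len(today_rows) >= 0.01:
        failures.append("null_rate")
    # Value range: revenue is non-negative.
    if any(r["revenue"] < 0 for r in today_rows if r.get("revenue") is not None):
        failures.append("value_range")
    # Uniqueness: no duplicate transaction IDs.
    ids = [r["txn_id"] for r in today_rows]
    if len(ids) != len(set(ids)):
        failures.append("uniqueness")
    if failures:
        # Raising fails the task, which halts downstream tasks.
        raise ValueError(f"quality checks failed: {failures}")
    return True

good = [{"txn_id": 1, "revenue": 10.0}, {"txn_id": 2, "revenue": 5.0}]
ok = run_quality_checks(good, yesterday_count=2)

bad = [{"txn_id": 1, "revenue": -1.0}, {"txn_id": 1, "revenue": 2.0}]
try:
    run_quality_checks(bad, yesterday_count=2)
    blocked = False
except ValueError:
    blocked = True  # negative revenue and duplicate ID both caught
```

Raising on failure is the key design choice: in an orchestrator, an exception marks the task failed, so bad data never reaches downstream loads.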
Schema Evolution
Source schemas change over time — new columns added, columns renamed, types changed. Strategies: (1) Schema registry: use Confluent Schema Registry with Avro or Protobuf for Kafka-based pipelines. The registry enforces compatibility rules (backward, forward, or full compatibility) so schema changes do not break consumers. (2) Schema-on-read: store raw data in Parquet or JSON on S3 with no enforced schema. Downstream processing reads and validates what it needs — tolerates additive changes. (3) Column mapping: maintain a mapping between source column names and destination column names. Renaming a source column is a config change, not a pipeline change. (4) Versioned schemas: run multiple pipeline versions in parallel when a breaking change occurs — old version serves existing consumers, new version serves migrated consumers. Decommission the old version once all consumers migrate.
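Strategies (2) and (3) can be combined in a small schema-on-read sketch: the pipeline declares the columns it needs and where to find them, so a source rename becomes a one-line config change and additive columns are simply ignored (column names here are hypothetical):

```python
# Destination column -> source column. A source rename only touches this map.
COLUMN_MAP = {
    "order_id": "order_id",
    "amount": "amount_usd",  # source renamed "amount" -> "amount_usd"
}

def read_record(raw):
    """Schema-on-read: pull only the mapped columns, tolerate extra fields."""
    out = {}
    for dest, src in COLUMN_MAP.items():
        if src not in raw:
            raise KeyError(f"missing required source column: {src}")
        out[dest] = raw[src]
    return out

raw = {"order_id": 7, "amount_usd": 19.99, "new_experimental_field": "x"}
record = read_record(raw)  # rename is handled, additive column is ignored
```

A missing required column still fails loudly, which is what you want: additive changes pass through, breaking changes surface immediately instead of silently producing nulls.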
Interview Tips
- Batch vs streaming: clarify the latency requirement before designing
- Idempotency is critical — explain partition overwrite or MERGE patterns
- Airflow DAG is the expected orchestration answer for batch pipelines
- Data quality as pipeline tasks (not post-hoc) shows production maturity
- Schema evolution: mention schema registry for streaming, schema-on-read for batch
Frequently Asked Questions
What is the difference between batch processing and stream processing?
Batch processing accumulates data over a fixed time window (hourly, daily) and processes it as a single job. It is simple to implement, easy to re-run on failures, and highly efficient for large volumes — tools like Spark process terabytes by partitioning data across hundreds of nodes. The tradeoff is latency: a nightly batch job means business decisions lag 24 hours behind reality. Stream processing handles each event as it arrives, typically within seconds or milliseconds. Tools like Apache Flink and Kafka Streams maintain stateful operators in memory and process events continuously. Stream processing is complex — you must handle out-of-order events (watermarks), exactly-once semantics, and state management across restarts. Use batch for: daily reports, model training, bulk transformations. Use streaming for: fraud detection, real-time dashboards, alerting, personalization where staleness is costly.
How does Airflow schedule and execute DAG tasks reliably?
Airflow represents each pipeline as a DAG (Directed Acyclic Graph) of tasks with explicit dependencies. The scheduler continuously scans DAG definitions, computes which task instances are ready (all upstream dependencies succeeded), and pushes them to a task queue (Celery or Kubernetes). Workers pull tasks from the queue and execute them. Reliability comes from: (1) Retry policies — each task has configurable retries with exponential backoff; (2) Idempotency — tasks that re-read from source and overwrite destination can be safely retried without double-counting; (3) State tracking — every task instance state (queued, running, success, failed, skipped) is stored in a PostgreSQL or MySQL database, enabling resume from the exact failure point; (4) SLA monitoring — tasks that exceed their expected duration fire alerts before the deadline is missed. The scheduler itself is a potential single point of failure; production deployments use a high-availability scheduler with multiple instances using database row-level locks to prevent duplicate scheduling.
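The scheduler's core readiness rule — queue a task only when every upstream dependency has succeeded — reduces to a few lines. This is a toy model of the scan loop, not Airflow internals; the task names mirror the DAG above:

```python
# DAG as task -> list of upstream dependencies.
dag = {"extract": [], "transform": ["extract"], "load": ["transform"]}

def ready_tasks(dag, state):
    """Tasks that haven't run and whose upstreams have all succeeded."""
    return [
        task for task, upstreams in dag.items()
        if state.get(task) is None
        and all(state.get(u) == "success" for u in upstreams)
    ]

state = {}   # task instance states, as the metadata database would hold them
order = []
while len(order) < len(dag):
    for task in ready_tasks(dag, state):
        state[task] = "success"  # a worker would execute the task here
        order.append(task)
```

Persisting `state` in a database (as Airflow does) is what makes resume-after-failure possible: on restart, the same readiness computation picks up exactly the tasks that never succeeded.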
How do you ensure data quality in an ETL pipeline?
Data quality checks run at three stages: (1) Ingestion validation — reject records that fail schema validation (wrong types, missing required fields) at the point of arrival, before they pollute downstream tables. Dead-letter queues capture rejected records for manual review. (2) Post-transform assertions — after transformation, run statistical checks: row count within expected range (not suddenly 0 or 10x normal), no null values in non-nullable columns, referential integrity between joined tables, value distributions within historical bounds (a sudden shift in average order value signals upstream data corruption). Tools like Great Expectations define these assertions as code and run them as pipeline tasks. (3) Output validation — compare aggregated metrics against a control: if today's daily revenue total deviates from the 30-day moving average by more than 3 standard deviations, pause the load and alert. Failing any check triggers the task to fail, preventing bad data from reaching production dashboards.