A data pipeline moves data from operational systems (databases, event streams, APIs) to analytical systems (data warehouses, data lakes) where it can be queried for business intelligence, reporting, and ML training. ETL (Extract, Transform, Load) and its modern variant ELT (Extract, Load, Transform) are the core patterns. Apache Airflow, dbt, Fivetran, and AWS Glue are common pipeline tools. Understanding pipeline architecture is essential for data engineering and backend system design interviews.
ETL vs. ELT
ETL (Extract, Transform, Load): data is transformed before loading into the destination. The transformation happens in a separate compute layer (a Python script, Spark job, or custom middleware). Advantages: the destination only receives clean, validated data. Disadvantages: transformations must be defined upfront, and raw data is not retained in the destination, so it is hard to re-derive different transformations later. This was the traditional approach for data warehouses with limited storage and compute (the Teradata era).
ELT (Extract, Load, Transform): raw data is loaded into the destination first; transformations are applied in-place using the warehouse’s own compute. dbt (data build tool) executes SQL transformations inside BigQuery, Redshift, or Snowflake. Advantages: raw data is always available for re-transformation; modern cloud warehouses are cheap and powerful enough to transform in-place; transformations are version-controlled SQL models. Disadvantages: the destination stores both raw and transformed data (higher storage cost).
ELT is the modern standard for cloud data warehouses. ETL is still appropriate for streaming pipelines where data must be transformed before writing to a low-latency serving store, and for enforcing data quality before data lands in a shared data lake.
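The ELT pattern can be sketched in a few lines. Here sqlite3 stands in for a cloud warehouse, and the table and column names (raw_orders, clean_orders) are illustrative: raw data is landed as-is, then a SQL transformation runs inside the "warehouse" (the dbt pattern), leaving the raw table available for later re-transformation.

```python
import sqlite3

# ELT sketch: sqlite3 stands in for a cloud warehouse; table/column
# names are illustrative, not from any real schema.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Extract + Load: land the raw data exactly as received, bad rows included.
cur.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT)")
cur.executemany(
    "INSERT INTO raw_orders VALUES (?, ?)",
    [("o1", "19.99"), ("o2", "5.00"), ("o3", None)],  # o3 has a null amount
)

# Transform: run SQL inside the warehouse. raw_orders is retained, so a
# different transformation can be re-derived from it later.
cur.execute("""
    CREATE TABLE clean_orders AS
    SELECT order_id, CAST(amount AS REAL) AS amount
    FROM raw_orders
    WHERE amount IS NOT NULL
""")
print(cur.execute("SELECT COUNT(*) FROM clean_orders").fetchone()[0])  # 2
```

In real ELT the CREATE TABLE AS SELECT would be a dbt model, but the shape is the same: load first, transform with the destination's own SQL engine.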
Change Data Capture (CDC)
Change data capture incrementally syncs data from an operational database to a warehouse without full table scans. CDC reads the database’s replication log (MySQL binlog, PostgreSQL WAL) to capture every INSERT, UPDATE, and DELETE as an event. Debezium is the standard open-source CDC tool: it connects as a replica to the source database and streams change events to Kafka. Each change event contains: before (old row values), after (new row values), operation type (INSERT/UPDATE/DELETE), transaction ID, LSN (log sequence number). The Kafka consumer writes these events to the destination (BigQuery, Snowflake) incrementally. Advantages over full table scans: (1) No load on the source database — reads the replication log, not the main tables. (2) Sub-minute latency — changes appear in the warehouse within seconds. (3) Captures deletes — full table scans can’t detect deleted rows. CDC challenges: schema changes in the source break the CDC pipeline if not handled (Debezium supports schema evolution detection). Initial snapshot: before starting CDC, take a consistent full snapshot of the table; then apply CDC events on top.
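A minimal sketch of the consumer side, assuming Debezium-style event envelopes (op, before, after — Debezium uses op codes c/u/d for insert/update/delete) and using a plain dict keyed by primary key as a stand-in for the destination table; the rows themselves are made up:

```python
# Destination table stand-in: primary key -> current row image.
warehouse = {}

def apply_change(event):
    """Apply one CDC event: c=insert, u=update, d=delete (Debezium op codes)."""
    op = event["op"]
    if op in ("c", "u"):               # insert or update: take the "after" image
        row = event["after"]
        warehouse[row["id"]] = row
    elif op == "d":                    # delete: remove using the "before" image
        del warehouse[event["before"]["id"]]

events = [
    {"op": "c", "before": None, "after": {"id": 1, "email": "a@x.com"}},
    {"op": "u", "before": {"id": 1, "email": "a@x.com"},
                "after":  {"id": 1, "email": "b@x.com"}},
    {"op": "c", "before": None, "after": {"id": 2, "email": "c@x.com"}},
    {"op": "d", "before": {"id": 2, "email": "c@x.com"}, "after": None},
]
for e in events:
    apply_change(e)

print(warehouse)  # {1: {'id': 1, 'email': 'b@x.com'}}
```

Note that the delete is captured — exactly the case a full table scan cannot detect. A real sink would also track the LSN per event so that replays after a consumer crash stay idempotent.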
Orchestration with Apache Airflow
Complex pipelines have multiple steps with dependencies: extract from 5 sources → validate → transform → load → run dbt models → update dashboard cache. Airflow orchestrates this as a DAG (Directed Acyclic Graph) of tasks. Each node in the DAG is a task (Python function, SQL query, Spark job, Bash command); edges define dependencies (task B runs after task A completes). Airflow scheduler: runs every minute, checks which tasks are due (their dependencies are complete and their scheduled time has passed), and submits them to a task queue (Celery or Kubernetes). Workers pick up tasks and execute them. DAG features: (1) Retry logic: retry failed tasks N times with exponential backoff. (2) SLA alerts: notify if a DAG doesn’t complete within its SLA (e.g., the morning ETL must finish by 8am for the business dashboard). (3) Backfill: if a pipeline was down for 3 days, rerun the DAG for each missed date. (4) Data lineage: track which tasks consume and produce which datasets. Alternatives to Airflow: Prefect and Dagster (better developer experience and testing), and dbt (for transformation-only orchestration).
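The core scheduling idea — derive an execution order from the dependency graph, run each task with retries — can be sketched without Airflow itself, using the stdlib graphlib (Python 3.9+). The task names are the hypothetical pipeline from above, and "running" a task is just calling a function:

```python
from graphlib import TopologicalSorter

# Toy orchestrator sketch (not Airflow): each task maps to its predecessors,
# mirroring how a DAG's edges define "B runs after A completes".
dag = {
    "extract":    [],
    "validate":   ["extract"],
    "transform":  ["validate"],
    "load":       ["transform"],
    "dbt_models": ["load"],
}

def run_task(name, retries=2):
    """Run one task with simple retry logic, like Airflow's per-task retries."""
    for attempt in range(retries + 1):
        try:
            return f"{name}: ok"       # stand-in for the real work
        except Exception:
            if attempt == retries:    # out of retries: surface the failure
                raise

# The scheduler's job, in miniature: topologically sort, then execute.
order = list(TopologicalSorter(dag).static_order())
results = [run_task(t) for t in order]
print(order)  # ['extract', 'validate', 'transform', 'load', 'dbt_models']
```

A real scheduler runs independent branches in parallel (graphlib's get_ready/done API supports this) and persists task state so a crashed run can resume; this sketch only shows the ordering and retry skeleton.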
Data Quality and Validation
Pipelines that silently corrupt data are worse than broken pipelines (corrupted data is used for decisions before the problem is noticed). Data quality checks at every stage: (1) Schema validation: verify the incoming data matches the expected schema (column names, types). Alert and halt if unexpected columns appear or types change — often a signal of an upstream schema change. (2) Null checks: verify columns expected to be non-null aren’t null (order_id must never be null). (3) Row count checks: the number of rows today must be within X% of the 7-day average. A sudden 90% drop indicates a source system issue. (4) Distribution checks: verify numeric distributions haven’t shifted dramatically (revenue per transaction average shouldn’t change by 50% overnight without a known cause). Great Expectations and dbt tests are standard frameworks for data quality assertions. Data contracts: define an agreement between the producer (upstream team) and consumer (downstream pipeline) about schema, freshness, and expected volumes. Violations of the contract trigger alerts to both teams.
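Two of the checks above — null checks and row-count checks — can be sketched as plain functions; the thresholds and column names are illustrative, not from Great Expectations or dbt:

```python
# Data quality sketch: check functions raise to halt the pipeline,
# matching the "alert and halt" behavior described above.
def check_non_null(rows, column):
    """Null check: fail if any row has a null in a required column."""
    bad = [r for r in rows if r.get(column) is None]
    if bad:
        raise ValueError(f"{len(bad)} rows have null {column}")

def check_row_count(today_count, history, tolerance=0.5):
    """Row-count check: today's volume must be within tolerance of the
    trailing average (e.g. the 7-day average from the text)."""
    avg = sum(history) / len(history)
    if abs(today_count - avg) > tolerance * avg:
        raise ValueError(
            f"row count {today_count} outside {tolerance:.0%} of avg {avg:.0f}"
        )

orders = [{"order_id": "o1"}, {"order_id": "o2"}]
check_non_null(orders, "order_id")                              # passes
check_row_count(980, [1000, 990, 1010, 1005, 995, 1002, 998])   # passes
```

In practice these assertions live in a framework (a dbt `not_null` test, a Great Expectations suite) so that failures block downstream tasks automatically rather than relying on ad-hoc calls.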
Incremental Processing and Idempotency
Pipelines must be idempotent: re-running a failed pipeline for the same time period should produce the same result, not duplicate data. Full replacement: for each partition (e.g., date partition), DELETE existing data for that partition, then INSERT the new data. Re-running produces the same result. Use this for small partitions or when data is recomputed from scratch each run. Upsert (MERGE): for each processed row, INSERT if the key doesn’t exist, UPDATE if it does. Handles late-arriving updates and corrections. Requires a unique key per row. BigQuery MERGE statement and dbt incremental models implement this pattern. Watermark-based incremental: track the last processed timestamp or offset per source. On each run: process only records newer than the watermark. Update the watermark after successful processing. Efficient for high-volume sources where full reprocessing is too slow. Risk: records with timestamps in the past (late data) are missed if they arrive after the watermark has advanced past their timestamp. Use a lookback window (process the last 24 hours, not just since last run) to catch late arrivals at the cost of reprocessing recent data.
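The watermark-plus-lookback pattern can be sketched as follows; the 24-hour window, the row shapes, and the in-memory "source" are illustrative assumptions:

```python
from datetime import datetime, timedelta

LOOKBACK = timedelta(hours=24)  # illustrative lookback window

def incremental_run(source_rows, watermark):
    """Select rows newer than (watermark - lookback); return batch + new watermark.

    Because the lookback reprocesses recent rows on every run, the load step
    must be idempotent (e.g. MERGE on a unique key) to avoid duplicates."""
    cutoff = watermark - LOOKBACK
    batch = [r for r in source_rows if r["updated_at"] > cutoff]
    new_watermark = max((r["updated_at"] for r in batch), default=watermark)
    return batch, new_watermark

rows = [
    {"id": 1, "updated_at": datetime(2024, 1, 1, 6)},   # already processed
    {"id": 2, "updated_at": datetime(2024, 1, 2, 3)},   # late arrival
    {"id": 3, "updated_at": datetime(2024, 1, 2, 9)},   # new
]
batch, wm = incremental_run(rows, watermark=datetime(2024, 1, 2, 8))
print([r["id"] for r in batch], wm)  # [2, 3] 2024-01-02 09:00:00
```

Note that row 2's timestamp is older than the watermark, so a strict "newer than watermark" filter would have skipped it; the lookback window is what catches it, at the cost of reprocessing row 3's vicinity on the next run.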