A data pipeline moves data from source systems (production databases, event streams, APIs) through transformation stages to analytical destinations (data warehouses, ML feature stores, reporting databases). Well-designed pipelines are reliable (they handle failures gracefully), scalable (they process growing data volumes without redesign), and maintainable (observable, testable, and evolvable). The primary tools are Apache Airflow and dbt for orchestration and SQL transformation, and Spark, Apache Flink, and Kafka Streams for batch and stream processing. Understanding data pipeline design is increasingly relevant for system design interviews at data-intensive companies.
ETL vs ELT Architecture
ETL (Extract-Transform-Load): extract data from the source, transform it in a dedicated processing layer (Spark, custom Python), and load the transformed data into the destination warehouse. Transformations happen outside the warehouse, so the warehouse stores only clean, processed data. This is a good fit for complex transformations requiring custom code, or when warehouse compute is expensive. ELT (Extract-Load-Transform): extract raw data from the source, load it as-is into the warehouse (the raw data layer), then transform it using the warehouse's SQL engine (dbt models, BigQuery queries). Raw data is preserved for reprocessing, and transformations leverage warehouse compute (BigQuery, Snowflake, and Redshift scale to petabytes). Modern data stacks (dbt + Snowflake/BigQuery) are predominantly ELT: warehouse storage is cheap enough that keeping raw data is cost-effective, and SQL transformations are maintainable and testable.
-- dbt ELT transformation example
-- Raw layer: loaded as-is from source
-- Staging layer: clean and type-cast raw data
-- Marts: business-logic aggregations

-- models/staging/stg_orders.sql
SELECT
    order_id::UUID AS order_id,
    user_id::BIGINT AS user_id,
    created_at::TIMESTAMP AS created_at,
    status::TEXT AS status,
    total_amount::NUMERIC(10,2) AS total_amount,
    -- Clean nulls, standardize casing
    COALESCE(LOWER(country_code), 'us') AS country_code
FROM raw.orders
WHERE created_at >= '2020-01-01' -- filter out old test data

-- models/marts/daily_revenue.sql (incremental model)
-- unique_key must cover the full grain (date AND country),
-- otherwise the incremental merge collapses countries into one row
{{ config(materialized='incremental', unique_key=['sale_date', 'country_code']) }}
SELECT
    DATE_TRUNC('day', created_at) AS sale_date,
    country_code,
    COUNT(*) AS order_count,
    SUM(total_amount) AS revenue
FROM {{ ref('stg_orders') }}
WHERE status = 'completed'
{% if is_incremental() %}
  AND created_at >= (SELECT MAX(sale_date) FROM {{ this }})
{% endif %}
GROUP BY 1, 2
Pipeline Orchestration: Apache Airflow
Airflow orchestrates data pipelines as DAGs (Directed Acyclic Graphs) of tasks. Each task is a unit of work (extract a table, run a dbt model, trigger a Spark job). Airflow manages scheduling (run daily at 2am), dependency ordering (task B runs after task A), retries (retry 3 times with exponential backoff), and alerting (email on failure). Sensors wait for external conditions (e.g., for a file to land in S3). Key concepts: DAG = one pipeline definition; Task = one node in the DAG; Operator = task type (PythonOperator, BashOperator, BigQueryOperator, SparkSubmitOperator); XCom = small data passed between tasks within a DAG run. Pitfalls: Airflow is an orchestrator, not a data processor; do not process large datasets inside Airflow tasks, but trigger Spark or dbt to do the heavy lifting. Airflow state lives in a metadata database (typically PostgreSQL); back up this database for disaster recovery.
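The two core ideas above, dependency ordering and retries with exponential backoff, can be illustrated without Airflow itself. The sketch below is a minimal pure-Python stand-in (deliberately not Airflow's API; the task names and callables are illustrative) using the standard library's graphlib for topological ordering:

```python
# Minimal stand-in for Airflow's core ideas: a DAG of tasks executed in
# dependency order, each retried with exponential backoff on failure.
# (Illustrative only; real Airflow DAGs use the airflow.DAG / operator API.)
from graphlib import TopologicalSorter

def run_with_retries(task_id, fn, max_retries=3, base_delay=1.0):
    """Run fn(), retrying up to max_retries times with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_retries:
                raise RuntimeError(f"{task_id} failed after {max_retries} retries") from exc
            delay = base_delay * 2 ** attempt  # 1s, 2s, 4s, ...
            print(f"{task_id}: attempt {attempt + 1} failed, retrying in {delay}s")
            # a real scheduler would sleep(delay) here; skipped in this sketch

def run_dag(tasks, deps):
    """tasks: {task_id: callable}; deps: {task_id: set of upstream task_ids}.
    Executes every task in topological (dependency) order."""
    order = list(TopologicalSorter(deps).static_order())
    for task_id in order:
        run_with_retries(task_id, tasks[task_id])
    return order

# Hypothetical pipeline: extract -> transform -> publish
tasks = {
    "extract_orders": lambda: print("extracting orders"),
    "run_dbt_models": lambda: print("running dbt"),
    "publish_report": lambda: print("publishing report"),
}
deps = {
    "run_dbt_models": {"extract_orders"},
    "publish_report": {"run_dbt_models"},
}
print(run_dag(tasks, deps))
```

In real Airflow, the `deps` dict corresponds to `task_a >> task_b` edges, and retry count/backoff are per-task arguments handled by the scheduler.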
Streaming Pipelines: Kafka + Flink
Batch pipelines process data on a schedule (hourly, daily), which introduces latency; streaming pipelines process events as they arrive. Architecture: producers write events to Kafka topics; Flink (or Spark Structured Streaming) reads from Kafka, applies transformations (filter, aggregate, join), and writes results to sinks (a database, another Kafka topic, Elasticsearch). Flink features relevant to pipelines: windowed aggregations (count events per 5-minute tumbling window), stateful processing (maintain running counts without reprocessing from scratch), exactly-once semantics (with Kafka's transactional APIs), and watermarks for late-data handling (an event timestamped 2 minutes ago is still assigned to its correct window even if it arrives 10 minutes late). Use streaming pipelines for real-time dashboards, fraud detection, live recommendation updates, and alerting on metric thresholds.
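The 5-minute tumbling-window count mentioned above can be sketched in plain Python. This is a toy stand-in for a Flink keyed window job (event tuples and timestamps are hypothetical, not Flink's API), showing the key property that events are assigned to windows by event time, not arrival time:

```python
# Toy tumbling-window aggregation: count events per 5-minute window of
# event time. A late-arriving event is still counted in the window its
# timestamp belongs to (the behavior watermarks enable in Flink), as long
# as it arrives before that window is finalized.
from collections import defaultdict

WINDOW_SECONDS = 300  # 5-minute tumbling windows

def window_start(event_ts):
    """Align an event timestamp (epoch seconds) to its window's start."""
    return event_ts - (event_ts % WINDOW_SECONDS)

def count_per_window(events):
    """events: iterable of (event_ts, payload) in arrival order.
    Returns {window_start: count}, assigning by event time."""
    counts = defaultdict(int)
    for event_ts, _payload in events:
        counts[window_start(event_ts)] += 1
    return dict(counts)

# Events arrive out of order: the event at t=299 arrives after t=601,
# but still lands in the first window [0, 300).
events = [(10, "a"), (305, "b"), (601, "c"), (299, "late"), (590, "d")]
print(count_per_window(events))  # {0: 2, 300: 2, 600: 1}
```

A real Flink job adds what this sketch omits: windows are closed and emitted once the watermark passes their end (plus allowed lateness), and the per-window state is checkpointed for fault tolerance.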
Key Interview Discussion Points
- Idempotency and reruns: pipelines must be safe to rerun (a failed daily run reprocessed the next morning should not double-count data); use INSERT OR REPLACE, MERGE, or natural keys with UPSERT semantics to make reruns idempotent
- Data quality monitoring: test row counts (expected range), null rates (no required field should be null), value distributions (revenue should not be negative), and referential integrity (every order_id in fact table must exist in dimension table); dbt tests, Great Expectations, and Soda implement data quality checks
- Backfill: processing historical data after a pipeline is first deployed or after a bug fix; for streaming pipelines this requires sufficient Kafka retention (keep events long enough to replay) or replay from a durable S3/GCS event archive
- Change Data Capture (CDC): instead of full table exports, use CDC tools (Debezium) to capture only changed rows from the database transaction log (WAL in PostgreSQL, binlog in MySQL) and stream them to Kafka in real time
- Schema management: data pipelines are brittle to schema changes in source systems; use schema registries (Confluent Schema Registry) for Kafka Avro, or schema-on-read strategies (store raw JSON and cast on read) to decouple pipelines from source schema evolution
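The idempotency point above can be made concrete: an UPSERT keyed on the natural key makes a rerun overwrite existing rows instead of double-counting them. A sketch using SQLite's INSERT ... ON CONFLICT (the same pattern as a warehouse MERGE; the table and column names are illustrative):

```python
# Idempotent daily-aggregate load: rerunning the same day's load UPSERTs on
# the natural key (sale_date, country_code) rather than inserting duplicate
# rows, so a retried pipeline run cannot double-count revenue.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE daily_revenue (
        sale_date    TEXT,
        country_code TEXT,
        revenue      REAL,
        PRIMARY KEY (sale_date, country_code)
    )
""")

def load_daily_revenue(rows):
    """Idempotent load: the same input rows can be replayed safely."""
    conn.executemany("""
        INSERT INTO daily_revenue (sale_date, country_code, revenue)
        VALUES (?, ?, ?)
        ON CONFLICT (sale_date, country_code) DO UPDATE SET
            revenue = excluded.revenue
    """, rows)
    conn.commit()

rows = [("2024-06-01", "us", 1200.0), ("2024-06-01", "de", 450.0)]
load_daily_revenue(rows)
load_daily_revenue(rows)  # rerun after a failure: no double counting

total = conn.execute("SELECT SUM(revenue) FROM daily_revenue").fetchone()[0]
print(total)  # 1650.0, not 3300.0
```

A plain INSERT would have produced 3300.0 after the rerun; the conflict target on the natural key is what makes the reprocessed run safe.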