Low Level Design: Data Pipeline Architecture

Pipeline Types: Batch, Micro-Batch, and Streaming

Data pipelines exist on a spectrum that trades latency against throughput:

  • Batch ETL: Scheduled Spark jobs (hourly, daily) read from source systems, transform, and write to a data warehouse. High throughput, high latency. Appropriate when freshness requirements are measured in hours, not seconds. Simple to reason about — each run processes a bounded dataset.
  • Micro-batch: Spark Structured Streaming processes data in small, frequent batches (seconds to minutes). Looks like streaming from the consumer’s perspective but retains Spark’s batch execution model. Easier to operate than true streaming, at the cost of higher latency.
  • Real-time streaming: Apache Flink processes events one at a time as they arrive, with sub-second latency. True event-time semantics, stateful operators, and exactly-once processing guarantees. Higher operational complexity; required when freshness is measured in milliseconds (fraud detection, real-time bidding, alerting).

The right choice depends on latency SLOs, data volume, and team operational capacity. Most organizations maintain all three tiers for different use cases.

Lambda Architecture

Lambda architecture addresses the tension between correctness (batch) and latency (streaming) by running both in parallel:

  • Batch layer: Periodically recomputes results over the full historical dataset. Slow (hours) but correct — handles late-arriving data, corrects errors, uses exact algorithms. Stores immutable master data.
  • Speed layer: A streaming pipeline processes events in real time, producing approximate results with low latency. Only handles recent data (since the last batch run). May use approximate algorithms (HyperLogLog, Count-Min Sketch) for efficiency.
  • Serving layer: Merges batch views and speed layer views for queries. Recent events come from the speed layer; historical data comes from the batch layer. Together they provide a complete, low-latency view.

Lambda’s critical weakness: two code paths for the same business logic. The batch job and the streaming job must produce equivalent results, but they’re written in different frameworks (Spark vs. Flink), often by different teams. Keeping them in sync as business logic evolves is expensive. Bugs that exist only in the speed layer (not caught by batch reconciliation) cause subtle data quality issues.
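The serving layer’s merge step is simple in principle. A minimal sketch, with illustrative names (`batch_view`, `speed_view`, `query_count`) that are not from any specific framework:

```python
# Toy serving-layer merge: the batch view holds exact counts up to the
# last batch run; the speed view holds counts for events since then.
# A query sums the two to get a complete, low-latency answer.

def query_count(key, batch_view, speed_view):
    """Merge batch-layer and speed-layer counts for a single key."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)

batch_view = {"clicks:2024-01-15": 1_000_000}  # recomputed by the batch layer
speed_view = {"clicks:2024-01-15": 4_213}      # events since the last batch run

print(query_count("clicks:2024-01-15", batch_view, speed_view))  # 1004213
```

The hard part is not the merge itself but ensuring both views were computed by logic that stays equivalent as it evolves, which is exactly the weakness described above.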

Kappa Architecture

Kappa architecture, proposed by Jay Kreps (Kafka co-creator), eliminates the batch layer entirely:

  • Single streaming pipeline: All data processing — both real-time and historical — is done by the streaming layer. One code path, one framework.
  • Historical reprocessing via Kafka replay: When logic changes or data needs correction, replay events from the beginning of the Kafka topic. The streaming job reprocesses history at high throughput, producing corrected output. This requires Kafka to retain data long enough for full replay (weeks to months of retention, or tiered storage).

Kappa’s advantage is simplicity: one codebase, one operational system, no reconciliation between batch and speed layers. The trade-off is that reprocessing large historical datasets through a streaming system can be slow and resource-intensive. Kafka tiered storage (offloading old data to object storage) makes long retention economically viable. Kappa is now the dominant architecture for teams using Flink or Kafka Streams as their primary processing engine.
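The replay idea can be sketched in a few lines. Here the "topic" is an in-memory list standing in for a Kafka topic with long retention; the names are illustrative, not a real consumer API:

```python
# Toy Kappa-style reprocessing: the event log is immutable and append-only.
# When the transform logic changes, replay from offset 0 with the new
# logic to rebuild a corrected output table.

log = [("user1", 10), ("user2", 5), ("user1", 7)]  # immutable event log

def run_pipeline(events, transform):
    """Consume the log from the beginning, folding events into a state table."""
    state = {}
    for user, amount in events:
        state[user] = state.get(user, 0) + transform(amount)
    return state

v1 = run_pipeline(log, lambda x: x)        # original logic
v2 = run_pipeline(log, lambda x: x * 100)  # corrected logic: amounts were in cents

print(v1)  # {'user1': 17, 'user2': 5}
print(v2)  # {'user1': 1700, 'user2': 500}
```

In a real deployment, v2 would run as a new streaming job reading the topic from the earliest offset, writing to a new output table that replaces the old one once it catches up.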

Data Quality Checks

Pipelines without quality checks are liability generators. Quality enforcement runs at multiple levels:

  • Schema validation: Verify that input data conforms to the expected schema. Reject or quarantine records with unexpected columns, wrong types, or missing required fields.
  • Null checks: Assert that columns expected to be non-null are not null. A 5% null rate on a primary key field signals an upstream bug.
  • Range checks: Values must fall within expected bounds. An age field of 500 or a price of -$10 indicates data corruption.
  • Referential integrity: Foreign keys must resolve. Orders referencing non-existent customer IDs indicate a join that missed records.
  • Deduplication: Detect and remove duplicate records caused by at-least-once delivery semantics in upstream systems.
  • Row count anomaly detection: Statistical anomaly detection on input/output row counts. A 90% drop in daily events almost always signals a bug or an outage — catch it before the dashboard shows zeros.

Quality checks are implemented as assertions in the pipeline code or as separate validation jobs (Great Expectations, dbt tests). On violation: either fail the pipeline (blocking bad data from propagating) or route bad records to a dead-letter queue for investigation. Which strategy depends on how much data volume you can afford to lose vs. how much corruption you can tolerate in the output.
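A minimal sketch of in-pipeline checks with dead-letter routing; the record shape and thresholds are illustrative assumptions, and real deployments would use a framework like Great Expectations:

```python
# Run null and range checks on each record; clean records continue,
# violating records are routed to a dead-letter queue with a reason.

def validate(record):
    """Return None if the record is clean, else a reason string."""
    if record.get("order_id") is None:
        return "null_primary_key"
    if not (0 <= record.get("age", 0) <= 130):
        return "age_out_of_range"
    if record.get("price", 0) < 0:
        return "negative_price"
    return None

def run_checks(records):
    clean, dead_letter = [], []
    for r in records:
        reason = validate(r)
        if reason is None:
            clean.append(r)
        else:
            dead_letter.append((r, reason))  # quarantine for investigation
    return clean, dead_letter

records = [
    {"order_id": 1, "age": 34, "price": 20.0},
    {"order_id": None, "age": 30, "price": 5.0},  # quarantined
    {"order_id": 2, "age": 500, "price": 9.0},    # quarantined
]
clean, dlq = run_checks(records)
print(len(clean), len(dlq))  # 1 2
```

Swapping the dead-letter branch for a raised exception turns this into the fail-the-pipeline strategy.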

Schema Evolution

Schemas change. The question is whether those changes break existing consumers:

  • Backward compatible: New schema can read data written with old schema. Adding an optional field with a default is backward compatible — old consumers ignore the new field; new consumers get the default for old records.
  • Forward compatible: Old schema can read data written with new schema. Consumers ignore unknown fields they don’t recognize.
  • Full compatibility (both): The safest constraint. Adding optional fields achieves full compatibility. Renaming, removing, or changing types of required fields does not.

Confluent’s Schema Registry (commonly used with Avro) enforces compatibility rules centrally. Every message is tagged with a schema ID. When a producer registers a new schema version, the registry checks compatibility against the current version before accepting it. Consumers fetch the schema by ID from the registry to deserialize messages. This prevents the classic problem of a producer deploying a breaking schema change and silently corrupting downstream consumers.
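The core of a backward-compatibility check is small. A toy sketch in the spirit of a registry check, using plain dicts rather than real Avro schemas:

```python
# Backward compatible = a reader with the NEW schema can decode records
# written with the OLD schema. That holds iff every field the new schema
# adds is optional (has a default the reader can fall back to).

def is_backward_compatible(old_fields, new_fields):
    """Can a new-schema reader decode records written with the old schema?"""
    for name, spec in new_fields.items():
        if name not in old_fields and "default" not in spec:
            return False  # new required field: old records can't supply it
    return True

old = {"id": {"type": "long"}, "email": {"type": "string"}}
ok  = {**old, "plan": {"type": "string", "default": "free"}}  # optional field
bad = {**old, "plan": {"type": "string"}}                     # required field

print(is_backward_compatible(old, ok))   # True
print(is_backward_compatible(old, bad))  # False
```

A real registry also handles type promotions, aliases, and transitive compatibility across all prior versions, which this sketch ignores.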

Idempotent Pipelines

Pipelines fail. Reprocessing must produce the same output as the original run — otherwise reruns cause double-counting or other corruption. Idempotency strategies:

  • Upsert semantics: Write operations use INSERT … ON CONFLICT UPDATE (SQL) or merge operations. Re-inserting an existing record updates it to the same value rather than duplicating it.
  • Partition overwrites: For batch pipelines writing to partitioned tables (e.g., date-partitioned), overwrite the entire partition rather than appending. Rerunning a daily job for 2024-01-15 replaces the partition, not duplicates it.
  • Idempotency keys: Include a deterministic unique ID in each output record derived from the input. Downstream systems can detect and discard duplicates by checking if the key already exists.

Idempotency is non-negotiable for production pipelines. Exactly-once semantics in streaming (Flink, Kafka transactions) provide stronger guarantees but at higher overhead. At-least-once + idempotent sinks is often a more pragmatic pattern.
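The idempotency-key pattern can be sketched as follows, with a dict standing in for a database table and illustrative field names:

```python
import hashlib

# Idempotent sink: each output record carries a deterministic key derived
# from the input, and writes are upserts keyed on it, so reruns overwrite
# the same slot instead of creating duplicates.

def idempotency_key(record):
    """Deterministic key: same input record always yields the same key."""
    raw = f"{record['source']}:{record['event_id']}"
    return hashlib.sha256(raw.encode()).hexdigest()[:16]

def write(sink, record):
    sink[idempotency_key(record)] = record  # upsert semantics

sink = {}
event = {"source": "orders", "event_id": 42, "total": 99.5}
write(sink, event)
write(sink, event)  # rerun after a failure: no duplicate row
print(len(sink))  # 1
```

In SQL the same effect comes from INSERT … ON CONFLICT (idempotency_key) DO UPDATE, as described above.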

Orchestration with Airflow

Apache Airflow is the de facto standard for batch pipeline orchestration. Core concepts:

  • DAG (Directed Acyclic Graph): A pipeline is defined as a DAG of tasks. Edges define dependencies — task B runs only after task A succeeds. DAGs are defined in Python code and version-controlled.
  • Scheduling: DAGs run on a cron-like schedule. Each scheduled run is a DAG run with a logical execution date. Airflow distinguishes between execution date (the period being processed) and run date (when the DAG actually ran) — important for data partitioning.
  • Retries and SLA alerts: Tasks can retry on failure with configurable backoff. SLA misses (tasks running longer than expected) trigger alerts via email or PagerDuty.
  • Backfill: Run a DAG for historical execution dates. airflow dags backfill --start-date 2024-01-01 --end-date 2024-01-31 my_dag processes 31 days of data, respecting task dependencies. Essential for bootstrapping new pipelines or recovering from outages.

Airflow’s weakness: it’s a scheduler and monitor, not an execution engine. It tells Spark to run a job but doesn’t run the Spark job itself. Scaling Airflow itself (the scheduler, workers, metadata database) is an operational burden for large organizations. Alternatives include Prefect, Dagster, and cloud-native orchestrators (AWS Step Functions, GCP Workflows).
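The two core ideas — a DAG of dependencies executed in order, with per-task retries — fit in a toy orchestrator. This is pure Python to illustrate the model; it uses no Airflow APIs:

```python
from graphlib import TopologicalSorter

# Execute tasks in dependency order; each task gets max_retries extra
# attempts before the whole run fails (downstream tasks never start).

def run_dag(deps, tasks, max_retries=2):
    """deps: {task_name: {upstream_names}}; tasks: {task_name: callable}."""
    for name in TopologicalSorter(deps).static_order():
        for attempt in range(max_retries + 1):
            try:
                tasks[name]()
                break
            except Exception:
                if attempt == max_retries:
                    raise  # permanent failure; dependents are skipped

deps = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}
order = []
tasks = {name: (lambda name=name: order.append(name)) for name in deps}
run_dag(deps, tasks)
print(order)  # ['extract', 'transform', 'load']
```

Real Airflow adds scheduling, logical execution dates, distributed workers, and a metadata database on top of this core model — which is exactly the operational surface area that makes it heavy to run.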

Pipeline Observability and Data Contracts

Observability for data pipelines goes beyond "did the job complete":

  • Data freshness SLO: Define and alert on maximum acceptable age for each dataset. If the orders table hasn’t been updated in 2 hours, page the on-call engineer.
  • Row count anomaly detection: Track daily row counts per table. Flag drops or spikes beyond historical norms (e.g., 3-sigma deviation). Monte Carlo, Bigeye, and dbt tests implement this.
  • Schema drift alerts: Alert when columns are added, removed, or type-changed in source data. Schema drift is the #1 cause of silent data quality degradation.
  • Data lineage: Track which pipeline produced each dataset, from which sources, at what time. OpenLineage (open standard) + Marquez (lineage server) provide lineage collection and visualization. Lineage is essential for impact analysis: "if I change this source table, which 47 downstream dashboards break?"
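The 3-sigma row-count check mentioned above is simple to implement. A minimal sketch, with an illustrative trailing window:

```python
import statistics

# Flag today's row count if it deviates more than `sigmas` standard
# deviations from the trailing history of daily counts.

def is_anomalous(history, today, sigmas=3.0):
    mean = statistics.mean(history)
    std = statistics.stdev(history)
    return abs(today - mean) > sigmas * std

history = [98_000, 101_000, 99_500, 100_200, 100_800, 99_100, 100_400]
print(is_anomalous(history, 100_300))  # False: within normal variation
print(is_anomalous(history, 9_000))    # True: ~90% drop, page someone
```

Production tools (Monte Carlo, Bigeye) layer seasonality handling and learned thresholds on top of this basic idea.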

Data contracts formalize the agreement between data producers and consumers: the schema, freshness SLA, and null/range constraints are documented and enforced. When a producer wants to make a breaking change, the contract forces coordination with downstream consumers before deployment. This shifts data quality left — catching issues at deployment time rather than weeks later, when analysts discover corrupted dashboards.
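A contract check might combine the schema, null, and freshness constraints into one gate. The contract shape below is an illustrative assumption, not a standard format:

```python
from datetime import datetime, timedelta, timezone

# Toy data contract: required columns, non-null fields, and a freshness
# SLA, checked together before a producer's output is accepted.

CONTRACT = {
    "columns": {"order_id", "customer_id", "total"},
    "not_null": {"order_id"},
    "max_age": timedelta(hours=2),
}

def contract_violations(rows, last_updated, now, contract=CONTRACT):
    violations = []
    if now - last_updated > contract["max_age"]:
        violations.append("freshness_sla_missed")
    for row in rows:
        if set(row) != contract["columns"]:
            violations.append("schema_mismatch")
        for col in contract["not_null"]:
            if row.get(col) is None:
                violations.append(f"null_{col}")
    return violations

now = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)
rows = [{"order_id": 1, "customer_id": 7, "total": 30.0}]
print(contract_violations(rows, now - timedelta(hours=1), now))  # []
print(contract_violations(rows, now - timedelta(hours=3), now))  # ['freshness_sla_missed']
```

The key design point is that the producer runs this gate, not the consumer — violations block the producer's deploy instead of surfacing downstream.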

dbt: Transformation as Code

dbt (data build tool) has become the standard for SQL-based data transformations in the warehouse. Key capabilities:

  • Modular SQL models: Each model is a SELECT statement saved as a .sql file. dbt handles DDL — it creates or replaces the target table/view. Models reference each other with {{ ref('model_name') }}, building an implicit DAG.
  • Tests: Built-in tests (not_null, unique, accepted_values, relationships) run as assertions after each build. Custom tests are SQL queries that return rows on failure.
  • Documentation: Column descriptions live in YAML files alongside SQL. dbt generates a searchable documentation site from these descriptions, including the lineage graph.
  • Lineage graph: dbt computes the full DAG of model dependencies and renders it visually. Instantly shows what breaks if a source model changes.

dbt runs in the warehouse — it pushes SQL to Snowflake, BigQuery, Redshift, or Databricks and lets the warehouse engine handle execution. This means it benefits from warehouse-scale compute without managing a separate Spark cluster for transformations. The combination of dbt (transformations) + Airflow (orchestration) + dbt tests (quality) + OpenLineage (lineage) covers most of a modern data platform’s operational needs.
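How {{ ref('model_name') }} induces the implicit DAG can be sketched by scanning each model’s SQL for ref() calls and ordering models so dependencies build first. Real dbt does far more (Jinja rendering, adapters, materializations); this is only the dependency-resolution idea:

```python
import re
from graphlib import TopologicalSorter

# Toy ref() resolution: extract ref('name') calls from each model's SQL,
# build the dependency graph, and compute a valid build order.

MODELS = {
    "stg_orders": "SELECT * FROM raw.orders",
    "stg_customers": "SELECT * FROM raw.customers",
    "fct_revenue": """
        SELECT c.region, SUM(o.total) AS revenue
        FROM {{ ref('stg_orders') }} o
        JOIN {{ ref('stg_customers') }} c ON o.customer_id = c.id
        GROUP BY 1
    """,
}

def build_order(models):
    deps = {
        name: set(re.findall(r"\{\{\s*ref\('([^']+)'\)\s*\}\}", sql))
        for name, sql in models.items()
    }
    return list(TopologicalSorter(deps).static_order())

order = build_order(MODELS)
print(order.index("fct_revenue") > order.index("stg_orders"))  # True
```

The same extracted graph powers dbt’s lineage visualization and its ability to build only a model’s upstream or downstream selection.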
