Data Pipeline System Low-Level Design

What is a Data Pipeline?

A data pipeline is a sequence of data processing steps: collect → transform → store → serve. Modern pipelines distinguish batch (hourly/daily, high throughput) from streaming (real-time, low latency). The choice depends on acceptable latency: dashboards updating every second need streaming; nightly revenue reports need batch.

Architecture: Batch Pipeline

Source (DB/S3/API) → Ingestion → Data Lake (S3/GCS)
    → Transform (Spark/dbt)
    → Data Warehouse (BigQuery/Snowflake/Redshift)
    → Serve (Looker/Tableau/ad-hoc SQL)

ETL (Extract-Transform-Load) vs ELT (Extract-Load-Transform): ETL transforms before loading (traditional, good for data quality enforcement); ELT loads raw then transforms in the warehouse (modern cloud DWH approach — cheap storage, powerful SQL transforms). dbt (data build tool) is the standard for ELT transforms: SQL + Jinja templates, incremental models, lineage tracking.
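The ELT ordering can be sketched end to end with sqlite3 standing in for the cloud warehouse (table and column names here are illustrative, not from the text): raw rows are loaded untouched, then a derived model is built purely in SQL, which is the step dbt would own.

```python
import sqlite3

# ELT sketch: load raw data as-is, then transform with SQL *inside*
# the warehouse (sqlite3 stands in for BigQuery/Snowflake here).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_events (event_id TEXT, event_type TEXT, amount_cents INTEGER)"
)

# Extract + Load: raw rows land untouched, preserved for reprocessing
raw = [("e1", "purchase", 999), ("e2", "purchase", 500), ("e3", "page_view", None)]
conn.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", raw)

# Transform: a derived model expressed entirely in SQL (what dbt would manage)
conn.execute("""
    CREATE TABLE revenue_by_type AS
    SELECT event_type, SUM(amount_cents) AS revenue_cents
    FROM raw_events
    WHERE event_type = 'purchase'
    GROUP BY event_type
""")
print(conn.execute("SELECT revenue_cents FROM revenue_by_type").fetchone()[0])  # 1499
```

Because the raw table is never mutated, a buggy transform can simply be rewritten and re-run against it, which is the "transforms can be changed without re-ingestion" benefit.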

Architecture: Streaming Pipeline

Event Source → Kafka → Stream Processor (Flink / Kafka Streams)
                     → Serving DB (Druid / ClickHouse / Redis)
                     → Alert System (PagerDuty / Slack)
                     → S3 (raw archive, for batch reprocessing)

Data Model: Event Schema

Event(event_id UUID, event_type VARCHAR, user_id VARCHAR, session_id UUID,
      properties JSONB, client_ts TIMESTAMP, server_ts TIMESTAMP,
      app_version VARCHAR, platform VARCHAR, ingested_at TIMESTAMP)
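The same schema as a Python dataclass, as a sketch (the Python types are rough equivalents of the column types; the defaults are an assumption, not from the text):

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Event:
    # Mirrors the event schema above; Python types approximate the SQL ones
    event_id: str        # UUID
    event_type: str
    user_id: str
    session_id: str      # UUID
    properties: dict[str, Any] = field(default_factory=dict)  # JSONB
    client_ts: float = 0.0   # client clock; may be skewed or wrong
    server_ts: float = 0.0   # stamped at ingestion, trusted
    app_version: str = ""
    platform: str = ""
    ingested_at: float = 0.0

e = Event(event_id="e1", event_type="page_view", user_id="u1", session_id="s1")
print(e.event_type)  # page_view
```

Keeping both client_ts and server_ts matters: client clocks drift, so aggregations usually key on server_ts while client_ts is retained for debugging.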

Use a schema registry (Confluent Schema Registry or AWS Glue) to enforce schema versions. Use Avro or Protobuf for efficient serialization; JSON is human-readable but 3-10x larger than binary formats.

Ingestion Layer

Client SDKs batch events locally (10 events or 5 seconds, whichever comes first) before sending, to reduce API load. The collector API validates the schema, enriches each event (adds server_ts, geo from IP, device type), and publishes to Kafka, returning 200 immediately; never block the client on downstream processing. Partition Kafka by user_id for ordered per-user processing, and route malformed events that fail validation to a dead-letter topic.
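A minimal collector sketch of the validate → enrich → route steps (the topic names, required-field set, and partition count are illustrative; the md5-based partitioner is a stand-in, since Kafka's default partitioner actually uses murmur2):

```python
import hashlib
import time

NUM_PARTITIONS = 12
REQUIRED_FIELDS = {"event_id", "event_type", "user_id", "client_ts"}

def partition_for(user_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Stable hash of user_id -> partition, so one user's events stay ordered."""
    digest = hashlib.md5(user_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

def ingest(event: dict) -> tuple[str, int]:
    """Validate, enrich, and route one event; returns (topic, partition).
    Malformed events go to a dead-letter topic instead of being dropped."""
    if not REQUIRED_FIELDS <= event.keys():
        return ("events.dead_letter", 0)
    event["server_ts"] = time.time()  # enrichment: trusted server clock
    return ("events.raw", partition_for(event["user_id"]))

topic, part = ingest({"event_id": "e1", "event_type": "click",
                      "user_id": "u42", "client_ts": 1700000000.0})
print(topic, 0 <= part < NUM_PARTITIONS)  # events.raw True
```

The key property is that partitioning is a pure function of user_id, so every event for the same user lands on the same partition regardless of which collector instance handled it.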

Transformation Patterns

  • Sessionization: group events by session (30-min inactivity gap). Flink session windows or offline SQL (GROUP BY user_id + session_id).
  • Deduplication: clients may retry failed sends. Deduplicate by event_id (UUID): INSERT … ON CONFLICT (event_id) DO NOTHING, or filter in stream processor by tracking seen event_ids with a Bloom filter (TTL=1hr).
  • Enrichment: join events with dimension tables (user profile, product catalog). In batch: SQL JOIN in the warehouse. In streaming: Flink async I/O to fetch user data from Redis.
  • Aggregation: TUMBLE windows (fixed, non-overlapping: 1-minute buckets), HOP windows (overlapping: 5-minute window sliding every 1 minute), SESSION windows (gap-based).
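Sessionization by inactivity gap, for example, reduces to a small fold over one user's sorted timestamps. This is a sketch of the offline variant; Flink session windows compute the same grouping incrementally:

```python
GAP = 30 * 60  # 30-minute inactivity gap, in seconds

def sessionize(timestamps: list[float]) -> list[list[float]]:
    """Split one user's event timestamps into sessions: a new session
    starts whenever the gap since the previous event exceeds GAP."""
    sessions: list[list[float]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= GAP:
            sessions[-1].append(ts)  # continue the current session
        else:
            sessions.append([ts])    # gap exceeded: open a new session
    return sessions

# Three events close together, then one 40 minutes later -> two sessions
print(len(sessionize([0, 60, 120, 120 + 40 * 60])))  # 2
```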

Data Quality

  • Schema validation: reject events with missing required fields at ingestion
  • Volume checks: alert if event volume drops by >20% vs same hour yesterday (indicates SDK bug or outage)
  • Null rate monitoring: alert if key fields have null rate > threshold
  • End-to-end latency: measure time from client_ts until the event is available in the serving layer; alert if >10 minutes
  • dbt tests: NOT NULL, UNIQUE, referential integrity checks run on each pipeline run
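The volume and null-rate checks above are simple ratios. A hedged sketch, using the thresholds from the bullets (function names and the sample rows are illustrative):

```python
def volume_drop(current: int, baseline: int) -> float:
    """Fractional drop vs the same hour yesterday (negative means growth)."""
    return (baseline - current) / baseline if baseline else 0.0

def null_rate(rows: list[dict], field: str) -> float:
    """Share of rows where a key field is missing or null."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(field) is None) / len(rows)

rows = [{"user_id": "u1"}, {"user_id": None}, {"user_id": "u2"}, {}]
print(volume_drop(current=700, baseline=1000))  # 0.3 -> breaches the 20% threshold
print(null_rate(rows, "user_id"))               # 0.5
```

In production these ratios would be computed per hour per event_type and exported to Prometheus, with alerting rules encoding the thresholds.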

Orchestration

Airflow (or Prefect, Dagster): define DAGs of tasks with dependencies. Airflow schedules batch jobs, handles retries, sends alerts on failure. Key patterns: sensor tasks (wait for upstream S3 file before starting transform), dynamic task mapping (one task per partition), SLA monitoring (alert if DAG doesn’t complete within expected window). For streaming: Flink jobs are long-running processes managed by Kubernetes or Flink standalone cluster. Monitor via Flink dashboard + Prometheus metrics.
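Airflow's core value here, dependency ordering plus per-task retries, can be illustrated with a toy runner. This is a teaching sketch under simplified assumptions (no scheduling, sensors, or parallelism), not a substitute for a real orchestrator:

```python
from typing import Callable

def run_dag(tasks: dict[str, Callable[[], None]],
            deps: dict[str, list[str]],
            max_retries: int = 2) -> list[str]:
    """Run tasks in dependency order, retrying each up to max_retries times."""
    done: list[str] = []
    pending = dict(deps)
    while pending:
        # A task is ready once all of its upstream dependencies have finished
        ready = [t for t, d in pending.items() if all(u in done for u in d)]
        if not ready:
            raise RuntimeError("cycle or unsatisfiable dependency")
        for name in ready:
            for attempt in range(max_retries + 1):
                try:
                    tasks[name]()
                    break
                except Exception:
                    if attempt == max_retries:
                        raise  # retries exhausted: this is where alerting fires
            done.append(name)
            del pending[name]
    return done

order = run_dag(
    tasks={"extract": lambda: None, "transform": lambda: None, "load": lambda: None},
    deps={"extract": [], "transform": ["extract"], "load": ["transform"]},
)
print(order)  # ['extract', 'transform', 'load']
```

Airflow adds the parts deliberately left out here: cron-style schedules, sensors, distributed executors, and the SLA alerting described above.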

Key Design Decisions

  • Kafka as the central event bus — decouples all producers from all consumers
  • Raw event archive in S3 (immutable) — enables reprocessing when downstream bugs are found
  • ELT over ETL for cloud DWH — cheaper and more flexible, transforms can be changed without re-ingestion
  • Schema registry — prevents silent schema breaks from cascading through the pipeline



