Data Pipeline System Low-Level Design

What is a Data Pipeline?

A data pipeline is a sequence of data processing steps: collect → transform → store → serve. Modern pipelines distinguish batch (hourly/daily, high throughput) from streaming (real-time, low latency). The choice depends on acceptable latency: dashboards updating every second need streaming; nightly revenue reports need batch.

Architecture: Batch Pipeline

Source (DB/S3/API) → Ingestion → Storage (Data Lake: S3/GCS)
                                 → Transform (Spark/dbt)
                                 → Data Warehouse (BigQuery/Snowflake/Redshift)
                                 → Serve (Looker/Tableau/ad-hoc SQL)

ETL (Extract-Transform-Load) vs ELT (Extract-Load-Transform): ETL transforms before loading (traditional, good for data quality enforcement); ELT loads raw then transforms in the warehouse (modern cloud DWH approach — cheap storage, powerful SQL transforms). dbt (data build tool) is the standard for ELT transforms: SQL + Jinja templates, incremental models, lineage tracking.

Architecture: Streaming Pipeline

Event Source → Kafka → Stream Processor (Flink / Kafka Streams)
                     → Serving DB (Druid / ClickHouse / Redis)
                     → Alert System (PagerDuty / Slack)
                     → S3 (raw archive, for batch reprocessing)

Data Model: Event Schema

Event(event_id UUID, event_type VARCHAR, user_id, session_id,
      properties JSONB, client_ts TIMESTAMP, server_ts TIMESTAMP,
      app_version, platform, ingested_at)

Use a schema registry (Confluent Schema Registry or AWS Glue) to enforce schema versions. Avro or Protobuf formats for efficient serialization. JSON is human-readable but 3-10x larger than binary formats.

Ingestion Layer

Client SDKs batch events locally (10 events or 5 seconds, whichever first) before sending to reduce API load. Collector API: validate schema, enrich (add server_ts, geo from IP, device type), publish to Kafka. Return 200 immediately — never block on processing. Kafka partition by user_id for ordered per-user processing. Dead-letter topic for malformed events that fail validation.

Transformation Patterns

  • Sessionization: group events by session (30-min inactivity gap). Flink session windows or offline SQL (GROUP BY user_id + session_id).
  • Deduplication: clients may retry failed sends. Deduplicate by event_id (UUID): INSERT … ON CONFLICT (event_id) DO NOTHING, or filter in stream processor by tracking seen event_ids with a Bloom filter (TTL=1hr).
  • Enrichment: join events with dimension tables (user profile, product catalog). In batch: SQL JOIN in the warehouse. In streaming: Flink async I/O to fetch user data from Redis.
  • Aggregation: TUMBLE windows (fixed, non-overlapping: 1-minute buckets), HOP windows (overlapping: 5-minute window sliding every 1 minute), SESSION windows (gap-based).

Data Quality

  • Schema validation: reject events with missing required fields at ingestion
  • Volume checks: alert if event volume drops by >20% vs same hour yesterday (indicates SDK bug or outage)
  • Null rate monitoring: alert if key fields have null rate > threshold
  • End-to-end latency: measure time from client_ts to available in serving layer; alert if >10 minutes
  • dbt tests: NOT NULL, UNIQUE, referential integrity checks run on each pipeline run

Orchestration

Airflow (or Prefect, Dagster): define DAGs of tasks with dependencies. Airflow schedules batch jobs, handles retries, sends alerts on failure. Key patterns: sensor tasks (wait for upstream S3 file before starting transform), dynamic task mapping (one task per partition), SLA monitoring (alert if DAG doesn’t complete within expected window). For streaming: Flink jobs are long-running processes managed by Kubernetes or Flink standalone cluster. Monitor via Flink dashboard + Prometheus metrics.

Key Design Decisions

  • Kafka as the central event bus — decouples all producers from all consumers
  • Raw event archive in S3 (immutable) — enables reprocessing when downstream bugs are found
  • ELT over ETL for cloud DWH — cheaper and more flexible, transforms can be changed without re-ingestion
  • Schema registry — prevents silent schema breaks from cascading through the pipeline

Databricks system design focuses on data pipelines and streaming. See common questions for Databricks interview: data pipeline and stream processing design.

Amazon system design covers data pipelines and ETL architecture. Review patterns for Amazon interview: data pipeline and ETL system design.

LinkedIn system design covers large-scale data pipelines. See design patterns for LinkedIn interview: data pipeline and analytics system design.

See also: Netflix Interview Guide 2026: Streaming Architecture, Recommendation Systems, and Engineering Excellence

Scroll to Top