Data Pipeline System Low-Level Design

What is a Data Pipeline?

A data pipeline is a sequence of data processing steps: collect → transform → store → serve. Modern pipelines distinguish batch (hourly/daily, high throughput) from streaming (real-time, low latency). The choice depends on acceptable latency: dashboards updating every second need streaming; nightly revenue reports need batch.

Architecture: Batch Pipeline

Source (DB/S3/API) → Ingestion → Data Lake (S3/GCS)
    → Transform (Spark/dbt)
    → Data Warehouse (BigQuery/Snowflake/Redshift)
    → Serve (Looker/Tableau/ad-hoc SQL)

ETL (Extract-Transform-Load) vs ELT (Extract-Load-Transform): ETL transforms before loading (traditional, good for data quality enforcement); ELT loads raw then transforms in the warehouse (modern cloud DWH approach — cheap storage, powerful SQL transforms). dbt (data build tool) is the standard for ELT transforms: SQL + Jinja templates, incremental models, lineage tracking.
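The ELT ordering can be sketched end to end with sqlite3 standing in for the cloud warehouse (table and column names here are illustrative, not from the text): raw rows are loaded untouched, then a derived model is built purely in SQL, which is the step dbt would own.

```python
import sqlite3

# ELT sketch: load raw data as-is, then transform with SQL *inside*
# the warehouse (sqlite3 stands in for BigQuery/Snowflake here).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_events (event_id TEXT, event_type TEXT, amount_cents INTEGER)"
)

# Extract + Load: raw rows land untouched, preserved for reprocessing
raw = [("e1", "purchase", 999), ("e2", "purchase", 500), ("e3", "page_view", None)]
conn.executemany("INSERT INTO raw_events VALUES (?, ?, ?)", raw)

# Transform: a derived model expressed entirely in SQL (what dbt would manage)
conn.execute("""
    CREATE TABLE revenue_by_type AS
    SELECT event_type, SUM(amount_cents) AS revenue_cents
    FROM raw_events
    WHERE event_type = 'purchase'
    GROUP BY event_type
""")
print(conn.execute("SELECT revenue_cents FROM revenue_by_type").fetchone()[0])  # 1499
```

Because the raw table is never mutated, a buggy transform can simply be rewritten and re-run against it, which is the "transforms can be changed without re-ingestion" benefit.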

Architecture: Streaming Pipeline

Event Source → Kafka → Stream Processor (Flink / Kafka Streams)
                     → Serving DB (Druid / ClickHouse / Redis)
                     → Alert System (PagerDuty / Slack)
                     → S3 (raw archive, for batch reprocessing)

Data Model: Event Schema

Event(event_id UUID, event_type VARCHAR, user_id VARCHAR, session_id UUID,
      properties JSONB, client_ts TIMESTAMP, server_ts TIMESTAMP,
      app_version VARCHAR, platform VARCHAR, ingested_at TIMESTAMP)
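The same schema as a Python dataclass, as a sketch (the Python types are rough equivalents of the column types; the defaults are an assumption, not from the text):

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Event:
    # Mirrors the event schema above; Python types approximate the SQL ones
    event_id: str        # UUID
    event_type: str
    user_id: str
    session_id: str      # UUID
    properties: dict[str, Any] = field(default_factory=dict)  # JSONB
    client_ts: float = 0.0   # client clock; may be skewed or wrong
    server_ts: float = 0.0   # stamped at ingestion, trusted
    app_version: str = ""
    platform: str = ""
    ingested_at: float = 0.0

e = Event(event_id="e1", event_type="page_view", user_id="u1", session_id="s1")
print(e.event_type)  # page_view
```

Keeping both client_ts and server_ts matters: client clocks drift, so aggregations usually key on server_ts while client_ts is retained for debugging.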

Use a schema registry (Confluent Schema Registry or AWS Glue) to enforce schema versions. Use Avro or Protobuf for efficient serialization; JSON is human-readable but 3-10x larger than binary formats.

Ingestion Layer

Client SDKs batch events locally (10 events or 5 seconds, whichever comes first) before sending, to reduce API load. The collector API validates the schema, enriches each event (adds server_ts, geo from IP, device type), and publishes to Kafka, returning 200 immediately; never block the client on downstream processing. Partition Kafka by user_id for ordered per-user processing, and route malformed events that fail validation to a dead-letter topic.
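A minimal collector sketch of the validate → enrich → route steps (the topic names, required-field set, and partition count are illustrative; the md5-based partitioner is a stand-in, since Kafka's default partitioner actually uses murmur2):

```python
import hashlib
import time

NUM_PARTITIONS = 12
REQUIRED_FIELDS = {"event_id", "event_type", "user_id", "client_ts"}

def partition_for(user_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Stable hash of user_id -> partition, so one user's events stay ordered."""
    digest = hashlib.md5(user_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

def ingest(event: dict) -> tuple[str, int]:
    """Validate, enrich, and route one event; returns (topic, partition).
    Malformed events go to a dead-letter topic instead of being dropped."""
    if not REQUIRED_FIELDS <= event.keys():
        return ("events.dead_letter", 0)
    event["server_ts"] = time.time()  # enrichment: trusted server clock
    return ("events.raw", partition_for(event["user_id"]))

topic, part = ingest({"event_id": "e1", "event_type": "click",
                      "user_id": "u42", "client_ts": 1700000000.0})
print(topic, 0 <= part < NUM_PARTITIONS)  # events.raw True
```

The key property is that partitioning is a pure function of user_id, so every event for the same user lands on the same partition regardless of which collector instance handled it.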

Transformation Patterns

  • Sessionization: group events by session (30-min inactivity gap). Flink session windows or offline SQL (GROUP BY user_id + session_id).
  • Deduplication: clients may retry failed sends. Deduplicate by event_id (UUID): INSERT … ON CONFLICT (event_id) DO NOTHING, or filter in stream processor by tracking seen event_ids with a Bloom filter (TTL=1hr).
  • Enrichment: join events with dimension tables (user profile, product catalog). In batch: SQL JOIN in the warehouse. In streaming: Flink async I/O to fetch user data from Redis.
  • Aggregation: TUMBLE windows (fixed, non-overlapping: 1-minute buckets), HOP windows (overlapping: 5-minute window sliding every 1 minute), SESSION windows (gap-based).
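Sessionization by inactivity gap, for example, reduces to a small fold over one user's sorted timestamps. This is a sketch of the offline variant; Flink session windows compute the same grouping incrementally:

```python
GAP = 30 * 60  # 30-minute inactivity gap, in seconds

def sessionize(timestamps: list[float]) -> list[list[float]]:
    """Split one user's event timestamps into sessions: a new session
    starts whenever the gap since the previous event exceeds GAP."""
    sessions: list[list[float]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= GAP:
            sessions[-1].append(ts)  # continue the current session
        else:
            sessions.append([ts])    # gap exceeded: open a new session
    return sessions

# Three events close together, then one 40 minutes later -> two sessions
print(len(sessionize([0, 60, 120, 120 + 40 * 60])))  # 2
```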

Data Quality

  • Schema validation: reject events with missing required fields at ingestion
  • Volume checks: alert if event volume drops by >20% vs same hour yesterday (indicates SDK bug or outage)
  • Null rate monitoring: alert if key fields have null rate > threshold
  • End-to-end latency: measure time from client_ts until the event is available in the serving layer; alert if >10 minutes
  • dbt tests: NOT NULL, UNIQUE, referential integrity checks run on each pipeline run
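The volume and null-rate checks above are simple ratios. A hedged sketch, using the thresholds from the bullets (function names and the sample rows are illustrative):

```python
def volume_drop(current: int, baseline: int) -> float:
    """Fractional drop vs the same hour yesterday (negative means growth)."""
    return (baseline - current) / baseline if baseline else 0.0

def null_rate(rows: list[dict], field: str) -> float:
    """Share of rows where a key field is missing or null."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(field) is None) / len(rows)

rows = [{"user_id": "u1"}, {"user_id": None}, {"user_id": "u2"}, {}]
print(volume_drop(current=700, baseline=1000))  # 0.3 -> breaches the 20% threshold
print(null_rate(rows, "user_id"))               # 0.5
```

In production these ratios would be computed per hour per event_type and exported to Prometheus, with alerting rules encoding the thresholds.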

Orchestration

Airflow (or Prefect, Dagster): define DAGs of tasks with dependencies. Airflow schedules batch jobs, handles retries, sends alerts on failure. Key patterns: sensor tasks (wait for upstream S3 file before starting transform), dynamic task mapping (one task per partition), SLA monitoring (alert if DAG doesn’t complete within expected window). For streaming: Flink jobs are long-running processes managed by Kubernetes or Flink standalone cluster. Monitor via Flink dashboard + Prometheus metrics.
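Airflow's core value here, dependency ordering plus per-task retries, can be illustrated with a toy runner. This is a teaching sketch under simplified assumptions (no scheduling, sensors, or parallelism), not a substitute for a real orchestrator:

```python
from typing import Callable

def run_dag(tasks: dict[str, Callable[[], None]],
            deps: dict[str, list[str]],
            max_retries: int = 2) -> list[str]:
    """Run tasks in dependency order, retrying each up to max_retries times."""
    done: list[str] = []
    pending = dict(deps)
    while pending:
        # A task is ready once all of its upstream dependencies have finished
        ready = [t for t, d in pending.items() if all(u in done for u in d)]
        if not ready:
            raise RuntimeError("cycle or unsatisfiable dependency")
        for name in ready:
            for attempt in range(max_retries + 1):
                try:
                    tasks[name]()
                    break
                except Exception:
                    if attempt == max_retries:
                        raise  # retries exhausted: this is where alerting fires
            done.append(name)
            del pending[name]
    return done

order = run_dag(
    tasks={"extract": lambda: None, "transform": lambda: None, "load": lambda: None},
    deps={"extract": [], "transform": ["extract"], "load": ["transform"]},
)
print(order)  # ['extract', 'transform', 'load']
```

Airflow adds the parts deliberately left out here: cron-style schedules, sensors, distributed executors, and the SLA alerting described above.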

Key Design Decisions

  • Kafka as the central event bus — decouples all producers from all consumers
  • Raw event archive in S3 (immutable) — enables reprocessing when downstream bugs are found
  • ELT over ETL for cloud DWH — cheaper and more flexible, transforms can be changed without re-ingestion
  • Schema registry — prevents silent schema breaks from cascading through the pipeline



