What Is a Streaming Data Pipeline?
A streaming data pipeline continuously ingests, processes, and routes events in real time — typically with end-to-end latency under one second. Unlike batch pipelines that process accumulated data hourly or daily, streaming pipelines are always running, consuming from a durable log (Kafka), applying transformations (Flink, Spark Streaming), and writing results to sinks (databases, dashboards, downstream Kafka topics).
Common use cases: real-time fraud detection, live dashboards, recommendation model feature computation, IoT sensor processing, log anomaly detection.
System Requirements
Functional
- Ingest events from multiple producers (web, mobile, microservices)
- Process events: filter, transform, aggregate, join
- Support windowed aggregations (tumbling, sliding, session windows)
- Write results to multiple sinks: databases, search indexes, downstream topics
- Handle late-arriving events gracefully
Non-Functional
- End-to-end latency: <500ms for real-time; <5 seconds for near-real-time
- Throughput: 1M events/second per pipeline
- Exactly-once semantics for financial use cases
- At-least-once with idempotent consumers for analytics use cases
High-Level Architecture
The pipeline has four layers:
- Producers: services publish events to Kafka topics
- Kafka: durable, partitioned, replicated event log — the backbone
- Stream Processor: Flink or Spark Streaming jobs consume from Kafka, apply logic, write to sinks
- Sinks: Cassandra (time-series), Elasticsearch (search), Redis (real-time lookup), downstream Kafka topics
Kafka as the Backbone
Kafka is the right choice for a streaming pipeline backbone because:
- Durability: events are retained on disk (configurable retention, e.g., 7 days). Consumers can replay.
- Decoupling: producers don’t know about consumers. Multiple consumer groups read independently at their own offsets.
- Scalability: partition a topic by key (e.g., user_id). Each partition is consumed by one consumer in a group — linear scaling.
- Backpressure handling: if a consumer falls behind, messages wait in Kafka. No message loss, and the producer isn’t slowed down.
Kafka Partition Design
Partition count determines maximum parallelism. Rule of thumb: number of partitions = target throughput / throughput per consumer. For 1M events/sec with consumers processing 100K/sec each: 10 partitions minimum. Choose a partition key that distributes load evenly — avoid hot keys (e.g., don’t partition by country if 80% of traffic is from the US; use user_id hash instead).
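The hot-key point can be seen in a quick simulation. This is a toy sketch, not Kafka's actual partitioner (Kafka uses murmur2; md5 stands in here), and the traffic numbers are made up:

```python
import hashlib
from collections import Counter

NUM_PARTITIONS = 10

def partition_for(key: str) -> int:
    """Map a key to a partition the way a hash partitioner does
    (Kafka uses murmur2; md5 stands in here for illustration)."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# Partitioning by user_id spreads load roughly evenly across partitions...
user_loads = Counter(partition_for(f"user-{i}") for i in range(100_000))

# ...while partitioning by country concentrates it: if 80% of traffic
# is "US", a single partition receives 80% of all events.
events = ["US"] * 80 + ["DE"] * 10 + ["JP"] * 10
country_loads = Counter(partition_for(c) for c in events)

print(max(user_loads.values()) / min(user_loads.values()))  # close to 1.0
print(country_loads.most_common(1))  # one partition holds 80 of 100 events
```

The skew is structural: every "US" event hashes to the same partition, so no amount of added consumers fixes it.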
Apache Flink for Stream Processing
Flink is a distributed stateful stream processor. Key concepts:
Operators
- map/flatMap: transform one event into zero or more events
- filter: drop events not matching a predicate
- window: group events by time range for aggregation
- reduce/aggregate: compute running totals, counts, averages within a window
keyBy: route events with the same key to the same parallel operator (enables stateful per-key aggregations)
Window Types
- Tumbling window: fixed-size, non-overlapping. “Count events per minute.” Window [0:00, 1:00), [1:00, 2:00).
- Sliding window: fixed size, overlapping. “5-minute window, slide every 1 minute.” More CPU-intensive — each event belongs to multiple windows.
- Session window: closes after a configurable gap of inactivity. “A user session ends after 30 minutes of no events.” Window size is data-driven, not fixed.
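The difference between tumbling and sliding assignment can be sketched in a few lines (a toy model, not the Flink API; timestamps are in seconds):

```python
def tumbling_window(ts: int, size: int) -> tuple[int, int]:
    """Assign a timestamp to its single fixed, non-overlapping window."""
    start = (ts // size) * size
    return (start, start + size)

def sliding_windows(ts: int, size: int, slide: int) -> list[tuple[int, int]]:
    """Return every overlapping window containing this timestamp.
    With size=300 and slide=60, each event lands in 5 windows."""
    last_start = (ts // slide) * slide
    starts = range(last_start, last_start - size, -slide)
    return [(s, s + size) for s in starts if s <= ts < s + size]

print(tumbling_window(125, 60))            # (120, 180)
print(len(sliding_windows(125, 300, 60)))  # 5
```

This is why sliding windows cost more CPU: one event triggers bookkeeping in size/slide windows instead of exactly one.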
Event Time vs. Processing Time
Processing time is wall-clock time when the event arrives at Flink. Event time is when the event actually occurred (embedded in the event payload). Use event time for correct aggregations — mobile events can arrive 30 seconds late due to network issues. Flink tracks progress using watermarks: periodic markers that say “all events before timestamp T have arrived.” Events arriving after the watermark passes are “late” — configurable to drop, allow with side output, or update the window result.
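A minimal sketch of the bounded-out-of-orderness idea (the numbers and class are illustrative, not Flink's WatermarkStrategy API):

```python
class BoundedOutOfOrdernessWatermark:
    """Toy watermark generator: the watermark trails the maximum event
    timestamp seen so far by a fixed out-of-orderness bound."""

    def __init__(self, max_lateness: float):
        self.max_lateness = max_lateness
        self.max_ts = float("-inf")

    def observe(self, event_ts: float) -> bool:
        """Record an event; return True if it arrived behind the watermark."""
        is_late = event_ts <= self.watermark()
        self.max_ts = max(self.max_ts, event_ts)
        return is_late

    def watermark(self) -> float:
        return self.max_ts - self.max_lateness

wm = BoundedOutOfOrdernessWatermark(max_lateness=10.0)
print(wm.observe(100.0))  # False: first event cannot be late
print(wm.observe(95.0))   # False: within the 10s out-of-orderness bound
print(wm.observe(112.0))  # False: this advances the watermark to 102.0
print(wm.observe(101.0))  # True: the watermark has already passed 102.0
```

Note the trade-off the bound encodes: a larger bound tolerates more disorder but delays every window result by that much.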
State and Checkpointing
Flink operators can hold state (e.g., a running count per user). State is stored in RocksDB (large state) or on the JVM heap (small state). Flink checkpoints state to a distributed store (S3, HDFS) every N seconds. On failure, the job restarts from the last checkpoint; combined with transactional Kafka offset commits and transactional or idempotent sink writes, this yields an exactly-once processing guarantee.
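The checkpoint-and-restore cycle can be illustrated with a toy keyed operator. This is a sketch of the idea only: copy.deepcopy stands in for snapshotting to S3/HDFS, and the replay-from-Kafka step is just a comment:

```python
import copy

class KeyedCountOperator:
    """Toy stateful operator: running count per key, with checkpoint/restore.
    deepcopy stands in for writing a durable snapshot to S3/HDFS."""

    def __init__(self):
        self.state = {}          # key -> running count (RocksDB in real Flink)
        self._checkpoint = None

    def process(self, key: str):
        self.state[key] = self.state.get(key, 0) + 1

    def checkpoint(self):
        self._checkpoint = copy.deepcopy(self.state)

    def restore(self):
        # On failure, restart from the last checkpoint; events processed
        # after it are replayed from Kafka via the committed offset.
        self.state = copy.deepcopy(self._checkpoint)

op = KeyedCountOperator()
for k in ["alice", "bob", "alice"]:
    op.process(k)
op.checkpoint()      # durable point: {"alice": 2, "bob": 1}
op.process("alice")  # processed, then the job crashes...
op.restore()         # ...and restarts from the checkpoint
print(op.state)      # {'alice': 2, 'bob': 1}
```

The event processed after the checkpoint is not lost: it is still in Kafka and gets reprocessed, which is exactly why the sink must be transactional or idempotent.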
Handling Late Events
Three strategies:
- Drop late events: simplest. Acceptable for metrics dashboards where slight undercounting is fine.
- Allowed lateness: keep window state open for X seconds after the watermark. Recompute and update the result when late events arrive. Works for analytics use cases where slight delay in finalization is acceptable.
- Side output: route late events to a separate Kafka topic for offline reprocessing. Use when late events cannot be dropped but real-time windows must not be delayed.
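The allowed-lateness strategy can be sketched as a window that fires at the watermark but keeps its state around a little longer (a toy model; the class and its method names are invented, not Flink API):

```python
class WindowWithAllowedLateness:
    """Toy tumbling count window: fires when the watermark passes its end,
    stays open for `allowed_lateness` more seconds, and re-emits an updated
    result for each late event that still falls inside it."""

    def __init__(self, start, end, allowed_lateness):
        self.start, self.end = start, end
        self.allowed_lateness = allowed_lateness
        self.count = 0
        self.emitted = []  # every result sent downstream, updates included

    def add(self, ts, watermark):
        if not (self.start <= ts < self.end):
            return
        if watermark >= self.end + self.allowed_lateness:
            return  # state already purged: drop (or route to a side output)
        self.count += 1
        if watermark >= self.end:
            self.emitted.append(self.count)  # late update to the result

    def fire(self):
        self.emitted.append(self.count)

w = WindowWithAllowedLateness(start=0, end=60, allowed_lateness=30)
w.add(10, watermark=0)
w.add(50, watermark=40)
w.fire()                 # watermark passes 60: emit 2
w.add(55, watermark=70)  # late, but within allowed lateness: emit updated 3
w.add(58, watermark=95)  # watermark past 90: state purged, event dropped
print(w.emitted)         # [2, 3]
```

Downstream consumers must tolerate the updated result (here, a second emission for the same window), which is the cost of this strategy.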
Exactly-Once Semantics
Exactly-once requires coordination between Kafka and the sink:
- Flink reads from Kafka with isolation.level=read_committed
- Flink checkpoints atomically commit the Kafka consumer offset and snapshot operator state
- Flink uses a two-phase commit (2PC) sink: pre-commit on checkpoint, commit when checkpoint completes. If the job crashes after pre-commit but before commit, the checkpoint restarts and the pre-committed transaction is aborted.
- The sink must support transactions (Kafka producer transactions, PostgreSQL, etc.) or be idempotent (Cassandra with upsert by event_id)
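The 2PC lifecycle described above can be mocked up in a few lines. This is a toy simulation of the pre-commit/commit/abort hooks, not Flink's TwoPhaseCommitSinkFunction itself:

```python
class TwoPhaseCommitSink:
    """Toy 2PC sink: records buffer in an open transaction, pre-commit on
    checkpoint, commit when the checkpoint completes, abort on recovery."""

    def __init__(self):
        self.committed = []       # visible to downstream readers
        self.open_txn = []        # current transaction buffer
        self.pre_committed = None

    def write(self, record):
        self.open_txn.append(record)

    def pre_commit(self):
        # On checkpoint: flush durably, but keep the data invisible
        # (read_committed consumers do not see it yet).
        self.pre_committed = self.open_txn
        self.open_txn = []

    def commit(self):
        # When the checkpoint completes: make the data visible.
        self.committed.extend(self.pre_committed)
        self.pre_committed = None

    def abort(self):
        # On recovery after a crash between pre-commit and commit.
        self.pre_committed = None

sink = TwoPhaseCommitSink()
sink.write("e1"); sink.write("e2")
sink.pre_commit()
sink.commit()
sink.write("e3")
sink.pre_commit()
sink.abort()           # crash before commit: e3 is never exposed; it is
                       # reprocessed after restart from the last checkpoint
print(sink.committed)  # ['e1', 'e2']
```

The key invariant: data becomes visible exactly when its checkpoint completes, never before, so a replay after failure cannot produce duplicates downstream.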
Scaling the Pipeline
- Kafka: add partitions (requires consumer group rebalance) or add brokers
- Flink: increase parallelism — each operator runs multiple parallel instances. The keyBy operator routes events to the correct instance. Flink handles rebalancing automatically.
- Sinks: Cassandra and Elasticsearch scale horizontally. Use batch writes with configurable flush intervals (e.g., every 500ms or 1000 events, whichever comes first) to avoid per-event write amplification.
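The count-or-interval flush policy can be sketched as a small buffer. This is an illustrative class (the names are invented); the clock is injected so the example is deterministic, where real code would use time.monotonic:

```python
class BatchingSink:
    """Flush buffered writes every `max_batch` records or `max_delay`
    seconds, whichever comes first."""

    def __init__(self, flush_fn, max_batch=1000, max_delay=0.5, clock=None):
        self.flush_fn = flush_fn
        self.max_batch = max_batch
        self.max_delay = max_delay
        self.clock = clock or (lambda: 0.0)
        self.buffer = []
        self.last_flush = self.clock()

    def write(self, record):
        self.buffer.append(record)
        self.maybe_flush()

    def maybe_flush(self):
        overdue = self.clock() - self.last_flush >= self.max_delay
        if len(self.buffer) >= self.max_batch or (self.buffer and overdue):
            self.flush_fn(self.buffer)
            self.buffer = []
            self.last_flush = self.clock()

batches = []
now = [0.0]  # fake clock we can advance by hand
sink = BatchingSink(batches.append, max_batch=3, max_delay=0.5,
                    clock=lambda: now[0])
sink.write("a"); sink.write("b"); sink.write("c")  # size trigger: batch of 3
sink.write("d")
now[0] = 0.6
sink.maybe_flush()                                 # time trigger: batch of 1
print(batches)  # [['a', 'b', 'c'], ['d']]
```

The time trigger caps latency under light traffic; the size trigger caps memory and per-request overhead under heavy traffic.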
Monitoring and Alerting
- Consumer lag: Kafka consumer group lag = latest offset – committed offset. Alert if lag grows unexpectedly (consumer is falling behind — add parallelism or optimize processing).
- End-to-end latency: embed the event creation timestamp in the payload and measure at the sink. Alert when P99 latency exceeds the SLA.
- Checkpoint duration: slow checkpoints indicate state is too large or RocksDB is thrashing. Alert on checkpoint failures (means job cannot recover from failure).
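The lag formula above is one subtraction per partition; a sketch with made-up offsets (the threshold is illustrative, not a recommendation):

```python
def consumer_lag(latest_offsets, committed_offsets):
    """Per-partition lag = latest offset - committed offset.
    Offsets here are invented numbers, not from a real cluster."""
    return {p: latest_offsets[p] - committed_offsets.get(p, 0)
            for p in latest_offsets}

latest    = {0: 1_500_000, 1: 1_480_000, 2: 1_510_000}
committed = {0: 1_499_200, 1: 1_100_000, 2: 1_509_700}
lag = consumer_lag(latest, committed)

LAG_ALERT_THRESHOLD = 100_000  # hypothetical alerting cutoff
lagging = {p: n for p, n in lag.items() if n > LAG_ALERT_THRESHOLD}
print(lagging)  # {1: 380000}: partition 1 is falling behind
```

In practice you alert on the trend (lag growing over several minutes) rather than a single snapshot, since brief spikes during deploys are normal.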
Interview Tips
- Always clarify: exactly-once or at-least-once? This drives the complexity of your design significantly.
- Explain watermarks when discussing late event handling — it shows deep understanding of Flink’s execution model.
- For fraud detection, mention that a stateful Flink job can join a live event stream with a slowly-changing reference dataset (e.g., user risk scores from a database) using Flink’s broadcast state pattern.
- Kafka Streams is an alternative to Flink for lighter pipelines — embedded in a microservice, no separate cluster. Trade-off: less powerful windowing and state management than Flink.
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "What is the difference between Kafka Streams and Apache Flink?",
      "acceptedAnswer": { "@type": "Answer", "text": "Kafka Streams is a lightweight Java library embedded in your microservice — no separate cluster, no new infrastructure. It reads from Kafka, processes events, writes back to Kafka. Best for: simple transformations, per-event processing, stateful aggregations over small state. Apache Flink is a full distributed processing framework with its own cluster. It supports complex windowing (tumbling, sliding, session), large state (RocksDB-backed), sophisticated exactly-once semantics with 2PC sinks, and streaming joins. Best for: large-scale pipelines, complex event processing, sub-second aggregation windows, and joining multiple streams. If you need a quick pipeline inside a microservice: Kafka Streams. If you need a dedicated, high-throughput, stateful streaming system: Flink." }
    },
    {
      "@type": "Question",
      "name": "How does Apache Flink handle late-arriving events?",
      "acceptedAnswer": { "@type": "Answer", "text": "Flink uses watermarks to track event-time progress. A watermark W(t) asserts that all events with timestamp ≤ t have arrived. Flink generates watermarks periodically (e.g., every 200ms), typically with a configured maximum out-of-orderness (e.g., \"events arrive at most 10 seconds late\"). When a watermark passes a window's end time, the window fires and emits its result. Events arriving after their window's watermark are \"late.\" You can configure: (1) drop late events (default), (2) allowed lateness — keep the window state open for N extra seconds and recompute on each late arrival, or (3) side output — route late events to a separate stream for offline reprocessing. The right choice depends on your latency tolerance and whether undercounting is acceptable." }
    },
    {
      "@type": "Question",
      "name": "How do you achieve exactly-once processing in a Kafka + Flink pipeline?",
      "acceptedAnswer": { "@type": "Answer", "text": "Exactly-once requires coordination across three components: (1) Kafka source — Flink reads with read_committed isolation and commits offsets only after checkpoints succeed. (2) Flink state — checkpointing atomically snapshots operator state to S3/HDFS every N seconds (configurable). (3) Sink — use a transactional sink implementing Flink's TwoPhaseCommitSinkFunction. On checkpoint: pre-commit (open a transaction on the sink). On checkpoint complete: commit the transaction. On failure: restart from the last checkpoint; the uncommitted transaction is aborted. Supported sinks: Kafka (using producer transactions), PostgreSQL (via JDBC XA transactions). For eventually-consistent sinks (Cassandra), use idempotent writes with a deduplication key derived from the Flink checkpoint ID + record position." }
    }
  ]
}