What is the difference between tumbling and sliding window semantics in stream processing?

A tumbling window divides time into non-overlapping, fixed-size buckets: every event belongs to exactly one window and results are emitted once per bucket. A sliding window advances by a step smaller than its size, so consecutive windows overlap and each event may appear in multiple windows. Tumbling windows are cheaper because each event updates a single aggregate per key, while sliding windows provide smoother, more granular aggregations at the cost of roughly size/slide aggregates per key.

How do event-time watermarks handle late-arriving data in a stream processor?

The processor tracks the maximum observed event timestamp and subtracts a configured allowed-lateness duration to compute the current watermark. When the watermark advances past a window's end time, that window is finalized and its result emitted. Events arriving after the watermark has passed their window are either dropped, routed to a side-output late-data stream, or used to issue a corrective update, depending on the operator's late-data policy.

How is exactly-once processing achieved with idempotent sinks?

The processor checkpoints its source offsets and operator state atomically. On restart it replays from the last checkpoint, which may re-deliver records to the sink. An idempotent sink handles this by keying writes on a deterministic record ID derived from the source offset or event key — writing the same record twice produces the same final state as writing it once, so replays are safe without distributed two-phase commit.
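
The replay-safety argument above can be sketched as follows. This is a minimal illustration, not a real connector: `IdempotentSink` is a hypothetical in-memory stand-in for a store with upsert semantics, and the `partition:offset` record ID format is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class IdempotentSink:
    """Hypothetical in-memory stand-in for a keyed store that supports upserts."""
    rows: dict = field(default_factory=dict)

    def write(self, partition: int, offset: int, value: dict) -> None:
        # Deterministic record ID derived from the source position (assumed format).
        record_id = f"{partition}:{offset}"
        # Upsert: a replayed record overwrites itself with identical content.
        self.rows[record_id] = value


sink = IdempotentSink()
batch = [(0, 41, {"user": "a", "count": 3}), (0, 42, {"user": "b", "count": 1})]

for partition, offset, value in batch:
    sink.write(partition, offset, value)
state_after_first_delivery = dict(sink.rows)

# Simulate a restart that replays the same records from the last checkpoint.
for partition, offset, value in batch:
    sink.write(partition, offset, value)
```

After the replay, `sink.rows` is unchanged. This only holds if the value written for a given record ID is itself deterministic, so the operator must produce the same output when it reprocesses the same input.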

Why is RocksDB used as the state backend in stream processing engines like Flink?

RocksDB is an embedded LSM-tree key-value store that keeps hot state in memory and spills cold state to disk, allowing stateful operators to maintain far more state than fits in the JVM heap without out-of-memory errors. It supports incremental checkpointing, meaning only the SST files created since the previous checkpoint are uploaded to durable storage rather than the full state snapshot, which reduces checkpoint I/O significantly.
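
Because SST files are immutable once written, an incremental checkpoint can be planned as a set difference over file names. The sketch below illustrates that idea only; `plan_incremental_upload` and the file names are hypothetical and do not use any RocksDB API.

```python
def plan_incremental_upload(previous: set, current: set) -> dict:
    """Compare the SST files of two consecutive checkpoints (hypothetical helper)."""
    return {
        # Files that did not exist at the previous checkpoint must be uploaded.
        "upload": sorted(current - previous),
        # Files already in durable storage are referenced, not re-uploaded.
        "reuse": sorted(current & previous),
        # Files the new checkpoint no longer references can be released
        # once older checkpoints that use them are retired.
        "release": sorted(previous - current),
    }


checkpoint_1 = {"000012.sst", "000013.sst", "000014.sst"}
# Assume compaction merged 12 and 13 into 15, and a memtable flush produced 16.
checkpoint_2 = {"000014.sst", "000015.sst", "000016.sst"}

plan = plan_incremental_upload(checkpoint_1, checkpoint_2)
```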

Stream Processing Engine Low-Level Design: Windowing, Watermarks, and Exactly-Once Semantics

What Is a Stream Processing Engine?

A stream processing engine consumes continuous event streams, applies stateful or stateless transformations, and produces derived output streams or database updates in near real time. Unlike batch processing, stream engines must handle out-of-order events, maintain correctness guarantees across failures, and make principled decisions about when to emit results for time-based aggregations.

Requirements

Functional Requirements

Ingest events from one or more source partitions (Kafka topics or similar).
Support tumbling windows (fixed, non-overlapping time intervals) and sliding windows (overlapping intervals).
Use event-time timestamps and watermarks to handle late-arriving events.
Emit window results when the watermark advances past the window close time.
Write output to sinks (Kafka, databases, or object storage) with exactly-once semantics.

Non-Functional Requirements

Process 500,000 events per second per operator with under 500 ms end-to-end latency for in-order events.
Recover from worker failure within 30 seconds without data loss or duplication.
State size per operator may reach hundreds of gigabytes; state must be spilled to disk efficiently.

Data Model

Event

event_id (UUID or source partition + offset pair)
event_time (timestamp embedded in the payload by the producer)
ingestion_time (timestamp when the engine received the event)
key (string used for partitioning and keyed state lookup)
payload (bytes or structured map)
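
One way to express the event fields above as a type is sketched here. The field names come from the list; the epoch-millisecond units and the `partition:offset` string format for `event_id` are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Union


@dataclass(frozen=True)
class Event:
    event_id: str          # UUID or a "partition:offset" string (assumed format)
    event_time: int        # epoch milliseconds, set by the producer (assumed unit)
    ingestion_time: int    # epoch milliseconds, set when the engine receives the event
    key: str               # partitioning and keyed-state lookup key
    payload: Union[bytes, Mapping[str, Any]]


def event_id_from_offset(partition: int, offset: int) -> str:
    """Build a deterministic event ID from the source position."""
    return f"{partition}:{offset}"


example = Event(
    event_id=event_id_from_offset(3, 1024),
    event_time=1_700_000_000_000,
    ingestion_time=1_700_000_000_250,
    key="user-17",
    payload={"amount": 12},
)
```

Keeping `event_time` and `ingestion_time` separate lets the engine window on the former while still measuring ingestion delay from the difference between the two.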

Window State

Per-key window state is stored in a local RocksDB instance on each worker. The key is (operator_id, event_key, window_start_time) and the value is the accumulated aggregate (e.g. count, sum, or a serialized partial state object). RocksDB provides fast sequential writes and supports incremental checkpointing to object storage for recovery.
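
The composite key can be serialized so that byte order matches logical order, which keeps all windows of one key adjacent in the store. The layout below (4-byte operator ID, length-prefixed UTF-8 key, 8-byte window start, all big-endian) is an assumed encoding for illustration, not one prescribed by RocksDB.

```python
import struct


def encode_state_key(operator_id: int, event_key: str, window_start_ms: int) -> bytes:
    """Encode (operator_id, event_key, window_start_time) as ordered bytes."""
    key_bytes = event_key.encode("utf-8")
    # Big-endian fixed-width integers sort bytewise in numeric order
    # (for non-negative values).
    return (
        struct.pack(">IH", operator_id, len(key_bytes))
        + key_bytes
        + struct.pack(">Q", window_start_ms)
    )


def decode_state_key(raw: bytes):
    """Inverse of encode_state_key."""
    operator_id, key_len = struct.unpack_from(">IH", raw, 0)
    key_start = struct.calcsize(">IH")
    event_key = raw[key_start:key_start + key_len].decode("utf-8")
    (window_start_ms,) = struct.unpack_from(">Q", raw, key_start + key_len)
    return operator_id, event_key, window_start_ms


earlier = encode_state_key(1, "user-17", 60_000)
later = encode_state_key(1, "user-17", 120_000)
```

With this layout, a prefix scan over `(operator_id, event_key)` visits that key's windows in start-time order, which is convenient when firing windows as the watermark advances.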

Core Algorithms

Watermark Propagation

A watermark W(t) asserts that no future event with event_time less than t is expected to arrive. Watermarks are generated at source operators as the maximum observed event_time minus a configured allowed lateness (e.g. 5 seconds). An operator with several inputs sets its watermark to the minimum watermark across all input partitions, so no window fires until every input has progressed past it. The trade-off is that a single slow or idle partition holds the combined watermark back, which is why an idle-source timeout is needed (see Failure Modes).
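
A minimal sketch of per-partition watermark generation and the minimum-based merge follows. The class and function names are hypothetical, and timestamps are assumed to be integer milliseconds.

```python
class PartitionWatermark:
    """Tracks max event time for one partition and derives its watermark."""

    def __init__(self, allowed_lateness_ms: int):
        self.allowed_lateness_ms = allowed_lateness_ms
        self.max_event_time = None

    def observe(self, event_time_ms: int) -> None:
        # Out-of-order events never move the maximum backwards.
        if self.max_event_time is None or event_time_ms > self.max_event_time:
            self.max_event_time = event_time_ms

    def current(self):
        if self.max_event_time is None:
            return None  # no events seen yet, so no watermark
        return self.max_event_time - self.allowed_lateness_ms


def combined_watermark(partitions):
    """Minimum across inputs; undefined until every input has a watermark."""
    marks = [p.current() for p in partitions]
    if any(m is None for m in marks):
        return None
    return min(marks)


p0 = PartitionWatermark(allowed_lateness_ms=5_000)
p1 = PartitionWatermark(allowed_lateness_ms=5_000)
for t in (10_000, 12_000, 11_000):
    p0.observe(t)
p1.observe(8_000)
```

Here `p0` reports 7,000 and `p1` reports 3,000, so the combined watermark is 3,000: the slowest partition determines how far downstream windows may advance.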

When the watermark advances past a window close time, the engine fires the window: it reads the accumulated state for that window, emits the result to the output sink, and clears the state. Late events that arrive after the window fires are either dropped or added to a side output stream for separate handling.
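
The fire-and-clear behavior can be sketched with a small tumbling count operator. This is an illustrative model under stated assumptions: state lives in a plain dict rather than RocksDB, a window `[start, start + size)` fires once the watermark reaches its end, and late events go to a side-output list. All names are hypothetical.

```python
from collections import defaultdict


class TumblingCountOperator:
    def __init__(self, size_ms: int):
        self.size_ms = size_ms
        self.state = defaultdict(int)   # (key, window_start) -> count
        self.watermark = float("-inf")
        self.emitted = []               # fired results: (key, window_start, count)
        self.late = []                  # side output for late events

    def on_event(self, key: str, event_time_ms: int) -> None:
        start = (event_time_ms // self.size_ms) * self.size_ms
        if start + self.size_ms <= self.watermark:
            # The window has already fired, so route the event to the side output.
            self.late.append((key, event_time_ms))
            return
        self.state[(key, start)] += 1

    def on_watermark(self, watermark_ms: int) -> None:
        # Watermarks never move backwards.
        self.watermark = max(self.watermark, watermark_ms)
        ready = sorted(k for k in self.state if k[1] + self.size_ms <= self.watermark)
        for k in ready:
            # Emit the aggregate and clear its state in one step.
            self.emitted.append((k[0], k[1], self.state.pop(k)))


op = TumblingCountOperator(size_ms=10_000)
op.on_event("a", 1_000)
op.on_event("a", 9_000)
op.on_event("a", 12_000)
op.on_watermark(10_000)   # closes window [0, 10000)
op.on_event("a", 5_000)   # arrives after its window fired
```

After this sequence the operator has emitted `("a", 0, 2)`, placed the event at 5,000 in the late side output, and still holds the open window starting at 10,000.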

Tumbling vs Sliding Windows

For a tumbling window of size T, each event belongs to exactly one window: window_start = floor(event_time / T) * T. For a sliding window with size T and slide S, window start times are multiples of S, and an event belongs to every window whose start time w satisfies w <= event_time < w + T. That is exactly T/S windows when S divides T, and at most ceil(T/S) otherwise. The engine uses this assignment function to update multiple window states per event during sliding window processing.
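
The assignment functions can be written directly from these definitions. The sketch assumes integer timestamps and window starts aligned to multiples of the slide; the function names are hypothetical.

```python
def tumbling_window_start(event_time: int, size: int) -> int:
    """Start of the single tumbling window containing event_time."""
    return (event_time // size) * size


def sliding_window_starts(event_time: int, size: int, slide: int) -> list:
    """All slide-aligned starts w with w <= event_time < w + size."""
    # Latest aligned start at or before the event.
    start = (event_time // slide) * slide
    starts = []
    # Walk backwards while the window [start, start + size) still covers the event.
    while start > event_time - size:
        starts.append(start)
        start -= slide
    return sorted(starts)


tumbling = tumbling_window_start(17, size=10)
sliding = sliding_window_starts(17, size=10, slide=5)
```

With size 10 and slide 5, an event at time 17 lands in the windows starting at 10 and 15, so the engine updates two window states for it.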

Exactly-Once Semantics

The engine implements a distributed snapshot algorithm (Chandy-Lamport variant). At regular intervals, the coordinator injects barrier markers into each source partition. When an operator has received the barrier on all of its input partitions, it snapshots its local RocksDB state to object storage and acknowledges the checkpoint. Output writes are wrapped in transactions: the sink records and the source offsets are committed atomically. On recovery, the engine restores from the last complete checkpoint and replays source events from the committed offsets, so each event is reflected in state and output exactly once.
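
The barrier step can be illustrated with a toy operator that sums integers from several inputs. This is a simplified sketch of aligned barriers only: inputs that have already delivered the barrier are buffered until the remaining inputs catch up, so the snapshot contains exactly the pre-barrier records. The class is hypothetical, and snapshots are kept in a dict rather than written to object storage.

```python
class AligningOperator:
    def __init__(self, input_channels):
        self.input_channels = set(input_channels)
        self.total = 0          # operator state: running sum
        self.blocked = set()    # channels whose barrier has arrived
        self.buffered = []      # post-barrier records held back during alignment
        self.snapshots = {}     # checkpoint_id -> state at the barrier

    def on_record(self, channel: int, value: int) -> None:
        if channel in self.blocked:
            # This record belongs to the next checkpoint epoch.
            self.buffered.append(value)
        else:
            self.total += value

    def on_barrier(self, channel: int, checkpoint_id: int) -> None:
        self.blocked.add(channel)
        if self.blocked == self.input_channels:
            # All inputs aligned: snapshot, then resume with the buffered records.
            self.snapshots[checkpoint_id] = self.total
            self.blocked.clear()
            pending, self.buffered = self.buffered, []
            for value in pending:
                self.total += value


op = AligningOperator(input_channels=[0, 1])
op.on_record(0, 1)
op.on_barrier(0, checkpoint_id=7)
op.on_record(0, 10)   # buffered: arrives after channel 0's barrier
op.on_record(1, 2)    # still pre-barrier on channel 1
op.on_barrier(1, checkpoint_id=7)
```

The snapshot for checkpoint 7 records a sum of 3, while the live state moves on to 13 once the buffered record is applied.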

Scalability

The job is divided into a dataflow graph of operators, each assigned to one or more parallel subtasks. Kafka partitions are distributed across subtask instances; each subtask processes a subset of keys. State is keyed, so all events for the same key always route to the same subtask, making state lookups purely local. Adding parallelism requires repartitioning state, which is handled through a rescale operation that redistributes RocksDB key ranges across the new task slots.
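
One common way to make key routing stable across rescaling is to hash keys into a fixed number of key groups and assign contiguous group ranges to subtasks. The sketch below assumes that scheme; `MAX_KEY_GROUPS` and the function names are hypothetical, and SHA-256 is used only because Python's built-in `hash()` is randomized per process for strings.

```python
import hashlib

MAX_KEY_GROUPS = 128  # hypothetical fixed upper bound on parallelism


def key_group(key: str) -> int:
    """Stable key -> key group mapping, independent of parallelism."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % MAX_KEY_GROUPS


def subtask_for_key_group(group: int, parallelism: int) -> int:
    """Assign contiguous ranges of key groups to subtasks."""
    return group * parallelism // MAX_KEY_GROUPS


def subtask_for(key: str, parallelism: int) -> int:
    return subtask_for_key_group(key_group(key), parallelism)


before = subtask_for("user-17", parallelism=4)
after = subtask_for("user-17", parallelism=8)
```

Because the key group of a key never changes, rescaling only moves whole key groups between subtasks, and each subtask can restore the state ranges for the groups it now owns.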

API Design

POST /jobs — submit a job definition (operator graph, source/sink configuration, window parameters, watermark lag).
GET /jobs/{job_id} — return job status, current watermark per source partition, and operator-level throughput metrics.
POST /jobs/{job_id}/savepoint — trigger a manual savepoint for migration or upgrade.
DELETE /jobs/{job_id} — cancel a running job; optionally take a final savepoint before termination.
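
As an illustration of what a POST /jobs body might contain, the sketch below builds a job definition and runs basic checks before serializing it. Every field name in the payload is hypothetical; the endpoints above only specify that the definition carries the operator graph, source/sink configuration, window parameters, and watermark lag.

```python
import json

# Hypothetical job definition; the schema is an assumption for this example.
job_definition = {
    "operators": [
        {"id": "source", "type": "kafka_source", "topic": "clicks"},
        {"id": "count_per_user", "type": "window_aggregate", "input": "source"},
        {"id": "sink", "type": "kafka_sink", "topic": "click_counts",
         "input": "count_per_user"},
    ],
    "window": {"kind": "sliding", "size_ms": 60_000, "slide_ms": 10_000},
    "watermark_lag_ms": 5_000,
}


def validate(job: dict) -> list:
    """Return a list of human-readable problems; empty means the job looks sane."""
    errors = []
    window = job["window"]
    if window["kind"] not in ("tumbling", "sliding"):
        errors.append("window.kind must be 'tumbling' or 'sliding'")
    if window["size_ms"] <= 0:
        errors.append("window.size_ms must be positive")
    if window["kind"] == "sliding" and not 0 < window["slide_ms"] <= window["size_ms"]:
        errors.append("window.slide_ms must be in (0, size_ms]")
    if job["watermark_lag_ms"] < 0:
        errors.append("watermark_lag_ms must be non-negative")
    known_ids = {op["id"] for op in job["operators"]}
    for op in job["operators"]:
        if "input" in op and op["input"] not in known_ids:
            errors.append(f"operator {op['id']} reads from unknown input {op['input']}")
    return errors


problems = validate(job_definition)
request_body = json.dumps(job_definition)
```

Validating on submission keeps malformed window parameters from reaching the workers, where they would surface only as runtime failures.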

Failure Modes

Worker crash: The coordinator detects the missing heartbeat, reassigns the failed worker's tasks to surviving or newly started workers, and restores their state from the last checkpoint. Events processed since that checkpoint are replayed from the source.
Slow partition holding back watermark: A configurable idle-source timeout allows the engine to advance the watermark even when a partition has not produced events for a configured duration, preventing indefinite blocking.
State too large for available memory: RocksDB spills cold state to local SSD automatically. If disk is also exhausted, the job fails and must be restarted with more resources or with settings that shrink the state footprint, such as more aggressive compaction.
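
The idle-source timeout can be sketched as a variant of the minimum-watermark merge that skips partitions which have been silent for too long. The class name, the millisecond units, and the explicit `now_ms` arguments (used instead of a real clock so the example is deterministic) are assumptions.

```python
class IdleAwareWatermarks:
    def __init__(self, partitions, idle_timeout_ms: int, start_ms: int):
        self.idle_timeout_ms = idle_timeout_ms
        self.watermark = {p: None for p in partitions}
        # Treat job start as the last activity so new partitions get a grace period.
        self.last_seen = {p: start_ms for p in partitions}

    def update(self, partition: int, watermark_ms: int, now_ms: int) -> None:
        self.watermark[partition] = watermark_ms
        self.last_seen[partition] = now_ms

    def combined(self, now_ms: int):
        # Only partitions active within the timeout take part in the minimum.
        active = [p for p in self.watermark
                  if now_ms - self.last_seen[p] <= self.idle_timeout_ms]
        if not active:
            return None
        marks = [self.watermark[p] for p in active]
        if any(m is None for m in marks):
            return None  # an active partition has not reported yet
        return min(marks)


tracker = IdleAwareWatermarks([0, 1], idle_timeout_ms=30_000, start_ms=0)
tracker.update(0, 7_000, now_ms=1_000)
tracker.update(1, 3_000, now_ms=1_000)
while_both_active = tracker.combined(now_ms=2_000)

# Partition 1 goes silent; partition 0 keeps advancing.
tracker.update(0, 50_000, now_ms=40_000)
after_timeout = tracker.combined(now_ms=40_000)
```

While both partitions are active the combined watermark is 3,000. Once partition 1 has been silent for longer than the timeout, it is excluded and the watermark jumps to 50,000. The cost is that events which later arrive on the idle partition may already be behind the watermark and will be treated as late.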

Observability

Monitor watermark lag per source partition, window firing latency relative to watermark, checkpoint duration and size, operator throughput in events per second, and backpressure propagation through the operator graph. Alert when watermark lag (wall-clock time minus the current watermark) stays well above the configured allowed lateness, which indicates the engine is falling behind real time.