What Is a Stream Processing Engine?
A stream processing engine consumes continuous event streams, applies stateful or stateless transformations, and produces derived output streams or database updates in near real time. Unlike batch systems, stream engines must handle out-of-order events, maintain correctness guarantees across failures, and make principled decisions about when to emit results for time-based aggregations.
Requirements
Functional Requirements
- Ingest events from one or more source partitions (Kafka topics or similar).
- Support tumbling windows (fixed, non-overlapping time intervals) and sliding windows (overlapping intervals).
- Use event-time timestamps and watermarks to handle late-arriving events.
- Emit window results when the watermark advances past the window close time.
- Write output to sinks (Kafka, databases, or object storage) with exactly-once semantics.
Non-Functional Requirements
- Process 500,000 events per second per operator with under 500 ms end-to-end latency for in-order events.
- Recover from worker failure within 30 seconds without data loss or duplication.
- State size per operator may reach hundreds of gigabytes; state must be spilled to disk efficiently.
Data Model
Event
- event_id (UUID or source partition + offset pair)
- event_time (timestamp embedded in the payload by the producer)
- ingestion_time (timestamp when the engine received the event)
- key (string used for partitioning and keyed state lookup)
- payload (bytes or structured map)
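The fields above map onto a small record type. The following is a minimal sketch in Python; the class name, the field types, the use of epoch milliseconds for timestamps, and the `source_event_id` helper are assumptions made for illustration rather than part of the design itself.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Event:
    event_id: str                # UUID, or "partition:offset" for source-derived ids
    event_time: int              # producer timestamp, epoch milliseconds (assumed unit)
    ingestion_time: int          # engine receive timestamp, epoch milliseconds
    key: str                     # partitioning and keyed-state lookup key
    payload: Union[bytes, dict]  # raw bytes or a structured map


def source_event_id(partition: int, offset: int) -> str:
    """Build a deterministic id from a source partition and offset (hypothetical helper)."""
    return f"{partition}:{offset}"


e = Event(source_event_id(3, 1042), 1_700_000_000_000, 1_700_000_000_120, "user-17", b"{}")
```

A deterministic id derived from partition and offset stays the same when an event is replayed after recovery, which a random UUID assigned at ingestion would not.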
Window State
Per-key window state is stored in a local RocksDB instance on each worker. The key is (operator_id, event_key, window_start_time) and the value is the accumulated aggregate (e.g. count, sum, or a serialized partial state object). RocksDB provides fast sequential writes and supports incremental checkpointing to object storage for recovery.
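One way to lay out the composite key as bytes is sketched below. The encoding is an assumption, not something the design prescribes: a 32-bit operator id, a length-prefixed UTF-8 event key, and a 64-bit window start in milliseconds, all big-endian so that byte order matches numeric order for non-negative values. The function names are hypothetical.

```python
import struct


def encode_state_key(operator_id: int, event_key: str, window_start_ms: int) -> bytes:
    """Serialize (operator_id, event_key, window_start_time) into one byte string."""
    key_bytes = event_key.encode("utf-8")
    return (
        struct.pack(">I", operator_id)        # 4-byte operator id
        + struct.pack(">H", len(key_bytes))   # 2-byte length prefix for the event key
        + key_bytes
        + struct.pack(">Q", window_start_ms)  # 8-byte window start
    )


def decode_state_key(raw: bytes):
    """Inverse of encode_state_key."""
    (operator_id,) = struct.unpack_from(">I", raw, 0)
    (key_len,) = struct.unpack_from(">H", raw, 4)
    event_key = raw[6:6 + key_len].decode("utf-8")
    (window_start_ms,) = struct.unpack_from(">Q", raw, 6 + key_len)
    return operator_id, event_key, window_start_ms
```

With this layout, all windows for one (operator, event key) pair share a byte prefix and sort by window start, so a prefix scan over a sorted store would visit them in time order.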
Core Algorithms
Watermark Propagation
A watermark W(t) asserts that no future event with event_time less than t is expected to arrive. Watermarks are generated at source operators as the maximum observed event_time minus a configured allowed lateness (e.g. 5 seconds). At each processing step, an operator's watermark is the minimum watermark across all of its input partitions, so a window never fires while any input may still deliver events for it. The cost is that a single slow or idle partition holds back the watermark for the whole operator; the idle-source timeout described under Failure Modes bounds that delay.
When the watermark advances past a window close time, the engine fires the window: it reads the accumulated state for that window, emits the result to the output sink, and clears the state. Late events that arrive after the window fires are either dropped or added to a side output stream for separate handling.
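The generation and firing rules can be sketched as follows. This is a simplified, single-process illustration: `WatermarkTracker`, `fire_ready_windows`, and the in-memory dict standing in for RocksDB are all hypothetical names, and the sketch assumes a window [start, start + size) is complete once the watermark reaches start + size.

```python
class WatermarkTracker:
    """Tracks the max event time per partition and derives a combined watermark."""

    def __init__(self, partitions, lateness_ms):
        self.lateness_ms = lateness_ms
        self.max_event_time = {p: None for p in partitions}

    def observe(self, partition, event_time_ms):
        current = self.max_event_time[partition]
        if current is None or event_time_ms > current:
            self.max_event_time[partition] = event_time_ms

    def watermark(self):
        # No watermark until every partition has produced at least one event.
        if any(t is None for t in self.max_event_time.values()):
            return None
        # Minimum across partitions, minus the configured lateness.
        return min(self.max_event_time.values()) - self.lateness_ms


def fire_ready_windows(state, watermark, window_size_ms):
    """Emit and clear every window whose close time is at or before the watermark.

    state maps (event_key, window_start_ms) -> aggregate.
    """
    ready = sorted(k for k in state if k[1] + window_size_ms <= watermark)
    return [(k, state.pop(k)) for k in ready]
```

Because fired windows are removed from the state, an event that later maps to one of them finds no entry, which is the point where a real engine would drop it or route it to the side output.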
Tumbling vs Sliding Windows
For a tumbling window of size T, each event belongs to exactly one window: window_start = floor(event_time / T) * T. For a sliding window with size T and slide S, window start times are multiples of S, and an event belongs to every window whose start time w satisfies w <= event_time < w + T. That is T/S windows when S divides T, and at most ceil(T/S) otherwise. The engine uses this assignment function to update multiple window states per event during sliding window processing.
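Both assignment functions are short enough to write out. The sketch below assumes integer timestamps and window starts aligned to multiples of the window size (tumbling) or the slide (sliding); the function names are illustrative.

```python
def tumbling_window_start(event_time, size):
    """Start of the single tumbling window containing event_time."""
    return (event_time // size) * size


def sliding_window_starts(event_time, size, slide):
    """Starts of all sliding windows w with w <= event_time < w + size, newest first."""
    w = (event_time // slide) * slide  # latest window start at or before the event
    starts = []
    while w > event_time - size:       # equivalent to event_time < w + size
        starts.append(w)
        w -= slide
    return starts
```

For example, with size 10 and slide 5, an event at time 25 falls into the windows starting at 25 and 20. With size 10 and slide 4, an event at time 7 falls into two windows (starting at 4 and 0), while an event at time 8 falls into three (8, 4, and 0).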
Exactly-Once Semantics
The engine implements a distributed snapshot algorithm (Chandy-Lamport variant). At regular intervals, the coordinator injects barrier markers into each source partition. When an operator has received the barrier on all of its inputs, it checkpoints its local RocksDB state to object storage, forwards the barrier downstream, and acknowledges the checkpoint to the coordinator. A checkpoint is complete once every operator has acknowledged it. Output writes are wrapped in transactions: the sink record and the source offset are committed atomically. On recovery, the engine restores from the last complete checkpoint and replays source events from the committed offsets, so each event is reflected in state and output exactly once.
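The barrier handling inside one operator can be illustrated with a toy counter. This is a sketch under strong simplifying assumptions: a single in-flight checkpoint, an in-memory dict in place of RocksDB and object storage, and no downstream forwarding. `AligningOperator` and its methods are hypothetical names.

```python
import copy


class AligningOperator:
    """Counts events per key and snapshots state once all inputs deliver a barrier."""

    def __init__(self, input_channels):
        self.inputs = set(input_channels)
        self.counts = {}            # keyed state: key -> running count
        self.barriers_seen = set()  # inputs that already delivered the current barrier
        self.buffered = []          # events held back while waiting for other inputs
        self.snapshots = {}         # checkpoint_id -> copy of the state

    def on_event(self, channel, key):
        if channel in self.barriers_seen:
            # This event belongs after the checkpoint; hold it until alignment completes.
            self.buffered.append((channel, key))
        else:
            self.counts[key] = self.counts.get(key, 0) + 1

    def on_barrier(self, channel, checkpoint_id):
        self.barriers_seen.add(channel)
        if self.barriers_seen == self.inputs:
            # All inputs aligned: state reflects exactly the pre-barrier events.
            self.snapshots[checkpoint_id] = copy.deepcopy(self.counts)
            self.barriers_seen.clear()
            pending, self.buffered = self.buffered, []
            for ch, key in pending:
                self.on_event(ch, key)


op = AligningOperator(["a", "b"])
op.on_event("a", "k")
op.on_barrier("a", 1)
op.on_event("a", "k")  # arrives after a's barrier, so it is buffered
op.on_event("b", "k")  # b has not sent its barrier yet, so it is counted
op.on_barrier("b", 1)  # snapshot is taken, then the buffered event is applied
```

Holding back post-barrier events is what keeps the snapshot consistent: the saved count includes every event that preceded the barrier on each input and none that followed it.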
Scalability
The job is divided into a dataflow graph of operators, each assigned to one or more parallel subtasks. Kafka partitions are distributed across subtask instances; each subtask processes a subset of keys. State is keyed, so all events for the same key always route to the same subtask, making state lookups purely local. Adding parallelism requires repartitioning state, which is handled through a rescale operation that redistributes RocksDB key ranges across the new task slots.
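A common way to make rescaling tractable is to hash keys into a fixed number of key groups and assign contiguous ranges of groups to subtasks. The sketch below assumes that scheme; the constant `MAX_PARALLELISM`, the choice of CRC32 as a stable hash, and the function names are illustrative assumptions, not details given by the design above.

```python
import zlib

MAX_PARALLELISM = 128  # number of key groups, fixed for the lifetime of the job (assumed)


def key_group(key: str) -> int:
    """Stable hash of the key into a key group; independent of current parallelism."""
    return zlib.crc32(key.encode("utf-8")) % MAX_PARALLELISM


def subtask_for_group(group: int, parallelism: int) -> int:
    """Map a key group to a subtask so that each subtask owns a contiguous range."""
    return group * parallelism // MAX_PARALLELISM


def subtask_for(key: str, parallelism: int) -> int:
    return subtask_for_group(key_group(key), parallelism)
```

Since a key's group never changes, rescaling only moves whole groups between subtasks, and each subtask's new state is a contiguous range of groups.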
API Design
- POST /jobs — submit a job definition (operator graph, source/sink configuration, window parameters, watermark lag).
- GET /jobs/{job_id} — return job status, current watermark per source partition, and operator-level throughput metrics.
- POST /jobs/{job_id}/savepoint — trigger a manual savepoint for migration or upgrade.
- DELETE /jobs/{job_id} — cancel a running job; optionally take a final savepoint before termination.
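As an illustration of what a POST /jobs body might contain, the sketch below assembles a job definition as JSON. Every field name and operator type here is hypothetical; the endpoint list above only states that the definition carries the operator graph, source/sink configuration, window parameters, and watermark lag.

```python
import json


def build_job_definition(source_topic, sink_topic, window_size_ms, watermark_lag_ms):
    """Assemble a job definition body; all field names are illustrative assumptions."""
    return {
        "operators": [
            {"id": "source", "type": "kafka_source", "topic": source_topic},
            {"id": "count", "type": "tumbling_count", "inputs": ["source"]},
            {"id": "sink", "type": "kafka_sink", "topic": sink_topic, "inputs": ["count"]},
        ],
        "window": {"type": "tumbling", "size_ms": window_size_ms},
        "watermark_lag_ms": watermark_lag_ms,
    }


body = json.dumps(build_job_definition("clicks", "click-counts", 60_000, 5_000))
```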
Failure Modes
- Worker crash: The coordinator detects the missing heartbeat, reassigns the failed worker's tasks to surviving or newly started workers, and restores their state from the last complete checkpoint. Events processed since that checkpoint are replayed from the source.
- Slow partition holding back watermark: A configurable idle-source timeout allows the engine to advance the watermark even when a partition has not produced events for a configured duration, preventing indefinite blocking.
- State too large for available memory: RocksDB keeps only recently written and cached data in memory and persists the rest to local SSD automatically. If disk is also exhausted, the job fails and must be restarted with more resources or a compaction strategy.
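The idle-source timeout can be sketched as a variation on the minimum-watermark rule that ignores partitions with no recent arrivals. The function name, the shape of the `partitions` argument, and the decision to return None when every partition is idle are assumptions of this sketch.

```python
def watermark_with_idle_timeout(partitions, now_ms, idle_timeout_ms, lateness_ms):
    """Compute the combined watermark, skipping partitions that have gone idle.

    partitions maps partition id -> (max_event_time_ms, last_arrival_wall_clock_ms).
    """
    active = [
        max_event_time
        for max_event_time, last_arrival in partitions.values()
        if now_ms - last_arrival < idle_timeout_ms
    ]
    if not active:
        return None  # every source is idle; leave the watermark where it is
    return min(active) - lateness_ms


parts = {"p0": (50_000, 1_000_000), "p1": (20_000, 900_000)}
# p1 last delivered an event 100.5 s ago, so with a 60 s timeout it is ignored.
wm = watermark_with_idle_timeout(parts, 1_000_500, 60_000, 5_000)
```

The trade-off is that if an idle partition later resumes with older event times, those events may arrive behind the watermark and be treated as late.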
Observability
Monitor watermark lag per source partition, window firing latency relative to watermark, checkpoint duration and size, operator throughput in events per second, and backpressure propagation through the operator graph. Alert when watermark lag exceeds the configured allowed lateness threshold, which indicates the engine is falling behind real time.