Lambda and Kappa architectures are data processing patterns for building systems that must handle both real-time (streaming) and historical (batch) data queries. Lambda architecture uses separate batch and streaming layers that are reconciled. Kappa architecture simplifies this by processing everything through a single streaming layer. These patterns are central to building analytics platforms, recommendation systems, fraud detection, and any system that needs sub-second query responses over large historical datasets.
Lambda Architecture: Three Layers
Lambda architecture has three layers:
- Batch Layer: processes all historical data on a schedule (hourly/daily), recomputing views from scratch with MapReduce/Spark. Accurate but slow: results may be hours old.
- Speed Layer: processes recent data in real time (Flink, Spark Streaming, Kafka Streams), compensating for batch-layer latency with fresh but potentially approximate results.
- Serving Layer: merges batch and speed layer results at query time. A query returns: batch view (accurate, stale) + speed view (approximate, fresh) = complete answer. Example stores: HBase for batch views, Druid for speed views.
// Lambda Architecture query merge (sketch: batchServing, speedLayer,
// and batchCutoff are assumed package-level dependencies)
func QueryMetric(metric string, timeRange TimeRange) float64 {
	// Batch layer: accurate, but only covers up to the last batch run
	batchResult := batchServing.Query(metric, timeRange.Start, batchCutoff)
	// Speed layer: covers from the batch cutoff to now (approximate)
	speedResult := speedLayer.Query(metric, batchCutoff, timeRange.End)
	// Merge: batch is the source of truth for the historical period;
	// speed fills the recency gap. Note that simple addition is only
	// valid for additive metrics (counts, sums); averages and
	// percentiles require mergeable partial aggregates instead.
	return batchResult + speedResult
}
// Problem: maintaining TWO codebases (batch + streaming)
// that must produce identical results is operationally expensive
Lambda Architecture Problems
Lambda architecture has two critical operational problems. (1) Code duplication: the same business logic must be implemented in both the batch layer (Spark) and the speed layer (Flink), kept in sync, and guaranteed to produce identical results; any divergence between them creates inconsistency. (2) Complexity: operating two completely separate systems (batch and streaming), each with its own tooling, debugging approach, and failure modes, is expensive. Jay Kreps's 2014 essay "Questioning the Lambda Architecture" articulated these operational costs and proposed the Kappa alternative.
Kappa Architecture: Stream Everything
Kappa architecture eliminates the batch layer entirely. All data goes through a streaming system (Kafka + Flink). Historical reprocessing is done by replaying the Kafka log from the beginning with a new job version — not by running a separate batch system. The serving layer is populated only from stream outputs. This eliminates code duplication (one streaming codebase for all processing) and reduces operational complexity. The trade-off: Kafka must retain all historical data (expensive for long history), and streaming jobs must handle exactly-once semantics for reprocessing.
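The reprocessing idea above can be sketched in a few lines of Go. This is a minimal illustration, not a real Kafka client: an in-memory append-only log stands in for a topic, and the `Log`, `Append`, and `ReplayFrom` names are hypothetical. The point is that a new job version does not need a batch system; it simply replays the same log from offset 0 into a fresh serving view.

```go
package main

import "fmt"

// Log is a stand-in for a Kafka topic: an append-only record sequence.
type Log struct{ records []int }

func (l *Log) Append(v int) { l.records = append(l.records, v) }

// ReplayFrom re-reads the log from a given offset, feeding every record
// to a processing function. Kappa reprocessing = deploy new job code,
// replay from offset 0, populate a fresh view, then swap views.
func (l *Log) ReplayFrom(offset int, process func(int)) {
	for _, r := range l.records[offset:] {
		process(r)
	}
}

func main() {
	log := &Log{}
	for _, v := range []int{3, 5, 7} {
		log.Append(v)
	}

	// Job v1: sum of all events.
	sumV1 := 0
	log.ReplayFrom(0, func(v int) { sumV1 += v })

	// Job v2 (new logic): count events. Same log, full replay,
	// new serving view — no separate batch pipeline.
	countV2 := 0
	log.ReplayFrom(0, func(v int) { countV2++ })

	fmt.Println(sumV1, countV2) // 15 3
}
```

In a real deployment the replay consumer would be a new Flink job (or a Kafka consumer group reset to the earliest offset), and the cutover happens only after the new view catches up to the head of the log.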
Modern Alternative: Streaming SQL + Data Lakehouse
The current best practice merges streaming and batch through a unified SQL layer. Tools:
- Apache Flink SQL: write one SQL query, execute it in either streaming or batch mode.
- Delta Lake / Apache Iceberg: ACID transactions on object storage, enabling streaming writes and batch reads on the same table.
- Materialize / RisingWave: streaming databases that maintain incrementally updated materialized views, answering historical queries from a stream-native store.
This converges Lambda and Kappa into one system with one codebase.
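The "incrementally updated materialized view" idea behind Materialize/RisingWave can be sketched as follows. This is a toy illustration with hypothetical names (`MaterializedView`, `Apply`, `Query`): instead of re-running an aggregation query over all history, each stream event folds into a pre-maintained result in O(1), so reads are instant.

```go
package main

import "fmt"

// MaterializedView maintains the result of a hypothetical
// "SELECT category, SUM(amount) GROUP BY category" incrementally,
// rather than recomputing it per query.
type MaterializedView struct {
	sums map[string]float64
}

func NewView() *MaterializedView {
	return &MaterializedView{sums: map[string]float64{}}
}

// Apply folds one stream event into the view (incremental update).
func (v *MaterializedView) Apply(category string, amount float64) {
	v.sums[category] += amount
}

// Query reads the pre-maintained result: no scan, no batch/speed merge.
func (v *MaterializedView) Query(category string) float64 {
	return v.sums[category]
}

func main() {
	v := NewView()
	v.Apply("ads", 10)
	v.Apply("ads", 5)
	v.Apply("search", 2)
	fmt.Println(v.Query("ads")) // 15
}
```

Real streaming databases generalize this far beyond sums (joins, retractions, windowing), but the contrast with the Lambda merge-at-query-time pattern is the same: the work moves from query time to write time.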
Key Interview Discussion Points
- Latency requirements determine architecture: sub-second queries require streaming; minute-level latency can use micro-batching; hour-level latency can use pure batch
- Reprocessing in Kappa: requires enough Kafka retention to cover the reprocessing window — for 2 years of history at 1TB/day, Kafka retention alone is 730TB
- Exactly-once streaming: Flink checkpointing + Kafka transactional producers enable exactly-once processing for reprocessing consistency
- Serving layer options: Druid (real-time OLAP), ClickHouse (columnar OLAP), Redis (pre-aggregated results), Pinot (low-latency analytics at LinkedIn/Uber)
- Cost trade-off: streaming (Flink cluster) is more expensive than batch (Spark on-demand) for infrequently queried historical data
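The Kappa retention figure in the bullets above is worth being able to reproduce on a whiteboard. A minimal back-of-envelope check:

```go
package main

import "fmt"

// Kafka retention needed to reprocess 2 years of history
// at an ingest rate of 1 TB/day.
func main() {
	const tbPerDay = 1.0
	const days = 2 * 365
	retention := tbPerDay * days
	fmt.Printf("%.0f TB\n", retention) // 730 TB
}
```

Note this is raw log size before replication; with a typical replication factor of 3, the actual storage footprint is roughly three times that, which is why tiered storage (offloading old segments to object storage) is often part of the Kappa answer.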