Lambda and Kappa architectures are data processing patterns for building systems that must handle both real-time (streaming) and historical (batch) data queries. Lambda architecture uses separate batch and streaming layers that are reconciled. Kappa architecture simplifies this by processing everything through a single streaming layer. These patterns are central to building analytics platforms, recommendation systems, fraud detection, and any system that needs sub-second query responses over large historical datasets.
Lambda Architecture: Three Layers
Lambda architecture has three layers:
- Batch Layer: processes all historical data on a schedule (hourly/daily), recomputing views from scratch with MapReduce/Spark. Accurate but slow: results may be hours old.
- Speed Layer: processes recent data in real time (Flink, Spark Streaming, Kafka Streams), compensating for batch-layer latency with fresh but potentially approximate results.
- Serving Layer: merges batch and speed layer results at query time. A query returns: batch view (accurate, stale) + speed view (approximate, fresh) = complete answer. Example stores: HBase for batch views, Druid for speed views.
// Lambda Architecture query merge (sketch: batchServing, speedLayer,
// and batchCutoff are assumed package-level dependencies)
func QueryMetric(metric string, timeRange TimeRange) float64 {
	// Batch layer: accurate, but only covers up to the last batch run
	batchResult := batchServing.Query(metric, timeRange.Start, batchCutoff)
	// Speed layer: covers from the batch cutoff to now (approximate)
	speedResult := speedLayer.Query(metric, batchCutoff, timeRange.End)
	// Merge: batch is the source of truth for the historical period;
	// speed fills the recency gap. Note that simple addition is only
	// valid for additive metrics (counts, sums); averages and
	// percentiles require mergeable partial aggregates instead.
	return batchResult + speedResult
}
// Problem: maintaining TWO codebases (batch + streaming)
// that must produce identical results is operationally expensive
Lambda Architecture Problems
Lambda architecture has two critical operational problems. (1) Code duplication: the same business logic must be implemented in both the batch layer (Spark) and the speed layer (Flink), kept in sync, and guaranteed to produce identical results; any divergence between them creates inconsistency. (2) Complexity: operating two completely separate systems (batch and streaming), each with its own tooling, debugging approach, and failure modes, is expensive. Jay Kreps's 2014 essay "Questioning the Lambda Architecture" articulated these operational costs and proposed the Kappa alternative.
Kappa Architecture: Stream Everything
Kappa architecture eliminates the batch layer entirely. All data goes through a streaming system (Kafka + Flink). Historical reprocessing is done by replaying the Kafka log from the beginning with a new job version — not by running a separate batch system. The serving layer is populated only from stream outputs. This eliminates code duplication (one streaming codebase for all processing) and reduces operational complexity. The trade-off: Kafka must retain all historical data (expensive for long history), and streaming jobs must handle exactly-once semantics for reprocessing.
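The reprocessing idea above can be sketched in a few lines of Go. This is a minimal illustration, not a real Kafka client: an in-memory append-only log stands in for a topic, and the `Log`, `Append`, and `ReplayFrom` names are hypothetical. The point is that a new job version does not need a batch system; it simply replays the same log from offset 0 into a fresh serving view.

```go
package main

import "fmt"

// Log is a stand-in for a Kafka topic: an append-only record sequence.
type Log struct{ records []int }

func (l *Log) Append(v int) { l.records = append(l.records, v) }

// ReplayFrom re-reads the log from a given offset, feeding every record
// to a processing function. Kappa reprocessing = deploy new job code,
// replay from offset 0, populate a fresh view, then swap views.
func (l *Log) ReplayFrom(offset int, process func(int)) {
	for _, r := range l.records[offset:] {
		process(r)
	}
}

func main() {
	log := &Log{}
	for _, v := range []int{3, 5, 7} {
		log.Append(v)
	}

	// Job v1: sum of all events.
	sumV1 := 0
	log.ReplayFrom(0, func(v int) { sumV1 += v })

	// Job v2 (new logic): count events. Same log, full replay,
	// new serving view — no separate batch pipeline.
	countV2 := 0
	log.ReplayFrom(0, func(v int) { countV2++ })

	fmt.Println(sumV1, countV2) // 15 3
}
```

In a real deployment the replay consumer would be a new Flink job (or a Kafka consumer group reset to the earliest offset), and the cutover happens only after the new view catches up to the head of the log.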
Modern Alternative: Streaming SQL + Data Lakehouse
The current best practice merges streaming and batch through a unified SQL layer. Tools:
- Apache Flink SQL: write one SQL query, execute it in either streaming or batch mode.
- Delta Lake / Apache Iceberg: ACID transactions on object storage, enabling streaming writes and batch reads on the same table.
- Materialize / RisingWave: streaming databases that maintain incrementally updated materialized views, answering historical queries from a stream-native store.
This converges Lambda and Kappa into one system with one codebase.
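The "incrementally updated materialized view" idea behind Materialize/RisingWave can be sketched as follows. This is a toy illustration with hypothetical names (`MaterializedView`, `Apply`, `Query`): instead of re-running an aggregation query over all history, each stream event folds into a pre-maintained result in O(1), so reads are instant.

```go
package main

import "fmt"

// MaterializedView maintains the result of a hypothetical
// "SELECT category, SUM(amount) GROUP BY category" incrementally,
// rather than recomputing it per query.
type MaterializedView struct {
	sums map[string]float64
}

func NewView() *MaterializedView {
	return &MaterializedView{sums: map[string]float64{}}
}

// Apply folds one stream event into the view (incremental update).
func (v *MaterializedView) Apply(category string, amount float64) {
	v.sums[category] += amount
}

// Query reads the pre-maintained result: no scan, no batch/speed merge.
func (v *MaterializedView) Query(category string) float64 {
	return v.sums[category]
}

func main() {
	v := NewView()
	v.Apply("ads", 10)
	v.Apply("ads", 5)
	v.Apply("search", 2)
	fmt.Println(v.Query("ads")) // 15
}
```

Real streaming databases generalize this far beyond sums (joins, retractions, windowing), but the contrast with the Lambda merge-at-query-time pattern is the same: the work moves from query time to write time.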
Key Interview Discussion Points
- Latency requirements determine architecture: sub-second queries require streaming; minute-level latency can use micro-batching; hour-level latency can use pure batch
- Reprocessing in Kappa: requires enough Kafka retention to cover the reprocessing window — for 2 years of history at 1TB/day, Kafka retention alone is 730TB
- Exactly-once streaming: Flink checkpointing + Kafka transactional producers enable exactly-once processing for reprocessing consistency
- Serving layer options: Druid (real-time OLAP), ClickHouse (columnar OLAP), Redis (pre-aggregated results), Pinot (low-latency analytics at LinkedIn/Uber)
- Cost trade-off: streaming (Flink cluster) is more expensive than batch (Spark on-demand) for infrequently queried historical data
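The Kappa retention figure in the bullets above is worth being able to reproduce on a whiteboard. A minimal back-of-envelope check:

```go
package main

import "fmt"

// Kafka retention needed to reprocess 2 years of history
// at an ingest rate of 1 TB/day.
func main() {
	const tbPerDay = 1.0
	const days = 2 * 365
	retention := tbPerDay * days
	fmt.Printf("%.0f TB\n", retention) // 730 TB
}
```

Note this is raw log size before replication; with a typical replication factor of 3, the actual storage footprint is roughly three times that, which is why tiered storage (offloading old segments to object storage) is often part of the Kappa answer.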