MapReduce is a programming model for processing large datasets in parallel across a cluster of commodity machines. Introduced by Google in 2004, it inspired Hadoop MapReduce, Apache Spark, and the broader ecosystem of batch processing frameworks. Understanding MapReduce internals (data locality, shuffle, fault tolerance) is foundational for understanding how large-scale data processing systems work. Even though Spark and Flink have largely superseded Hadoop MapReduce, the Map, Shuffle, Reduce paradigm still appears in database query engines, distributed aggregations, and data pipeline design.
MapReduce Execution Model
A MapReduce job has two phases:
- Map phase: the input dataset is split into chunks (typically 64-128 MB HDFS blocks). Each Map task processes one chunk, calling the user-defined map() function on each record and emitting (key, value) pairs. Map tasks are scheduled on the machines that already hold the data blocks (data locality: move computation to the data, not data to the computation).
- Reduce phase: all (key, value) pairs with the same key are grouped and routed to the same Reduce task (shuffle + sort). Each Reduce task calls the user-defined reduce() function once per unique key, receiving all values for that key.

The shuffle (transferring map output to reducers) is typically the most network-intensive and expensive phase: without map-side combining, essentially all intermediate data crosses the network.
```python
# Word count: the canonical MapReduce example
# Input: text documents
# Output: word -> count

# Map function: emit (word, 1) for each word in the document.
# emit() is supplied by the framework, so this is pseudocode.
def map(document_id, text):
    for word in text.split():
        emit(word.lower(), 1)

# Map output example: ("the", 1), ("quick", 1), ("the", 1), ("fox", 1), ...

# Shuffle phase (framework handles this):
# groups all values by key and sends each group to one reducer:
# ("fox", [1, 1, 1]), ("quick", [1, 1]), ("the", [1, 1, 1, 1, 1])

# Reduce function: sum all counts for each word
def reduce(word, counts):
    emit(word, sum(counts))

# Reduce output: ("fox", 3), ("quick", 2), ("the", 5)

# Apache Spark equivalent (faster for multi-stage work thanks to
# in-memory processing):
# sc.textFile("hdfs://input")
#   .flatMap(lambda line: line.split())
#   .map(lambda word: (word.lower(), 1))
#   .reduceByKey(lambda a, b: a + b)
#   .saveAsTextFile("hdfs://output")
```
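The pseudocode above relies on a framework-provided emit(); as a sketch, the whole pipeline can be simulated in a single process, with the "shuffle" reduced to an in-memory grouping by key (a real framework distributes these steps across machines):

```python
from collections import defaultdict

def map_fn(document_id, text):
    """Map: emit (word, 1) for each word in the document."""
    return [(word.lower(), 1) for word in text.split()]

def shuffle(mapped_pairs):
    """Shuffle: group all values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_fn(word, counts):
    """Reduce: sum all counts for one word."""
    return (word, sum(counts))

def run_job(documents):
    """Run map over every document, shuffle, then reduce each group."""
    mapped = []
    for doc_id, text in documents.items():
        mapped.extend(map_fn(doc_id, text))
    grouped = shuffle(mapped)
    return dict(reduce_fn(w, c) for w, c in grouped.items())

docs = {"d1": "the quick fox", "d2": "the fox and the fox"}
print(run_job(docs))  # {'the': 3, 'quick': 1, 'fox': 3, 'and': 1}
```

The grouping step is exactly where data skew bites: a hot key like "the" produces one very long value list handled by a single reducer.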
Fault Tolerance and Speculative Execution
MapReduce is designed for commodity hardware where node failures are expected. Fault tolerance: the master (the JobTracker in classic Hadoop, the driver in Spark) tracks task progress. If a Map task fails, it is re-executed on another node (the map output is recomputed, since the input block is still in HDFS). If a Reduce task fails, it is re-executed, re-fetching map output in the shuffle phase. Speculative execution (straggler mitigation): some tasks run slower than others due to hardware degradation, GC pauses, or data skew. The master detects tasks whose progress lags well behind the average and launches duplicate speculative copies on other nodes. Whichever copy finishes first is used; the other is killed. This keeps a few slow outliers from dominating total job duration.
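The straggler-detection idea can be sketched as follows, assuming the master tracks a progress score in [0, 1] per task; the function name and threshold are illustrative, not Hadoop's actual API:

```python
def find_stragglers(progress, threshold=0.2):
    """Flag tasks whose progress lags the mean by more than `threshold`.

    progress: dict mapping task id -> progress score in [0, 1].
    Returns the task ids that are candidates for speculative copies.
    """
    if not progress:
        return []
    mean = sum(progress.values()) / len(progress)
    return [task for task, p in progress.items() if mean - p > threshold]

# map-2 lags far behind the mean, so a speculative copy would be launched
progress = {"map-0": 0.9, "map-1": 0.85, "map-2": 0.3}
print(find_stragglers(progress))  # ['map-2']
```

Real schedulers use progress *rate* rather than raw progress (a task that started late is not necessarily slow), but the shape of the decision is the same.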
Apache Spark: In-Memory Batch Processing
Hadoop MapReduce writes intermediate results to disk after every Map phase and every Reduce phase; for iterative algorithms (e.g., machine learning training), this means reading and writing the full dataset to disk on every iteration, orders of magnitude slower than keeping it in memory. Apache Spark keeps intermediate results in memory (RDDs/DataFrames) and computes lazily, building a DAG of transformations that only executes when an action is called. For iterative workloads, Spark is often 10-100x faster than Hadoop MapReduce. Spark also unifies batch, streaming (Spark Streaming, Structured Streaming), machine learning (MLlib), and graph processing (GraphX) under one engine. The Catalyst query optimizer compiles DataFrame operations into efficient physical plans with predicate pushdown and join reordering.
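Lazy evaluation is the key mechanism here. A minimal sketch in plain Python (illustrative only; real Spark records RDD lineage and schedules stages across executors) shows transformations merely recording a plan, with execution deferred until an action:

```python
class LazyDataset:
    """Toy dataset: transformations build a plan, an action runs it."""

    def __init__(self, data, plan=None):
        self.data = data
        self.plan = plan or []          # recorded transformations, not yet run

    def map(self, fn):
        # Transformation: nothing is computed here, just plan bookkeeping
        return LazyDataset(self.data, self.plan + [("map", fn)])

    def filter(self, pred):
        return LazyDataset(self.data, self.plan + [("filter", pred)])

    def collect(self):
        # Action: only now does the whole recorded chain execute
        out = self.data
        for kind, fn in self.plan:
            if kind == "map":
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out

ds = LazyDataset([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
print(ds.collect())  # [20, 30, 40]
```

Deferring execution is what lets an optimizer like Catalyst see the whole plan at once and, for example, push a filter below a map before anything runs.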
Key Interview Discussion Points
- Data skew: when one key has far more values than the others, its reducer becomes a bottleneck (in word count, the reducer for “the” may receive orders of magnitude more data than its peers); mitigations: salting (add a random prefix to skewed keys and aggregate in two stages), sampling to detect skew and redistribute, or using Spark broadcast joins for small tables
- Combiner optimization: a Combiner (mini-reducer) runs after the Map phase on the same node, pre-aggregating values before the shuffle (sum counts locally before sending across the network); reduces shuffle data volume dramatically for commutative, associative operations
- Partitioning: the shuffle partitions keys across reducers using a hash function; custom partitioners ensure related keys go to the same reducer (e.g., all keys with the same date go to one reducer for time-series aggregation)
- Lambda vs Kappa architecture: Lambda runs separate batch (MapReduce/Spark) and streaming (Flink) pipelines, merging results for queries; Kappa runs only a streaming pipeline and replays historical data through it for batch reprocessing
- Dataflow model (Google): generalizes batch and streaming by treating everything as data with event-time windows; Apache Beam implements the Dataflow model with runners for Spark, Flink, and Google Dataflow
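The salting mitigation from the data-skew bullet can be sketched as a two-stage aggregation; the salt format and NUM_SALTS constant below are illustrative assumptions, not any framework's API:

```python
import random
from collections import Counter

NUM_SALTS = 4  # a hot key is spread over this many sub-keys

def salt_key(key):
    """Append a random salt so one hot key maps to NUM_SALTS sub-keys."""
    return f"{key}#{random.randrange(NUM_SALTS)}"

def first_pass(pairs):
    """Stage 1: aggregate by salted key (in a real job, each salted
    sub-key can land on a different reducer, spreading the hot key)."""
    counts = Counter()
    for key, value in pairs:
        counts[salt_key(key)] += value
    return counts

def second_pass(salted_counts):
    """Stage 2: strip the salt and combine the partial aggregates."""
    counts = Counter()
    for salted, value in salted_counts.items():
        counts[salted.rsplit("#", 1)[0]] += value
    return counts

# "the" is heavily skewed; salting splits its load across sub-keys
pairs = [("the", 1)] * 100 + [("fox", 1)] * 3
final = second_pass(first_pass(pairs))
print(dict(final))  # {'the': 100, 'fox': 3} (key order may vary)
```

The trade-off is a second aggregation stage; it pays off only when the hot key's reducer would otherwise dominate the job, which is the same reasoning behind using a Combiner for associative, commutative operations.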