Principle: Heibaiying BigData Notes MapReduce Reduce Phase
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Computing, Big_Data |
| Last Updated | 2026-02-10 10:00 GMT |
Overview
The Reduce phase of MapReduce aggregates all intermediate values associated with each unique key to produce the final output of the computation.
Description
After the Map phase completes, the framework performs a shuffle and sort step that groups all intermediate key-value pairs by key. The Reduce phase then invokes the user-defined reduce() function once per unique key, passing the key and an iterable of all values associated with that key.
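The grouping guarantee can be illustrated with a small simulation (plain Python standing in for the framework; the function name and sample pairs are illustrative, not Hadoop API):

```python
from collections import defaultdict

def shuffle_and_sort(mapped_pairs):
    """Group intermediate (key, value) pairs by key; return keys in sorted
    order, mimicking the order in which a reducer sees its keys."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return sorted(groups.items())

mapped = [("hadoop", 1), ("spark", 1), ("hadoop", 1)]
for key, values in shuffle_and_sort(mapped):
    print(key, values)
# hadoop [1, 1]
# spark [1]
```

Each unique key now appears exactly once, paired with the full list of its values, which is precisely the input contract of the user-defined reduce() function.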
In a word count application, the reducer receives a word as the key and an iterable of integer counts as the values (each count is 1, unless a Combiner has already pre-aggregated map-side output into partial sums). The reducer sums these counts and emits a single pair (word, totalCount), producing the final frequency for that word.
The number of reducers is configurable and determines the degree of parallelism in the reduce phase. Each reducer handles a subset of keys as determined by the Partitioner. The output of each reducer is written to a separate output file in the designated HDFS output directory (e.g., part-r-00000, part-r-00001, etc.).
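The key-to-reducer assignment can be sketched as follows. This is a Python analogue of Hadoop's default HashPartitioner, which computes `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`; the function and variable names here are illustrative:

```python
def partition(key, num_reducers):
    """Map a key to a reducer index in [0, num_reducers).
    Masking with 0x7FFFFFFF keeps the hash non-negative, as Hadoop's
    default HashPartitioner does with Integer.MAX_VALUE."""
    return (hash(key) & 0x7FFFFFFF) % num_reducers

num_reducers = 2
for word in ["hadoop", "spark", "hive"]:
    idx = partition(word, num_reducers)
    print(f"{word} -> reducer {idx} -> part-r-{idx:05d}")
```

Because the partition function is deterministic for a given key, every occurrence of that key is routed to the same reducer, which is what allows each reducer's output file to hold the complete result for its subset of keys.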
Usage
Use a custom Reducer when:
- You need to aggregate, summarize, or combine values that share the same key.
- The aggregation logic (sum, average, max, min, concatenation, etc.) requires iterating over all values for a key.
- You want to produce a single consolidated result per unique key.
- The reduce function is associative and commutative (which also allows it to be used as a Combiner).
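The last point can be checked concretely: when the reduce logic is associative and commutative (as summation is), pre-aggregating map-side chunks and then reducing the partial results yields the same answer as reducing everything at once, which is exactly why such a function is safe to reuse as a Combiner. A minimal sketch:

```python
def reduce_sum(values):
    """Sum-style reduce logic: associative and commutative."""
    return sum(values)

values = [1, 1, 1, 1, 1]

# Direct path: the reducer sees every raw value.
direct = reduce_sum(values)

# Combiner path: map-side chunks are pre-aggregated into partial sums,
# and the reducer then reduces only the partials.
partials = [reduce_sum(values[:3]), reduce_sum(values[3:])]
combined = reduce_sum(partials)

assert direct == combined  # both paths agree
```

By contrast, a non-associative operation such as computing a mean of the raw values would give a wrong answer on the combiner path, so it cannot be reused as a Combiner without restructuring (e.g., carrying (sum, count) pairs).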
Theoretical Basis
The Reduce function can be formally described as:
reduce: (K2, list(V2)) -> list(K3, V3)
Where:
- K2 is the intermediate key type (e.g., Text representing a word).
- list(V2) is the list of all intermediate values for that key (e.g., a list of IntWritable values).
- K3 is the output key type (typically the same as K2).
- V3 is the output value type (e.g., IntWritable representing the total count).
For word count, given the key "Hadoop" and values [1, 1, 1, 1, 1], the reduce function performs:
- Initialize a counter: sum = 0.
- Iterate over each value v in the values iterable: sum = sum + v.
- Emit the result: ("Hadoop", 5).
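The three steps above can be traced directly in code. This is a minimal sketch of the logic, not the Hadoop Reducer API; the function name is illustrative:

```python
def reduce_counts(key, values):
    total = 0            # 1. initialize the counter: sum = 0
    for v in values:     # 2. iterate over the values iterable
        total += v       #    accumulating sum = sum + v
    return (key, total)  # 3. emit the (key, totalCount) pair

print(reduce_counts("Hadoop", [1, 1, 1, 1, 1]))  # ('Hadoop', 5)
```

In a real Hadoop job the same logic lives in a Reducer subclass, with the emitted pair written via the framework's output context rather than returned.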
The correctness of the reduce phase depends on the shuffle and sort guarantee: all values for a given key are delivered to exactly one reducer, and they arrive grouped together. This invariant ensures that no counts are lost or double-counted.