Principle: Heibaiying BigData Notes MapReduce Map Phase
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Computing, Big_Data |
| Last Updated | 2026-02-10 10:00 GMT |
Overview
The Map phase of MapReduce transforms each input record into zero or more intermediate key-value pairs that are subsequently sorted, shuffled, and delivered to reducers.
Description
In the MapReduce programming model, the Map phase is the first stage of computation. The framework divides the input data into InputSplits, parses each split into records using the configured InputFormat, and invokes the user-defined map() function once per record. Each invocation receives a key-value pair representing a single record (for example, a byte offset and a line of text) and emits zero or more intermediate key-value pairs.
For a word count application, the mapper receives each line of text, tokenizes it by a delimiter (such as a tab character), and emits a pair (word, 1) for every token. This transforms unstructured text into a structured stream of countable events. The framework guarantees that all pairs with the same key will be grouped together before being passed to the reducer.
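The word-count mapper described above can be sketched as follows. This is a minimal simulation in plain Python, not the Hadoop API; the function name `map_fn` and the record shape `(offset, line)` are illustrative assumptions.

```python
from typing import Iterator, Tuple

def map_fn(offset: int, line: str) -> Iterator[Tuple[str, int]]:
    """Simulated mapper: emit (token, 1) for every tab-separated token."""
    for word in line.split("\t"):
        if word:  # skip empty tokens produced by stray delimiters
            yield (word, 1)

# One input record (byte offset, line) becomes three intermediate pairs.
pairs = list(map_fn(0, "Spark\tHadoop\tHBase"))
# pairs == [("Spark", 1), ("Hadoop", 1), ("HBase", 1)]
```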
The Map phase runs in parallel across all available mapper slots in the cluster. Each mapper processes one InputSplit independently, which means the map function must be stateless with respect to other splits. This embarrassingly parallel design is what gives MapReduce its scalability.
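The statelessness requirement can be illustrated with a small simulation: because each split is processed independently, the order in which splits run (or whether they run in parallel) cannot affect the multiset of emitted pairs. The split contents and helper names here are illustrative, not Hadoop code.

```python
def map_fn(offset, line):
    """Stateless mapper: depends only on its own record."""
    return [(w, 1) for w in line.split("\t") if w]

# Two InputSplits, each a list of (offset, line) records.
splits = [
    [(0, "Spark\tHadoop")],
    [(0, "HBase\tSpark")],
]

def run_split(split):
    out = []
    for offset, line in split:
        out.extend(map_fn(offset, line))
    return out

# Processing splits in either order yields the same multiset of pairs,
# which is what makes the Map phase embarrassingly parallel.
forward = sorted(p for s in splits for p in run_split(s))
backward = sorted(p for s in reversed(splits) for p in run_split(s))
assert forward == backward
```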
Usage
Use a custom Mapper when:
- You need to parse, filter, or transform raw input records into a structured intermediate representation.
- You want to extract specific fields or tokens from each input line.
- You need to emit multiple key-value pairs per input record (e.g., one pair per word in a line).
- The transformation logic is stateless and can operate on each record independently.
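The bullets above can be combined in one hypothetical custom mapper that parses, filters, and transforms in a single pass. The record layout (user, page, status) and the function name are assumptions for illustration, not part of the source.

```python
def access_log_mapper(offset, line):
    """Hypothetical mapper: parse a tab-separated log record,
    filter out malformed or failed requests, and emit one
    countable (page, 1) pair per successful hit."""
    parts = line.split("\t")
    if len(parts) != 3:      # filter: drop malformed records
        return
    user, page, status = parts
    if status == "200":      # filter: keep successful requests only
        yield (page, 1)      # transform: emit a structured event

hits = list(access_log_mapper(0, "alice\t/index\t200"))
misses = list(access_log_mapper(0, "bad-record"))
# hits == [("/index", 1)], misses == []
```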
Theoretical Basis
The Map function can be formally described as:
map: (K1, V1) -> list(K2, V2)
Where:
- K1 is the input key type (e.g., LongWritable representing the byte offset of the line).
- V1 is the input value type (e.g., Text representing the line content).
- K2 is the output key type (e.g., Text representing a word).
- V2 is the output value type (e.g., IntWritable representing the count 1).
For word count, given an input line L = "Spark\tHadoop\tHBase", the map function performs:
- Split the line by the tab delimiter to produce tokens: ["Spark", "Hadoop", "HBase"].
- Emit for each token t: (t, 1).
- The output for this single line is: [("Spark", 1), ("Hadoop", 1), ("HBase", 1)].
Across all mappers, the total number of emitted pairs equals the total number of words in the entire input dataset. The framework then sorts these pairs by key and partitions them for delivery to the appropriate reducer.
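The sort-and-partition step after the Map phase can be sketched in simplified form. This is not Hadoop's implementation; it only models the guarantee that every pair with the same key lands in the same partition (via `hash(key) % num_reducers`) and arrives sorted by key.

```python
def partition(pairs, num_reducers=2):
    """Simplified shuffle: route each pair to a reducer partition by
    hashing its key, then sort each partition so equal keys are adjacent."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[hash(key) % num_reducers].append((key, value))
    return [sorted(b) for b in buckets]

pairs = [("Spark", 1), ("Hadoop", 1), ("Spark", 1)]
parts = partition(pairs)
# Both ("Spark", 1) pairs land in the same partition, adjacent after sorting.
```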