Workflow:Heibaiying BigData Notes Hadoop MapReduce Word Count
| Knowledge Sources | |
|---|---|
| Domains | Big_Data, Batch_Processing, Hadoop |
| Last Updated | 2026-02-10 10:00 GMT |
Overview
End-to-end process for building and running a Hadoop MapReduce word count application, from writing Mapper and Reducer components to submitting the job on an HDFS cluster.
Description
This workflow outlines the standard procedure for developing a distributed batch processing job using the Hadoop MapReduce framework. It covers writing a Mapper to tokenize input text, a Reducer to aggregate word counts, and optionally adding a Combiner for local pre-aggregation and a custom Partitioner for controlled output distribution. The process demonstrates the full MapReduce lifecycle: input splitting, mapping, shuffling and sorting, reducing, and writing output to HDFS.
Usage
Execute this workflow when you need to process large volumes of text data stored in HDFS and want to count word frequencies using distributed batch computation. This is the canonical "Hello World" of big data processing and serves as the foundation for understanding all MapReduce-based pipelines.
Execution Steps
Step 1: Prepare Input Data
Generate or upload text data to HDFS for processing. The data generator utility creates simulated word frequency data and writes it to HDFS at a specified path. This establishes the input dataset that the MapReduce job will consume.
Key considerations:
- Ensure HDFS is running and accessible
- Data should be in text format with words separated by whitespace
- Verify the input path exists in HDFS before submitting the job
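A minimal way to prepare input by hand is to write a small text file locally and copy it into HDFS with the `hdfs dfs` shell. The file contents and the `/wordcount/input` path below are illustrative, not prescribed by this workflow; the upload is guarded so the snippet only calls HDFS when the CLI is present.

```shell
# Create a small local sample file (contents and paths are illustrative).
cat > /tmp/wordcount-input.txt <<'EOF'
hadoop mapreduce hadoop
hdfs yarn hadoop
EOF

# Upload to HDFS only if the hdfs CLI is available on this machine.
if command -v hdfs >/dev/null 2>&1; then
  hdfs dfs -mkdir -p /wordcount/input
  hdfs dfs -put -f /tmp/wordcount-input.txt /wordcount/input/
  hdfs dfs -ls /wordcount/input   # verify the input path exists before submitting
fi
```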
Step 2: Implement the Mapper
Write a Mapper class that extends Hadoop's Mapper base class with appropriate input/output key-value types. The Mapper receives each line of text, tokenizes it into individual words, and emits each word as a key with a count of one as the value.
What happens:
- Input: line offset (LongWritable) and line text (Text)
- Processing: split line into words
- Output: word (Text) as key, count of 1 (IntWritable) as value
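The Mapper described above can be sketched as follows. This is a sketch, not a definitive implementation: it assumes the `hadoop-client` library on the classpath, and the class name `WordCountMapper` is illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of a word-count Mapper; the class name is an illustrative assumption.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key is the byte offset of the line in the input split; value is the line text.
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE);   // emit (word, 1) for each token
        }
    }
}
```

Reusing the `word` and `ONE` objects across calls avoids allocating a new Writable per token, a common idiom in hot Mapper loops.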
Step 3: Implement the Reducer
Write a Reducer class that receives grouped key-value pairs from the shuffle phase. For each unique word, the Reducer iterates over all associated counts and sums them to produce the final word frequency.
What happens:
- Input: word (Text) and an iterable of counts (Iterable&lt;IntWritable&gt;)
- Processing: sum all count values for each word
- Output: word (Text) and total count (IntWritable)
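The summing Reducer can be sketched in the same spirit. As with the Mapper sketch, the Hadoop dependency is assumed and the class name `WordCountReducer` is illustrative.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of a word-count Reducer; the class name is an illustrative assumption.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();     // accumulate the 1s (or combined partial sums)
        }
        total.set(sum);
        context.write(key, total);  // emit (word, total count)
    }
}
```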
Step 4: Configure Optional Combiner
Optionally add a Combiner to perform local aggregation on each Mapper node before the shuffle phase. The Combiner uses the same logic as the Reducer but runs on the map side, reducing network traffic by pre-aggregating partial results.
Key considerations:
- The Combiner must be commutative and associative
- For word count, the Reducer class can be reused as the Combiner
- Not all MapReduce algorithms support Combiners
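The reason the Reducer can double as the Combiner is that summation is commutative and associative: summing per-mapper partial totals and then summing the partials gives the same result as summing everything at once. A plain-Java check of that property (outside the Hadoop API, purely for illustration):

```java
import java.util.Arrays;

// Demonstrates why per-mapper pre-aggregation (a Combiner) is safe for word count:
// combining partial sums does not change the final total.
public class CombinerSafetyDemo {

    static int sum(int[] counts) {
        int total = 0;
        for (int c : counts) total += c;
        return total;
    }

    public static void main(String[] args) {
        int[] allOnes = {1, 1, 1, 1, 1, 1};                     // six emissions of one word
        int direct = sum(allOnes);                              // reduce with no combiner
        int partial1 = sum(Arrays.copyOfRange(allOnes, 0, 4));  // map task 1 combines locally
        int partial2 = sum(Arrays.copyOfRange(allOnes, 4, 6));  // map task 2 combines locally
        int combined = sum(new int[] {partial1, partial2});     // reduce over the partials
        System.out.println(direct == combined);                 // prints "true"
    }
}
```

An average, by contrast, is not associative in this form, which is why not every algorithm can reuse its Reducer as a Combiner.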
Step 5: Configure Optional Custom Partitioner
Optionally implement a custom Partitioner to control how map output keys are distributed across Reducer tasks. This allows directing specific words to specific Reducers, enabling controlled output file organization.
Key considerations:
- The default HashPartitioner distributes keys by (hashCode &amp; Integer.MAX_VALUE) modulo the number of reduce tasks (the masking keeps the result non-negative)
- Custom partitioners enable domain-specific distribution logic
- Number of reduce tasks must match the partitioner's expected partition count
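The default partitioning arithmetic can be illustrated in plain Java, outside the Hadoop API; the helper and class names below are hypothetical. A custom Partitioner would override `getPartition` with domain-specific logic in place of this hash formula.

```java
// Plain-Java sketch of the arithmetic used by Hadoop's default HashPartitioner:
// mask off the sign bit, then take the hash modulo the number of reduce tasks.
public class HashPartitionDemo {

    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 3;
        for (String word : new String[] {"hadoop", "mapreduce", "yarn"}) {
            // Every occurrence of the same word lands on the same reducer.
            System.out.println(word + " -> reducer " + partitionFor(word, reducers));
        }
    }
}
```

Because the result is always in `[0, numReduceTasks)`, all values for a given word are routed to a single reduce task, which is what makes the per-key aggregation in Step 3 correct.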
Step 6: Assemble and Submit the Job
Create a Job configuration object, set the Mapper, Reducer, Combiner, and Partitioner classes, configure input/output paths on HDFS, and submit the job to the YARN cluster for execution.
What happens:
- Configure Job with mapper, reducer, combiner, partitioner classes
- Set input format and output format
- Set output key and value types
- Specify HDFS input and output paths
- Submit job and wait for completion
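A driver tying these settings together might look like the sketch below. It assumes Mapper and Reducer classes like those described in Steps 2 and 3; the class names (`WordCountDriver`, `WordCountMapper`, `WordCountReducer`) and the HDFS paths are illustrative assumptions, not fixed by this workflow.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver sketch; class names and paths are illustrative assumptions.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);   // optional local pre-aggregation (Step 4)
        job.setReducerClass(WordCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/wordcount/input"));
        // The output path must not already exist in HDFS, or submission fails.
        FileOutputFormat.setOutputPath(job, new Path("/wordcount/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a jar, this is typically submitted with `hadoop jar <jar> WordCountDriver`, after which YARN schedules the map and reduce tasks and the results land under the output path.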