
Workflow:Heibaiying BigData Notes Hadoop MapReduce Word Count

From Leeroopedia


Knowledge Sources
Domains: Big_Data, Batch_Processing, Hadoop
Last Updated: 2026-02-10 10:00 GMT

Overview

End-to-end process for building and running a Hadoop MapReduce word count application, from writing Mapper and Reducer components to submitting the job on an HDFS cluster.

Description

This workflow outlines the standard procedure for developing a distributed batch processing job using the Hadoop MapReduce framework. It covers writing a Mapper to tokenize input text, a Reducer to aggregate word counts, and optionally adding a Combiner for local pre-aggregation and a custom Partitioner for controlled output distribution. The process demonstrates the full MapReduce lifecycle: input splitting, mapping, shuffling and sorting, reducing, and writing output to HDFS.

Usage

Execute this workflow when you need to process large volumes of text data stored in HDFS and want to count word frequencies using distributed batch computation. This is the canonical "Hello World" of big data processing and serves as the foundation for understanding all MapReduce-based pipelines.

Execution Steps

Step 1: Prepare Input Data

Generate or upload text data to HDFS for processing. The data generator utility creates simulated word frequency data and writes it to HDFS at a specified path. This establishes the input dataset that the MapReduce job will consume.

Key considerations:

  • Ensure HDFS is running and accessible
  • Data should be in text format with words separated by whitespace
  • Verify the input path exists in HDFS before submitting the job
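A minimal sketch of the data preparation step, using Hadoop's FileSystem API to write a small whitespace-separated text file into HDFS. The path /wordcount/input/words.txt and the sample words are assumptions for illustration; adjust them to your cluster layout, and ensure fs.defaultFS in your configuration points at the NameNode.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WordCountDataGenerator {
    public static void main(String[] args) throws Exception {
        // Hypothetical input path; adjust to your cluster layout.
        Path input = new Path("/wordcount/input/words.txt");

        // Picks up fs.defaultFS from core-site.xml (e.g. hdfs://namenode:8020).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Overwrite (second argument = true) any existing file at the path.
        try (FSDataOutputStream out = fs.create(input, true)) {
            out.write("hadoop spark hadoop\nhbase hadoop spark\n"
                    .getBytes(StandardCharsets.UTF_8));
        }
    }
}
```

You can confirm the upload afterwards with a listing of the input directory before submitting the job.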

Step 2: Implement the Mapper

Write a Mapper class that extends Hadoop's Mapper base class with appropriate input/output key-value types. The Mapper receives each line of text, tokenizes it into individual words, and emits each word as a key with a count of one as the value.

What happens:

  • Input: line offset (LongWritable) and line text (Text)
  • Processing: split line into words
  • Output: word (Text) as key, count of 1 (IntWritable) as value
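The Mapper described above can be sketched as follows. The class name WordCountMapper is an assumption; the input/output types match the key-value contract listed above.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Reuse Writable instances across calls to avoid per-record allocation.
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the line on whitespace and emit (word, 1) for each token.
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}
```

Reusing the Text and IntWritable instances is a common Hadoop idiom: map() is called once per input record, so avoiding fresh allocations on every call reduces garbage-collection pressure.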

Step 3: Implement the Reducer

Write a Reducer class that receives grouped key-value pairs from the shuffle phase. For each unique word, the Reducer iterates over all associated counts and sums them to produce the final word frequency.

What happens:

  • Input: word (Text) and list of counts (IntWritable)
  • Processing: sum all count values for each word
  • Output: word (Text) and total count (IntWritable)
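A sketch of the Reducer, matching the key-value contract above. The class name WordCountReducer is an assumption.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // The shuffle phase groups all counts for the same word together;
        // summing them yields the word's total frequency.
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```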

Step 4: Configure Optional Combiner

Optionally add a Combiner to perform local aggregation on each Mapper node before the shuffle phase. The Combiner uses the same logic as the Reducer but runs on the map side, reducing network traffic by pre-aggregating partial results.

Key considerations:

  • The Combiner must be commutative and associative
  • For word count, the Reducer class can be reused as the Combiner
  • Not all MapReduce algorithms support Combiners
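Because summing counts is commutative and associative, wiring in the Combiner is a one-line job configuration change. This fragment assumes a Job instance named job and the WordCountReducer class from Step 3:

```java
// Word-count aggregation is commutative and associative,
// so the Reducer class can double as the map-side Combiner.
job.setCombinerClass(WordCountReducer.class);
```

Note that Hadoop may run the Combiner zero, one, or several times per map task, which is why the operation must tolerate repeated partial aggregation.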

Step 5: Configure Optional Custom Partitioner

Optionally implement a custom Partitioner to control how map output keys are distributed across Reducer tasks. This allows directing specific words to specific Reducers, enabling controlled output file organization.

Key considerations:

  • The default HashPartitioner assigns each key to partition (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
  • Custom partitioners enable domain-specific distribution logic
  • Number of reduce tasks must match the partitioner's expected partition count
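A sketch of such a Partitioner. The class name WordCountPartitioner and the choice of routing the word "hadoop" to its own reducer are illustrative assumptions; the fallback mirrors the default hash distribution over the remaining partitions.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordCountPartitioner extends Partitioner<Text, IntWritable> {

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (numPartitions == 1) {
            return 0;
        }
        // Route one word of interest to a dedicated reducer (partition 0),
        // so its counts land in their own output file.
        if ("hadoop".equals(key.toString())) {
            return 0;
        }
        // Hash-distribute everything else over the remaining partitions.
        return 1 + (key.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
    }
}
```

With this scheme the job must be configured with at least two reduce tasks, since partition indices greater than or equal to the reduce task count cause the job to fail.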

Step 6: Assemble and Submit the Job

Create a Job configuration object, set the Mapper, Reducer, Combiner, and Partitioner classes, configure input/output paths on HDFS, and submit the job to the YARN cluster for execution.

What happens:

  • Configure Job with mapper, reducer, combiner, partitioner classes
  • Set input format and output format
  • Set output key and value types
  • Specify HDFS input and output paths
  • Submit job and wait for completion
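The steps above come together in a driver class. This is a sketch assuming the WordCountMapper and WordCountReducer classes from the earlier steps and hypothetical HDFS paths /wordcount/input and /wordcount/output; the Partitioner lines are commented out since Steps 4 and 5 are optional.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountApp {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountApp.class);

        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);  // optional (Step 4)
        job.setReducerClass(WordCountReducer.class);
        // job.setPartitionerClass(WordCountPartitioner.class);  // optional (Step 5)
        // job.setNumReduceTasks(3);  // must match the partitioner's partition count

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Assumed HDFS paths; the output directory must not already exist,
        // or the job fails with FileAlreadyExistsException.
        FileInputFormat.addInputPath(job, new Path("/wordcount/input"));
        FileOutputFormat.setOutputPath(job, new Path("/wordcount/output"));

        // Submit to the cluster and block until the job finishes.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged as a jar, this is typically submitted with hadoop jar; results appear in HDFS as one part-r-NNNNN file per reduce task.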

Execution Diagram

GitHub URL

Workflow Repository