
Principle:Heibaiying BigData Notes Prepare Input Data

From Leeroopedia


Knowledge Sources
Domains Distributed_Computing, Big_Data
Last Updated 2026-02-10 10:00 GMT

Overview

Data preparation for MapReduce involves creating structured input files on HDFS that mappers can split and process in parallel.

Description

Before a MapReduce job can execute, input data must be staged on the Hadoop Distributed File System (HDFS). The quality and structure of this input data directly affect the efficiency and correctness of downstream map and reduce tasks. In a word count workflow, input data typically consists of text files where each line contains one or more words separated by a delimiter (such as a tab character).
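As a minimal sketch of this format, a single tab-delimited line can be tokenized the way a word-count mapper would tokenize it; the line content here is a hypothetical example, not data from the original notes:

```java
import java.util.Arrays;
import java.util.List;

public class LineParseExample {
    public static void main(String[] args) {
        // A sample tab-delimited input line, as described above (hypothetical content).
        String line = "hadoop\tspark\thive\thbase";

        // A word-count mapper would split each line on the delimiter
        // and emit one (word, 1) pair per token.
        List<String> words = Arrays.asList(line.split("\t"));

        System.out.println(words);        // [hadoop, spark, hive, hbase]
        System.out.println(words.size()); // 4
    }
}
```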

Generating test data is an essential step during development and testing of MapReduce applications. A data generation utility constructs synthetic input by sampling from a predefined vocabulary and writing the resulting lines to HDFS. This approach allows developers to validate their MapReduce pipeline end-to-end without relying on external data sources.

HDFS stores input files in configurable block sizes (default 128 MB), and the InputFormat class determines how blocks are divided into InputSplits. Each InputSplit is assigned to exactly one mapper. Therefore, the size and structure of the input data influence how many mappers are launched and how evenly work is distributed across the cluster.
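The relationship between input size, block size, and mapper count can be sketched with a simple calculation. This assumes the default FileInputFormat behavior, where split size equals block size; the 300 MB file size is an assumed example, and real split computation also honors configurable min/max split-size settings:

```java
public class SplitCountExample {
    // Default HDFS block size, as noted above.
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // 128 MB

    // With split size == block size, the number of splits (and thus mappers)
    // is ceil(fileSize / blockSize), computed here with integer arithmetic.
    static long numSplits(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    public static void main(String[] args) {
        long fileSize = 300L * 1024 * 1024; // hypothetical 300 MB input file
        System.out.println(numSplits(fileSize)); // 3 splits -> 3 mappers
    }
}
```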

Usage

Use data preparation utilities when:

  • You need repeatable, deterministic test data for validating a MapReduce pipeline.
  • You are bootstrapping a development environment and require sample input on HDFS.
  • You want to control the vocabulary and distribution of words to test specific mapper or reducer behaviors.
  • You need to benchmark job performance with a known input size.

Theoretical Basis

The data preparation phase can be described in the following logical steps:

  1. Define vocabulary: Select a finite set of words W = {w1, w2, ..., wn} that will appear in the generated data.
  2. Generate lines: For each line i from 1 to N (where N is the desired number of lines), randomly shuffle the vocabulary and concatenate words using a delimiter (e.g., tab character).
  3. Write to local buffer: Accumulate generated lines in memory or in a local temporary file.
  4. Upload to HDFS: Using the Hadoop FileSystem API, create an output stream to the target HDFS path and write the generated data.
  5. Verify: Confirm the file exists on HDFS and is readable by the MapReduce framework.
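The steps above can be sketched as follows. The vocabulary, line count, and seed are assumptions for illustration; the HDFS upload (step 4) is indicated only in a comment, since it requires the Hadoop FileSystem API and a reachable cluster, so this sketch writes to a local temporary file instead:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class PrepareInputData {
    public static void main(String[] args) throws IOException {
        // Step 1: define a finite vocabulary W (hypothetical words).
        List<String> vocabulary = new ArrayList<>(
                List.of("hadoop", "spark", "hive", "hbase", "flink", "kafka"));

        // Step 2: generate N lines; each line is the shuffled vocabulary
        // joined with a tab delimiter. A fixed seed keeps the data repeatable.
        int n = 1000;
        Random random = new Random(42);
        List<String> lines = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            Collections.shuffle(vocabulary, random);
            lines.add(String.join("\t", vocabulary));
        }

        // Step 3: accumulate in a local buffer/file.
        Path local = Files.createTempFile("wcinput", ".txt");
        Files.write(local, lines);

        // Step 4: in a real deployment, upload to HDFS, e.g. via the Hadoop
        // FileSystem API (FileSystem.get(conf).create(...)) or `hdfs dfs -put`.
        // Omitted here so the sketch stays self-contained.

        // Step 5: verify locally that the file exists and holds N lines.
        System.out.println(Files.exists(local));              // true
        System.out.println(Files.readAllLines(local).size()); // 1000
    }
}
```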

The total number of words generated is N × |W| (number of lines multiplied by vocabulary size). For a vocabulary of 6 words and 1000 lines, the output contains 6000 word occurrences, with each word appearing exactly 1000 times, since every line contains every word exactly once in shuffled order.
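This count can be checked directly by generating lines under the scheme above and tallying occurrences; the vocabulary and N here are assumed examples:

```java
import java.util.*;

public class WordCountCheck {
    public static void main(String[] args) {
        List<String> vocabulary = new ArrayList<>(
                List.of("hadoop", "spark", "hive", "hbase", "flink", "kafka"));
        int n = 1000;
        Random random = new Random(7);

        // Tally word occurrences over N shuffled lines.
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < n; i++) {
            Collections.shuffle(vocabulary, random);
            for (String word : String.join("\t", vocabulary).split("\t")) {
                counts.merge(word, 1, Integer::sum);
            }
        }

        // Total occurrences = N * |W| = 6000; each word appears exactly N times.
        int total = counts.values().stream().mapToInt(Integer::intValue).sum();
        System.out.println(total);               // 6000
        System.out.println(counts.get("spark")); // 1000
    }
}
```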

Related Pages

Implemented By
