
Principle:Heibaiying BigData Notes Prepare Input Data

From Leeroopedia


Knowledge Sources
Domains Distributed_Computing, Big_Data
Last Updated 2026-02-10 10:00 GMT

Overview

Data preparation for MapReduce involves creating structured input files on HDFS that mappers can split and process in parallel.

Description

Before a MapReduce job can execute, input data must be staged on the Hadoop Distributed File System (HDFS). The quality and structure of this input data directly affect the efficiency and correctness of downstream map and reduce tasks. In a word count workflow, input data typically consists of text files where each line contains one or more words separated by a delimiter (such as a tab character).
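As a minimal sketch of this format, a single tab-delimited line can be tokenized the way a word-count mapper would tokenize it; the line content here is a hypothetical example, not data from the original notes:

```java
import java.util.Arrays;
import java.util.List;

public class LineParseExample {
    public static void main(String[] args) {
        // A sample tab-delimited input line, as described above (hypothetical content).
        String line = "hadoop\tspark\thive\thbase";

        // A word-count mapper would split each line on the delimiter
        // and emit one (word, 1) pair per token.
        List<String> words = Arrays.asList(line.split("\t"));

        System.out.println(words);        // [hadoop, spark, hive, hbase]
        System.out.println(words.size()); // 4
    }
}
```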

Generating test data is an essential step during development and testing of MapReduce applications. A data generation utility constructs synthetic input by sampling from a predefined vocabulary and writing the resulting lines to HDFS. This approach allows developers to validate their MapReduce pipeline end-to-end without relying on external data sources.

HDFS stores input files in configurable block sizes (default 128 MB), and the InputFormat class determines how blocks are divided into InputSplits. Each InputSplit is assigned to exactly one mapper. Therefore, the size and structure of the input data influence how many mappers are launched and how evenly work is distributed across the cluster.
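The relationship between input size, block size, and mapper count can be sketched with a simple calculation. This assumes the default FileInputFormat behavior, where split size equals block size; the 300 MB file size is an assumed example, and real split computation also honors configurable min/max split-size settings:

```java
public class SplitCountExample {
    // Default HDFS block size, as noted above.
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // 128 MB

    // With split size == block size, the number of splits (and thus mappers)
    // is ceil(fileSize / blockSize), computed here with integer arithmetic.
    static long numSplits(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    public static void main(String[] args) {
        long fileSize = 300L * 1024 * 1024; // hypothetical 300 MB input file
        System.out.println(numSplits(fileSize)); // 3 splits -> 3 mappers
    }
}
```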

Usage

Use data preparation utilities when:

  • You need repeatable, deterministic test data for validating a MapReduce pipeline.
  • You are bootstrapping a development environment and require sample input on HDFS.
  • You want to control the vocabulary and distribution of words to test specific mapper or reducer behaviors.
  • You need to benchmark job performance with a known input size.

Theoretical Basis

The data preparation phase can be described in the following logical steps:

  1. Define vocabulary: Select a finite set of words W = {w1, w2, ..., wn} that will appear in the generated data.
  2. Generate lines: For each line i from 1 to N (where N is the desired number of lines), randomly shuffle the vocabulary and concatenate words using a delimiter (e.g., tab character).
  3. Write to local buffer: Accumulate generated lines in memory or in a local temporary file.
  4. Upload to HDFS: Using the Hadoop FileSystem API, create an output stream to the target HDFS path and write the generated data.
  5. Verify: Confirm the file exists on HDFS and is readable by the MapReduce framework.
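The steps above can be sketched as follows. The vocabulary, line count, and seed are assumptions for illustration; the HDFS upload (step 4) is indicated only in a comment, since it requires the Hadoop FileSystem API and a reachable cluster, so this sketch writes to a local temporary file instead:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class PrepareInputData {
    public static void main(String[] args) throws IOException {
        // Step 1: define a finite vocabulary W (hypothetical words).
        List<String> vocabulary = new ArrayList<>(
                List.of("hadoop", "spark", "hive", "hbase", "flink", "kafka"));

        // Step 2: generate N lines; each line is the shuffled vocabulary
        // joined with a tab delimiter. A fixed seed keeps the data repeatable.
        int n = 1000;
        Random random = new Random(42);
        List<String> lines = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            Collections.shuffle(vocabulary, random);
            lines.add(String.join("\t", vocabulary));
        }

        // Step 3: accumulate in a local buffer/file.
        Path local = Files.createTempFile("wcinput", ".txt");
        Files.write(local, lines);

        // Step 4: in a real deployment, upload to HDFS, e.g. via the Hadoop
        // FileSystem API (FileSystem.get(conf).create(...)) or `hdfs dfs -put`.
        // Omitted here so the sketch stays self-contained.

        // Step 5: verify locally that the file exists and holds N lines.
        System.out.println(Files.exists(local));              // true
        System.out.println(Files.readAllLines(local).size()); // 1000
    }
}
```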

The total number of words generated is N × |W| (number of lines multiplied by vocabulary size). For a vocabulary of 6 words and 1000 lines, the output contains 6000 word occurrences, with each word appearing exactly 1000 times, since every line contains every word exactly once in shuffled order.
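This count can be checked directly by generating lines under the scheme above and tallying occurrences; the vocabulary and N here are assumed examples:

```java
import java.util.*;

public class WordCountCheck {
    public static void main(String[] args) {
        List<String> vocabulary = new ArrayList<>(
                List.of("hadoop", "spark", "hive", "hbase", "flink", "kafka"));
        int n = 1000;
        Random random = new Random(7);

        // Tally word occurrences over N shuffled lines.
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i < n; i++) {
            Collections.shuffle(vocabulary, random);
            for (String word : String.join("\t", vocabulary).split("\t")) {
                counts.merge(word, 1, Integer::sum);
            }
        }

        // Total occurrences = N * |W| = 6000; each word appears exactly N times.
        int total = counts.values().stream().mapToInt(Integer::intValue).sum();
        System.out.println(total);               // 6000
        System.out.println(counts.get("spark")); // 1000
    }
}
```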

Related Pages

Implemented By
