Implementation: Heibaiying BigData-Notes WordCountDataUtils generateDataToHDFS
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Computing, Big_Data |
| Last Updated | 2026-02-10 10:00 GMT |
Overview
A concrete utility, provided by the BigData-Notes repository, for generating word count test data and writing it to HDFS.
Description
The WordCountDataUtils utility class provides methods for generating synthetic word count input data and uploading it to HDFS. The class maintains a static vocabulary list of six distributed computing framework names: Spark, Hadoop, HBase, Storm, Flink, and Hive.
The generateData() method creates 1000 lines of tab-delimited text, where each line contains all six words in a randomly shuffled order. The generateDataToHDFS() method connects to an HDFS cluster, creates an output file, and writes the generated data to the specified HDFS path using the Hadoop FileSystem API.
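The repository's generateData() is not reproduced here, but from the description above its logic can be sketched as follows. The class name, the lineCount parameter, and the helper structure are assumptions for illustration; the repository hard-codes 1000 lines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class GenerateDataSketch {

    // Vocabulary mirrors WORD_LIST in WordCountDataUtils.
    static final List<String> WORD_LIST = Arrays.asList(
            "Spark", "Hadoop", "HBase", "Storm", "Flink", "Hive");

    // Sketch of the generation step: each line is the full vocabulary,
    // shuffled into a random order and joined with tab characters.
    static List<String> generateData(int lineCount) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < lineCount; i++) {
            List<String> words = new ArrayList<>(WORD_LIST);
            Collections.shuffle(words);
            lines.add(String.join("\t", words));
        }
        return lines;
    }

    public static void main(String[] args) {
        // Print a small sample; the real utility generates 1000 lines.
        for (String line : generateData(3)) {
            System.out.println(line);
        }
    }
}
```

Because each line is a permutation of the same six words, every word occurs exactly once per line regardless of the shuffle.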
Usage
Use this utility to generate test input data for the word count MapReduce pipeline. It is typically invoked before running the WordCountApp job to ensure input data exists on HDFS. Because the word order is shuffled randomly, the file contents differ between runs, though the per-word counts (1000 each) do not.
Code Reference
Source Location
- Repository: BigData-Notes
- File: code/Hadoop/hadoop-word-count/src/main/java/com/heibaiying/utils/WordCountDataUtils.java
- Lines: L22-91
Signature
public class WordCountDataUtils {

    public static final List<String> WORD_LIST = Arrays.asList(
            "Spark", "Hadoop", "HBase", "Storm", "Flink", "Hive"
    );

    public static void generateDataToHDFS(String hdfsUrl, String user, String outputPathString)
            throws IOException, InterruptedException, URISyntaxException;
}
Import
import com.heibaiying.utils.WordCountDataUtils;
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| hdfsUrl | String | Yes | The HDFS namenode URL (e.g., hdfs://hadoop001:8020) |
| user | String | Yes | The HDFS user name for authentication |
| outputPathString | String | Yes | The HDFS path where the generated data file will be written |
Outputs
| Name | Type | Description |
|---|---|---|
| void | void | The method writes a file to HDFS as a side effect; no return value |
| HDFS file | Text file | A file at outputPathString containing 1000 lines of tab-delimited shuffled words |
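The contract above (no return value, a text file written as a side effect) can be illustrated with a runnable sketch. The Hadoop calls in the comments show the usual FileSystem API shape; a local temp file stands in for the HDFS target so the sketch runs without a hadoop-client dependency, and all names below (WriteDataSketch, buildLines, writeData) are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class WriteDataSketch {

    // Build n shuffled, tab-joined lines, mirroring the generated data format.
    static List<String> buildLines(int n) {
        List<String> vocabulary = Arrays.asList(
                "Spark", "Hadoop", "HBase", "Storm", "Flink", "Hive");
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            List<String> words = new ArrayList<>(vocabulary);
            Collections.shuffle(words);
            lines.add(String.join("\t", words));
        }
        return lines;
    }

    // Write the lines out as a side effect and return the output location.
    // The real utility targets HDFS roughly like:
    //   FileSystem fs = FileSystem.get(new URI(hdfsUrl), new Configuration(), user);
    //   FSDataOutputStream out = fs.create(new Path(outputPathString));
    // Here a local temp file stands in for the HDFS output stream.
    static Path writeData(int n) throws IOException {
        Path output = Files.createTempFile("wordcount-input", ".txt");
        Files.write(output, buildLines(n), StandardCharsets.UTF_8);
        return output;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(writeData(1000));
    }
}
```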
Usage Examples
Basic Usage
import com.heibaiying.utils.WordCountDataUtils;

public class GenerateInputData {

    // Generate test data and write it to HDFS before running WordCountApp.
    // generateDataToHDFS declares checked exceptions (IOException,
    // InterruptedException, URISyntaxException), so main propagates them.
    public static void main(String[] args) throws Exception {
        String hdfsUrl = "hdfs://hadoop001:8020";
        String user = "root";
        String outputPath = "/wordcount/input/data.txt";
        WordCountDataUtils.generateDataToHDFS(hdfsUrl, user, outputPath);
    }
}
Understanding Generated Data Format
// Each line contains all 6 words in random order, tab-separated.
// Example output lines:
// HBase Flink Spark Hadoop Storm Hive
// Hadoop Storm HBase Flink Hive Spark
// Storm Hive Hadoop Spark Flink HBase
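Since every generated line contains the full vocabulary exactly once, a word count job over N lines should report a count of N for each of the six words. The snippet below checks that property on the example lines above; the class and method names are illustrative, not from the repository.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FormatCheck {

    // Tokenize tab-delimited lines the way a word count mapper would,
    // tallying occurrences per word.
    static Map<String, Integer> countWords(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            for (String word : line.split("\t")) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // The three example lines from above, with tabs made explicit.
        List<String> lines = Arrays.asList(
                "HBase\tFlink\tSpark\tHadoop\tStorm\tHive",
                "Hadoop\tStorm\tHBase\tFlink\tHive\tSpark",
                "Storm\tHive\tHadoop\tSpark\tFlink\tHBase");
        // Each word appears once per line, so every count equals the line count.
        System.out.println(countWords(lines)); // every word -> 3
    }
}
```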