Implementation:Google research Deduplicate text datasets Load Dataset HF

Knowledge Sources	deduplicate-text-datasets, HuggingFace Datasets, Deduplicating Training Data Makes Language Models Better
Domains	Data_Processing, NLP, Text_Deduplication
Last Updated	2026-02-14 21:00 GMT

Overview

A concrete tool, provided by the deduplicate-text-datasets repository, that loads HuggingFace datasets and serializes them into flat binary files for suffix array deduplication.

Description

The load_dataset_hf.py script uses the HuggingFace datasets library to load a named dataset, then serializes it into a flat binary file in which examples are delimited by separators: the two bytes \xff\xff followed by a 4-byte UID. It supports Hub datasets, local text, JSON, and CSV files (via a FILE_EXTENSIONS mapping), optional GPT-2 tokenization with a configurable number of parallel workers, and a configurable text feature key.

The script produces the same output format as the TFDS variant: a flat binary data file and a .size file with uint64 cumulative offsets.
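
The layout described above can be illustrated with a minimal in-memory sketch. It is not the script's actual code: the helper names (make_separator, serialize) are hypothetical, and the sketch assumes that the separator is written before each example, that the UID is a little-endian unsigned integer, and that UIDs start at 1.

```python
import struct

PRE_SEP = b"\xff\xff"  # two-byte marker named in the description


def make_separator(uid: int) -> bytes:
    """Return the marker followed by a 4-byte UID.

    Little-endian byte order is an assumption of this sketch.
    """
    return PRE_SEP + struct.pack("<I", uid)


def serialize(examples):
    """Concatenate examples, placing a separator before each one.

    Returns the flat byte string and the cumulative byte offsets
    (starting at 0) that mark each example boundary.
    """
    out = bytearray()
    offsets = [0]
    for uid, text in enumerate(examples, start=1):  # UIDs from 1: an assumption
        out += make_separator(uid) + text.encode("utf-8")
        offsets.append(len(out))
    return bytes(out), offsets


data, offsets = serialize(["hello", "world!"])
print(data)
print(offsets)
```

In this sketch the offsets list is what a .size file would hold once packed as uint64 values, and each entry points at the start of a separator rather than at the start of the text.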

Usage

Use this script as the first step in deduplication when working with HuggingFace Hub datasets, local text files, local JSONL files, or local CSV files. It is the preferred loader for non-TFDS datasets.

Code Reference

Source Location

Repository: deduplicate-text-datasets
File: scripts/load_dataset_hf.py (L1-91)

Signature

python3 scripts/load_dataset_hf.py \
    --save_dir <output_directory> \
    --name <dataset_name> \
    --split <split_name> \
    [--data_dir <local_data_directory>] \
    [--subset <subset_name>] \
    [--tokenize] \
    [--num_workers <n>] \
    [--text_feature_key <key>]

Import

# CLI script, not importable as a library.
python3 scripts/load_dataset_hf.py --save_dir <dir> --name <name> --split <split>

I/O Contract

Inputs

Name	Type	Required	Description
--save_dir	str	Yes	Output directory for serialized files
--name	str	Yes	Dataset name (HuggingFace Hub name, or "text"/"json"/"csv" for local files)
--split	str	Yes	Split name (e.g., "train", "test")
--data_dir	str	No	Local data directory (required when --name is "text", "json", or "csv")
--subset	str	No	Dataset subset/config name (default: None)
--tokenize	flag	No	If set, tokenize text to GPT-2 uint16 token IDs
--num_workers	int	No	Number of parallel workers for tokenization (default: None = single process)
--text_feature_key	str	No	Name of the text feature column (default: "text")
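
To make the --tokenize option more concrete, the sketch below shows how a list of token IDs could be packed into uint16 bytes and read back. It does not call a tokenizer: the IDs are illustrative values, the function names are hypothetical, and little-endian byte order is an assumption. The GPT-2 vocabulary has 50257 entries, so every ID fits in 16 bits.

```python
import struct


def pack_token_ids(token_ids):
    """Pack token IDs as consecutive little-endian uint16 values."""
    for token_id in token_ids:
        if not 0 <= token_id < 2**16:
            raise ValueError(f"token id {token_id} does not fit in uint16")
    return struct.pack(f"<{len(token_ids)}H", *token_ids)


def unpack_token_ids(raw):
    """Inverse of pack_token_ids: read back a list of uint16 values."""
    return list(struct.unpack(f"<{len(raw) // 2}H", raw))


ids = [15496, 995, 50256]  # illustrative IDs, all below the GPT-2 vocabulary size
packed = pack_token_ids(ids)
print(len(packed))               # two bytes per token
print(unpack_token_ids(packed))  # round-trips to the original list
```

Under these assumptions, the separator marker \xff\xff read as a uint16 is 65535, which is larger than any GPT-2 token ID, so it cannot be mistaken for a token in a tokenized file.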

Outputs

Name	Type	Description
<save_dir>/<name>.<split>	Binary file	Flat binary file with separator-delimited examples
<save_dir>/<name>.<split>.size	Binary file	uint64 array of cumulative byte offsets per example boundary
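
As a rough illustration of how the two output files fit together, the sketch below reads a .size file and uses consecutive offsets to slice one example out of the data file. The function names and the demo.train file are hypothetical. The sketch assumes little-endian uint64 offsets that start at 0, and that each slice includes its leading separator. It builds a small stand-in pair of files first so that it runs on its own.

```python
import os
import struct
import tempfile


def read_offsets(size_path):
    """Read a .size file as a list of uint64 values (little-endian assumed)."""
    with open(size_path, "rb") as f:
        raw = f.read()
    count = len(raw) // 8
    return list(struct.unpack(f"<{count}Q", raw))


def read_example(data_path, offsets, index):
    """Return the raw bytes between two consecutive offsets."""
    start, end = offsets[index], offsets[index + 1]
    with open(data_path, "rb") as f:
        f.seek(start)
        return f.read(end - start)


# Build a tiny stand-in for <save_dir>/<name>.<split> and its .size file.
tmp_dir = tempfile.mkdtemp()
data_path = os.path.join(tmp_dir, "demo.train")
records = [b"\xff\xff\x01\x00\x00\x00first", b"\xff\xff\x02\x00\x00\x00second"]
with open(data_path, "wb") as f:
    f.write(b"".join(records))
boundaries = [0, len(records[0]), len(records[0]) + len(records[1])]
with open(data_path + ".size", "wb") as f:
    f.write(struct.pack("<3Q", *boundaries))

loaded_offsets = read_offsets(data_path + ".size")
second = read_example(data_path, loaded_offsets, 1)
print(loaded_offsets)
print(second[6:])  # skip the 2-byte marker and 4-byte UID
```

Seeking by offset avoids scanning the data file for separators, which matters when the serialized file is large.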

Usage Examples

Serialize a HuggingFace Hub Dataset

# Load and serialize the C4 validation split
python3 scripts/load_dataset_hf.py \
    --save_dir /data/suffix_arrays \
    --name c4 \
    --split validation \
    --subset en

# Output files:
# /data/suffix_arrays/c4.validation       (flat binary)
# /data/suffix_arrays/c4.validation.size  (offset array)

Serialize Local Text Files

# Load all .txt files from a local directory
python3 scripts/load_dataset_hf.py \
    --save_dir /data/suffix_arrays \
    --name text \
    --data_dir /data/my_corpus \
    --split train

Serialize with Tokenization

# Serialize with GPT-2 tokenization using 8 parallel workers
python3 scripts/load_dataset_hf.py \
    --save_dir /data/suffix_arrays \
    --name openwebtext \
    --split train \
    --tokenize \
    --num_workers 8

Custom Text Feature Key

# Dataset where text is in "content" column instead of "text"
python3 scripts/load_dataset_hf.py \
    --save_dir /data/suffix_arrays \
    --name my_dataset \
    --split train \
    --text_feature_key content

Related Pages

Implements Principle

Principle:Google_research_Deduplicate_text_datasets_Dataset_Serialization_HF

Requires Environment

Environment:Google_research_Deduplicate_text_datasets_Python_HuggingFace_Environment
