Implementation:Google research Deduplicate text datasets Load Dataset HF

From Leeroopedia
Revision as of 12:48, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Google_research_Deduplicate_text_datasets_Load_Dataset_HF.md)
Knowledge Sources
Domains: Data_Processing, NLP, Text_Deduplication
Last Updated: 2026-02-14 21:00 GMT

Overview

A concrete tool, provided by the deduplicate-text-datasets repository, that loads HuggingFace datasets and serializes them into flat binary files for suffix array deduplication.

Description

The load_dataset_hf.py script uses the HuggingFace datasets library to load a named dataset, then serializes it into a single flat binary file in which examples are delimited by a separator: the two bytes \xff\xff followed by a 4-byte UID. It supports Hub datasets, local text/JSON/CSV files (via a FILE_EXTENSIONS mapping), optional GPT-2 tokenization with a configurable number of parallel workers, and a configurable text feature key.

The script produces the same output format as the TFDS variant: a flat binary data file and a .size file with uint64 cumulative offsets.
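The layout described above can be illustrated with a minimal sketch. It is not the script's code: the function names are hypothetical, and the choices that the separator precedes each example, that the UID is a little-endian counter starting at 1, and that the offset list begins at 0 are assumptions made for illustration.

```python
import struct

SEPARATOR_PREFIX = b"\xff\xff"  # two-byte marker named in the description


def serialize_examples(examples):
    """Concatenate examples into one byte string and track record boundaries.

    Assumptions (not confirmed by this page): the separator precedes each
    example, the UID is a little-endian uint32 counter starting at 1, and
    the offset list starts at 0.
    """
    chunks = []
    offsets = [0]
    for uid, example in enumerate(examples, start=1):
        record = SEPARATOR_PREFIX + struct.pack("<I", uid) + example
        chunks.append(record)
        offsets.append(offsets[-1] + len(record))
    return b"".join(chunks), offsets


def offsets_to_size_bytes(offsets):
    """Pack cumulative offsets as little-endian uint64 values (.size payload)."""
    return struct.pack(f"<{len(offsets)}Q", *offsets)


data, offsets = serialize_examples([b"hello", b"world!"])
size_bytes = offsets_to_size_bytes(offsets)
print(offsets)          # [0, 11, 23]
print(len(size_bytes))  # 24
```

Each record here is 6 separator bytes plus the example bytes, so the two examples occupy 11 and 12 bytes respectively.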

Usage

Use this script as the first step in deduplication when working with HuggingFace Hub datasets, local text files, local JSONL files, or local CSV files. It is the preferred loader for non-TFDS datasets.

Code Reference

Source Location

Signature

python3 scripts/load_dataset_hf.py \
    --save_dir <output_directory> \
    --name <dataset_name> \
    --split <split_name> \
    [--data_dir <local_data_directory>] \
    [--subset <subset_name>] \
    [--tokenize] \
    [--num_workers <n>] \
    [--text_feature_key <key>]

Import

# CLI script, not importable as a library.
python3 scripts/load_dataset_hf.py --save_dir <dir> --name <name> --split <split>

I/O Contract

Inputs

Name | Type | Required | Description
--save_dir | str | Yes | Output directory for serialized files
--name | str | Yes | Dataset name (HuggingFace Hub name, or "text"/"json"/"csv" for local files)
--split | str | Yes | Split name (e.g., "train", "test")
--data_dir | str | No | Local data directory (required when --name is "text", "json", or "csv")
--subset | str | No | Dataset subset/config name (default: None)
--tokenize | flag | No | If set, tokenize text to GPT-2 uint16 token IDs
--num_workers | int | No | Number of parallel workers for tokenization (default: None = single process)
--text_feature_key | str | No | Name of the text feature column (default: "text")

Outputs

Name | Type | Description
<save_dir>/<name>.<split> | Binary file | Flat binary file with separator-delimited examples
<save_dir>/<name>.<split>.size | Binary file | uint64 array of cumulative byte offsets per example boundary
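As a rough illustration of how the two outputs relate, the sketch below recovers individual examples from an in-memory copy of the data and offset bytes. The helper names are hypothetical, and it assumes little-endian uint64 offsets beginning at 0 and a 6-byte separator (\xff\xff plus a 4-byte UID) at the start of each record; check these against the actual files before relying on them.

```python
import struct

SEPARATOR_LENGTH = 6  # assumed: 2 marker bytes + 4 UID bytes


def read_offsets(size_bytes):
    """Decode a .size payload as a list of little-endian uint64 offsets."""
    count = len(size_bytes) // 8
    return list(struct.unpack(f"<{count}Q", size_bytes))


def split_examples(data, offsets):
    """Slice the flat data at consecutive offsets and drop each separator."""
    records = [data[start:end] for start, end in zip(offsets, offsets[1:])]
    return [record[SEPARATOR_LENGTH:] for record in records]


# Tiny hand-built stand-ins for <name>.<split> and <name>.<split>.size.
data = (
    b"\xff\xff" + struct.pack("<I", 1) + b"abc"
    + b"\xff\xff" + struct.pack("<I", 2) + b"de"
)
size_bytes = struct.pack("<3Q", 0, 9, 17)

offsets = read_offsets(size_bytes)
texts = split_examples(data, offsets)
print(texts)  # [b'abc', b'de']
```

For real files, the same functions could be applied to the bytes read from disk with open(path, "rb").read().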

Usage Examples

Serialize a HuggingFace Hub Dataset

# Load and serialize the C4 validation split
python3 scripts/load_dataset_hf.py \
    --save_dir /data/suffix_arrays \
    --name c4 \
    --split validation \
    --subset en

# Output files:
# /data/suffix_arrays/c4.validation       (flat binary)
# /data/suffix_arrays/c4.validation.size  (offset array)

Serialize Local Text Files

# Load all .txt files from a local directory
python3 scripts/load_dataset_hf.py \
    --save_dir /data/suffix_arrays \
    --name text \
    --data_dir /data/my_corpus \
    --split train

Serialize with Tokenization

# Serialize with GPT-2 tokenization using 8 parallel workers
python3 scripts/load_dataset_hf.py \
    --save_dir /data/suffix_arrays \
    --name openwebtext \
    --split train \
    --tokenize \
    --num_workers 8
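
With --tokenize, each example is stored as uint16 token IDs rather than raw text bytes. GPT-2's vocabulary has 50,257 entries, so every ID fits in 16 bits. The sketch below shows only that ID-to-bytes conversion; the helper names and the sample IDs are hypothetical, the little-endian byte order is an assumption, and the tokenizer call itself is omitted.

```python
import struct


def token_ids_to_bytes(token_ids):
    """Pack token IDs as little-endian uint16 values, two bytes per token."""
    if any(not 0 <= token_id <= 0xFFFF for token_id in token_ids):
        raise ValueError("token ID does not fit in uint16")
    return struct.pack(f"<{len(token_ids)}H", *token_ids)


def bytes_to_token_ids(raw):
    """Inverse of token_ids_to_bytes."""
    return list(struct.unpack(f"<{len(raw) // 2}H", raw))


sample_ids = [15496, 995]  # arbitrary illustrative IDs, not tokenizer output
raw = token_ids_to_bytes(sample_ids)
print(len(raw))                 # 4
print(bytes_to_token_ids(raw))  # [15496, 995]
```

One consequence of this encoding is that byte offsets in a tokenized output file count two bytes per token rather than one byte per character.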

Custom Text Feature Key

# Dataset where text is in "content" column instead of "text"
python3 scripts/load_dataset_hf.py \
    --save_dir /data/suffix_arrays \
    --name my_dataset \
    --split train \
    --text_feature_key content

Related Pages

Implements Principle

Requires Environment
