Implementation:Google research Deduplicate text datasets Load Dataset HF
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, NLP, Text_Deduplication |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
A concrete tool, provided by the deduplicate-text-datasets repository, that loads HuggingFace datasets and serializes them into flat binary files for suffix array deduplication.
Description
The load_dataset_hf.py script uses the HuggingFace datasets library to load a named dataset, then serializes it into a flat binary file in which examples are delimited by separators made of the bytes \xff\xff followed by a 4-byte UID. It supports Hub datasets, local text/JSON/CSV files (via a FILE_EXTENSIONS mapping), optional GPT-2 tokenization with a configurable number of parallel workers, and a configurable text feature key.
The script produces the same output format as the TFDS variant: a flat binary data file and a .size file with uint64 cumulative offsets.
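The layout described above can be illustrated with a minimal sketch. It is not the script's code: the function name serialize_examples is hypothetical, and it assumes that the separator precedes each example, that the UID is a little-endian uint32 counting up from 1, and that the offset list starts at 0.

```python
import struct

SEPARATOR_PREFIX = b"\xff\xff"  # two-byte marker named in the description


def serialize_examples(texts):
    """Return (data, size_bytes) for a list of strings.

    Assumptions of this sketch: the separator precedes each example, the UID
    is a little-endian uint32 counting from 1, and offsets start at 0.
    """
    data = bytearray()
    offsets = [0]
    for uid, text in enumerate(texts, start=1):
        record = SEPARATOR_PREFIX + struct.pack("<I", uid) + text.encode("utf8")
        data += record
        # Cumulative byte offset of the boundary after this record.
        offsets.append(offsets[-1] + len(record))
    # A .size file would hold these offsets as consecutive uint64 values.
    size_bytes = struct.pack(f"<{len(offsets)}Q", *offsets)
    return bytes(data), size_bytes


data, size_bytes = serialize_examples(["hello", "world!"])
```

Under these assumptions each record costs 6 bytes of separator on top of its payload, so the two examples above end at byte offsets 11 and 23.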
Usage
Use this script as the first step in deduplication when working with HuggingFace Hub datasets, local text files, local JSONL files, or local CSV files. It is the preferred loader for non-TFDS datasets.
Code Reference
Source Location
- Repository: deduplicate-text-datasets
- File: scripts/load_dataset_hf.py (L1-91)
Signature
python3 scripts/load_dataset_hf.py \
--save_dir <output_directory> \
--name <dataset_name> \
--split <split_name> \
[--data_dir <local_data_directory>] \
[--subset <subset_name>] \
[--tokenize] \
[--num_workers <n>] \
[--text_feature_key <key>]
Import
# CLI script, not importable as a library.
python3 scripts/load_dataset_hf.py --save_dir <dir> --name <name> --split <split>
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| --save_dir | str | Yes | Output directory for serialized files |
| --name | str | Yes | Dataset name (HuggingFace Hub name, or "text"/"json"/"csv" for local files) |
| --split | str | Yes | Split name (e.g., "train", "test") |
| --data_dir | str | No | Local data directory (required when --name is "text", "json", or "csv") |
| --subset | str | No | Dataset subset/config name (default: None) |
| --tokenize | flag | No | If set, tokenize text to GPT-2 uint16 token IDs |
| --num_workers | int | No | Number of parallel workers for tokenization (default: None, meaning a single process) |
| --text_feature_key | str | No | Name of the text feature column (default: "text") |
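The table above maps naturally onto an argparse parser. The following is a sketch reconstructed from the table, not a copy of the script: build_parser is a hypothetical helper, and marking the first three flags with required=True is an assumption based on the "Required" column.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (sketch, not the script)."""
    parser = argparse.ArgumentParser(description="Load a dataset.")
    # Assumed to be enforced as required, following the table.
    parser.add_argument("--save_dir", type=str, required=True)
    parser.add_argument("--name", type=str, required=True)
    parser.add_argument("--split", type=str, required=True)
    # Optional flags with the defaults listed in the table.
    parser.add_argument("--data_dir", type=str, default=None)
    parser.add_argument("--subset", type=str, default=None)
    parser.add_argument("--tokenize", action="store_true")
    parser.add_argument("--num_workers", type=int, default=None)
    parser.add_argument("--text_feature_key", type=str, default="text")
    return parser


# Example invocation for local text files; the paths are placeholders.
args = build_parser().parse_args(
    ["--save_dir", "/tmp/out", "--name", "text",
     "--split", "train", "--data_dir", "/tmp/corpus"]
)
```

Parsing a list of strings, as done here, makes it easy to check the defaults without running the real script.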
Outputs
| Name | Type | Description |
|---|---|---|
| <save_dir>/<name>.<split> | Binary file | Flat binary file with separator-delimited examples |
| <save_dir>/<name>.<split>.size | Binary file | uint64 array of cumulative byte offsets per example boundary |
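To show how the two output files relate, here is a small sketch that slices a flat buffer back into examples using the offsets. The helper split_examples is hypothetical, and the sketch assumes little-endian uint64 offsets that start at 0 and a 6-byte separator (2 marker bytes plus a 4-byte UID) at the start of each record.

```python
import struct

SEPARATOR_LEN = 6  # assumed: 2 marker bytes + 4-byte UID


def split_examples(data, size_bytes):
    """Slice a flat buffer into per-example payloads using uint64 offsets."""
    count = len(size_bytes) // 8  # each offset occupies 8 bytes
    offsets = struct.unpack(f"<{count}Q", size_bytes)
    examples = []
    for start, end in zip(offsets, offsets[1:]):
        record = data[start:end]
        # Drop the assumed leading separator to recover the payload.
        examples.append(record[SEPARATOR_LEN:])
    return examples


# Tiny hand-built buffer: two records of 6 separator bytes plus payload.
data = b"\xff\xff\x01\x00\x00\x00abc" + b"\xff\xff\x02\x00\x00\x00de"
size_bytes = struct.pack("<3Q", 0, 9, 17)
examples = split_examples(data, size_bytes)
```

With real output, data and size_bytes would come from reading the two files in binary mode, for example open(path, "rb").read().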
Usage Examples
Serialize a HuggingFace Hub Dataset
# Load and serialize the C4 validation split
python3 scripts/load_dataset_hf.py \
--save_dir /data/suffix_arrays \
--name c4 \
--split validation \
--subset en
# Output files:
# /data/suffix_arrays/c4.validation (flat binary)
# /data/suffix_arrays/c4.validation.size (offset array)
Serialize Local Text Files
# Load all .txt files from a local directory
python3 scripts/load_dataset_hf.py \
--save_dir /data/suffix_arrays \
--name text \
--data_dir /data/my_corpus \
--split train
Serialize with Tokenization
# Serialize with GPT-2 tokenization using 8 parallel workers
python3 scripts/load_dataset_hf.py \
--save_dir /data/suffix_arrays \
--name openwebtext \
--split train \
--tokenize \
--num_workers 8
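With tokenization enabled, each example is stored as uint16 token IDs rather than UTF-8 text; this works because the GPT-2 vocabulary (50,257 entries) fits below 65,536. The sketch below shows the packing and unpacking step only. The helper names are hypothetical, the token IDs are illustrative values, and little-endian byte order is an assumption.

```python
import struct


def token_ids_to_bytes(token_ids):
    """Pack token IDs as little-endian uint16 values (2 bytes per token)."""
    # struct raises an error if any ID does not fit in 16 bits.
    return struct.pack(f"<{len(token_ids)}H", *token_ids)


def bytes_to_token_ids(raw):
    """Inverse of token_ids_to_bytes."""
    return list(struct.unpack(f"<{len(raw) // 2}H", raw))


ids = [15496, 995]  # illustrative token IDs
raw = token_ids_to_bytes(ids)
restored = bytes_to_token_ids(raw)
```

One consequence of this encoding is that byte offsets in a tokenized file correspond to token positions multiplied by two, plus whatever the separators contribute.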
Custom Text Feature Key
# Dataset where text is in "content" column instead of "text"
python3 scripts/load_dataset_hf.py \
--save_dir /data/suffix_arrays \
--name my_dataset \
--split train \
--text_feature_key content