Implementation:Google research Deduplicate text datasets Load Dataset TFDS
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, NLP, Text_Deduplication |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
A concrete tool from the deduplicate-text-datasets repository that loads TensorFlow Datasets (TFDS) and serializes them into flat binary files for suffix array deduplication.
Description
The load_dataset.py script calls tensorflow_datasets.load() to load a named dataset, then iterates over it in batches of 2^16 (65,536) examples and writes each example to a flat binary file. Each example is preceded by a unique separator: the prefix \xff\xff followed by a 4-byte UID. With optional tokenization (GPT-2 or T5 tokenizers), text is converted to token IDs stored as uint16 values, two bytes per token. Tokenization is parallelized using Python multiprocessing.
Two output files are produced: the flat binary data file and a .size file containing cumulative byte offsets as uint64 values.
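The layout described above can be sketched in a few lines. This is a minimal illustration rather than the script itself: the helper name `serialize_examples` is hypothetical, and the little-endian byte order of the 4-byte UID, the counter starting at 1, and the placement of the suffix directly after the UID are all assumptions.

```python
import struct

import numpy as np


def serialize_examples(examples, pre_sep=b"\xff\xff", post_sep=b""):
    """Hypothetical helper: concatenate examples, each preceded by a separator.

    Returns (flat_bytes, offsets), where offsets is a uint64 array of
    cumulative byte positions that starts at 0.
    """
    chunks = []
    offsets = [0]
    for uid, example in enumerate(examples, start=1):
        # Assumption: the UID is a little-endian unsigned 32-bit counter.
        record = pre_sep + struct.pack("<I", uid) + post_sep + example
        chunks.append(record)
        offsets.append(offsets[-1] + len(record))
    return b"".join(chunks), np.array(offsets, dtype=np.uint64)


flat, offsets = serialize_examples([b"hello", b"hi"])
print(offsets.tolist())  # [0, 11, 19]
```

Each record costs 6 bytes of overhead with the default separators (2 prefix bytes plus the 4-byte UID), which is why the first offset after `b"hello"` is 11 rather than 5.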
Usage
Use this script as the first step in any TFDS-based deduplication pipeline (Wiki40B, cross-dataset, or suffix array querying). For HuggingFace Hub datasets, use load_dataset_hf.py instead.
Code Reference
Source Location
- Repository: deduplicate-text-datasets
- File: scripts/load_dataset.py (L1-93)
Signature
python3 scripts/load_dataset.py \
--data_dir <tfds_data_directory> \
--save_dir <output_directory> \
--name <dataset_name> \
--split <split_name> \
[--tokenize] \
[--tokenizer gpt2|t5] \
[--pre_sep <bytes>] \
[--post_sep <bytes>]
Import
# CLI script, not importable as a library.
python3 scripts/load_dataset.py --data_dir <dir> --save_dir <dir> --name <name> --split <split>
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| --data_dir | str | Yes | TFDS data directory (passed to tfds.load data_dir parameter) |
| --save_dir | str | Yes | Output directory for serialized files |
| --name | str | Yes | Dataset name (e.g., "wiki40b") |
| --split | str | Yes | Split name (e.g., "test", "train") |
| --tokenize | flag | No | If set, tokenize text to uint16 token IDs |
| --tokenizer | str | No | Tokenizer choice: "gpt2" (default) or "t5" |
| --pre_sep | bytes | No | Example separator prefix (default: \xff\xff) |
| --post_sep | bytes | No | Example separator suffix (default: empty) |
Outputs
| Name | Type | Description |
|---|---|---|
| <save_dir>/<name>.<split> | Binary file | Flat binary file with separator-delimited examples |
| <save_dir>/<name>.<split>.size | Binary file | uint64 array of cumulative byte offsets per example boundary |
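To make the relationship between the two output files concrete, the following sketch splits a flat buffer back into examples using the offset array. It assumes the default separators (a 2-byte \xff\xff prefix, a 4-byte UID, and an empty suffix) and a little-endian UID; `split_examples` is a hypothetical helper, not part of the repository.

```python
import struct

import numpy as np


def split_examples(flat, offsets, pre_sep=b"\xff\xff"):
    """Hypothetical helper: yield (uid, payload) for each serialized example."""
    header_len = len(pre_sep) + 4  # separator prefix plus 4-byte UID
    for start, end in zip(offsets[:-1], offsets[1:]):
        record = flat[int(start):int(end)]
        if not record.startswith(pre_sep):
            raise ValueError(f"missing separator at offset {int(start)}")
        # Assumption: the UID is stored as a little-endian uint32.
        (uid,) = struct.unpack("<I", record[len(pre_sep):header_len])
        yield uid, record[header_len:]


# Tiny in-memory stand-in for the data file and its .size file.
flat = b"\xff\xff\x01\x00\x00\x00hello" + b"\xff\xff\x02\x00\x00\x00hi"
offsets = np.array([0, 11, 19], dtype=np.uint64)
examples = list(split_examples(flat, offsets))
print(examples)  # [(1, b'hello'), (2, b'hi')]
```

For real output, `flat` would be the bytes of the data file and `offsets` could be read with `np.fromfile(path, dtype=np.uint64)` from the matching .size file.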
Usage Examples
Serialize Wiki40B Test Split
# Load Wiki40B test split from TFDS and serialize to binary
python3 scripts/load_dataset.py \
--data_dir /data/tensorflow_datasets \
--save_dir /data/suffix_arrays \
--name wiki40b \
--split test
# Output files:
# /data/suffix_arrays/wiki40b.test (flat binary)
# /data/suffix_arrays/wiki40b.test.size (offset array)
Serialize with Tokenization
# Serialize with GPT-2 tokenization for token-level deduplication
python3 scripts/load_dataset.py \
--data_dir /data/tensorflow_datasets \
--save_dir /data/suffix_arrays \
--name wiki40b \
--split train \
--tokenize \
--tokenizer gpt2
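With --tokenize, each example's payload is a sequence of uint16 token IDs rather than raw text. The sketch below shows that packing and its inverse with NumPy. The token IDs are arbitrary illustrative values, the helper names are hypothetical, and native byte order (little-endian on most machines) is an assumption. Two bytes per token are enough because the GPT-2 and T5 vocabularies both have fewer than 65,536 entries.

```python
import numpy as np


def pack_tokens(token_ids):
    """Hypothetical helper: store token IDs as uint16, two bytes per token."""
    return np.array(token_ids, dtype=np.uint16).tobytes()


def unpack_tokens(payload):
    """Hypothetical helper: recover token IDs from a uint16 byte payload."""
    return np.frombuffer(payload, dtype=np.uint16).tolist()


ids = [15496, 995, 50256]  # illustrative IDs, not produced by a tokenizer here
payload = pack_tokens(ids)
print(len(payload))  # 6
print(unpack_tokens(payload) == ids)  # True
```

With the default separators, the 6-byte record header is itself a multiple of two bytes, so token payloads stay aligned to uint16 boundaries within the flat file.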
Verify Output
import os
import numpy as np
# Load the .size file: a uint64 array of cumulative byte offsets
size_path = "/data/suffix_arrays/wiki40b.test.size"
sizes = np.fromfile(size_path, dtype=np.uint64)
print(f"Number of examples: {len(sizes) - 1}")
print(f"Total bytes: {sizes[-1]}")
# The last offset should equal the size of the data file on disk
data_path = "/data/suffix_arrays/wiki40b.test"
assert os.path.getsize(data_path) == int(sizes[-1]), "Size mismatch"