Implementation:Google research Deduplicate text datasets Load Dataset TFDS

Knowledge Sources	deduplicate-text-datasets; TensorFlow Datasets; Deduplicating Training Data Makes Language Models Better
Domains	Data_Processing, NLP, Text_Deduplication
Last Updated	2026-02-14 21:00 GMT

Overview

Concrete tool, provided by the deduplicate-text-datasets repository, for loading TensorFlow Datasets and serializing them into flat binary files for suffix array deduplication.

Description

The load_dataset.py script calls tensorflow_datasets.load() to load a named dataset, then iterates over it in batches of 2^16 (65,536) examples and appends each example to a single flat binary file. Each example is preceded by a unique separator: the prefix \xff\xff followed by a 4-byte UID. When tokenization is enabled, the text is first encoded with the GPT-2 or T5 tokenizer and the token IDs are written as uint16 values (2 bytes per token) instead of raw text bytes. Tokenization is parallelized with Python multiprocessing.
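The record framing described above can be sketched in a few lines. This is a minimal illustration, not the script itself: the helper names make_separator and serialize_examples are hypothetical, and the little-endian byte order of the UID and the choice to start UIDs at 1 are assumptions.

```python
import struct

PRE_SEP = b"\xff\xff"  # default separator prefix, as described above
POST_SEP = b""         # default separator suffix (empty)


def make_separator(uid, pre_sep=PRE_SEP, post_sep=POST_SEP):
    """Build one separator: prefix + 4-byte UID + suffix.

    Assumption: the UID is packed as a little-endian unsigned 32-bit integer.
    """
    return pre_sep + struct.pack("<I", uid) + post_sep


def serialize_examples(examples):
    """Concatenate examples, each preceded by its own separator.

    Assumption: UIDs are a simple counter starting at 1.
    """
    out = bytearray()
    for uid, text in enumerate(examples, start=1):
        out += make_separator(uid) + text
    return bytes(out)


blob = serialize_examples([b"hello", b"world"])
print(blob)
```

Because every separator carries a different UID, no two separators are identical, so a separator by itself cannot show up as a repeated substring shared across examples.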

Two output files are produced: the flat binary data file and a .size file that stores the cumulative byte offsets of the example boundaries as uint64 values. The last offset equals the total size of the data file.
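To make the layout of the .size file concrete, the sketch below builds a cumulative offset list for two toy records. It uses only the standard library so it runs without NumPy; the helper name cumulative_offsets is hypothetical, and the leading 0 entry is an assumption consistent with counting examples as the number of offsets minus one.

```python
from array import array
from itertools import accumulate


def cumulative_offsets(record_lengths):
    """Return [0, len0, len0 + len1, ...] for the given record lengths."""
    return [0] + list(accumulate(record_lengths))


# Two toy records, each a 6-byte separator followed by its payload.
records = [
    b"\xff\xff\x01\x00\x00\x00hello",
    b"\xff\xff\x02\x00\x00\x00hi",
]
offsets = cumulative_offsets(len(r) for r in records)

# "Q" is an unsigned 64-bit integer in native byte order, i.e. uint64.
size_file_bytes = array("Q", offsets).tobytes()
print(offsets, len(size_file_bytes))
```

With this layout, example i occupies the byte range from offsets[i] up to (but not including) offsets[i + 1] in the data file.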

Usage

Use this script as the first step in any TFDS-based deduplication pipeline (Wiki40B, cross-dataset, or suffix array querying). For HuggingFace Hub datasets, use load_dataset_hf.py instead.

Code Reference

Source Location

Repository: deduplicate-text-datasets
File: scripts/load_dataset.py (L1-93)

Signature

python3 scripts/load_dataset.py \
    --data_dir <tfds_data_directory> \
    --save_dir <output_directory> \
    --name <dataset_name> \
    --split <split_name> \
    [--tokenize] \
    [--tokenizer gpt2|t5] \
    [--pre_sep <bytes>] \
    [--post_sep <bytes>]

Import

# CLI script, not importable as a library.
python3 scripts/load_dataset.py --data_dir <dir> --save_dir <dir> --name <name> --split <split>

I/O Contract

Inputs

Name	Type	Required	Description
--data_dir	str	Yes	TFDS data directory (passed to tfds.load data_dir parameter)
--save_dir	str	Yes	Output directory for serialized files
--name	str	Yes	Dataset name (e.g., "wiki40b")
--split	str	Yes	Split name (e.g., "test", "train")
--tokenize	flag	No	If set, tokenize text to uint16 token IDs
--tokenizer	str	No	Tokenizer choice: "gpt2" (default) or "t5"
--pre_sep	bytes	No	Example separator prefix (default: \xff\xff)
--post_sep	bytes	No	Example separator suffix (default: empty)

Outputs

Name	Type	Description
<save_dir>/<name>.<split>	Binary file	Flat binary file with separator-delimited examples
<save_dir>/<name>.<split>.size	Binary file	uint64 array of cumulative byte offsets per example boundary

Usage Examples

Serialize Wiki40B Test Split

# Load Wiki40B test split from TFDS and serialize to binary
python3 scripts/load_dataset.py \
    --data_dir /data/tensorflow_datasets \
    --save_dir /data/suffix_arrays \
    --name wiki40b \
    --split test

# Output files:
# /data/suffix_arrays/wiki40b.test       (flat binary)
# /data/suffix_arrays/wiki40b.test.size  (offset array)

Serialize with Tokenization

# Serialize with GPT-2 tokenization for token-level deduplication
python3 scripts/load_dataset.py \
    --data_dir /data/tensorflow_datasets \
    --save_dir /data/suffix_arrays \
    --name wiki40b \
    --split train \
    --tokenize \
    --tokenizer gpt2
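With --tokenize, each example payload is a run of 2-byte token IDs rather than text. The round trip below is a sketch of that packing under stated assumptions: it uses the standard-library array module as a stand-in for a NumPy uint16 array, the function names are hypothetical, and the token IDs are illustrative values rather than the output of a real tokenizer run.

```python
from array import array


def pack_token_ids(token_ids):
    """Pack token IDs as unsigned 16-bit integers (2 bytes each, native byte order)."""
    return array("H", token_ids).tobytes()


def unpack_token_ids(payload):
    """Recover the list of token IDs from a packed payload."""
    ids = array("H")
    ids.frombytes(payload)
    return ids.tolist()


token_ids = [15496, 995, 50256]  # illustrative IDs only
payload = pack_token_ids(token_ids)
restored = unpack_token_ids(payload)
print(len(payload), restored)
```

A 16-bit encoding only works when every token ID is below 65,536, which is why it suits tokenizers with vocabularies of that size or smaller.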

Verify Output

import os

import numpy as np

# Load the cumulative byte offsets written alongside the data file
size_path = "/data/suffix_arrays/wiki40b.test.size"
sizes = np.fromfile(size_path, dtype=np.uint64)

# One offset per example boundary, so there is one more offset than examples
print(f"Number of examples: {len(sizes) - 1}")
print(f"Total bytes: {sizes[-1]}")

# The last offset should equal the size of the data file on disk
data_path = "/data/suffix_arrays/wiki40b.test"
assert os.path.getsize(data_path) == int(sizes[-1]), "Size mismatch"
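Beyond checking the total size, the offsets can be used to pull a single example out of the data file. The sketch below writes a tiny two-example file pair to a temporary directory and reads one record back. The read_example helper is hypothetical, and it assumes the default separator (2-byte prefix, 4-byte little-endian UID, empty suffix) and an offset array that starts at 0.

```python
import os
import struct
import tempfile
from array import array

PRE_SEP_LEN = 2  # length of the default b"\xff\xff" prefix
UID_LEN = 4      # assumed: UID stored as a 4-byte little-endian integer


def read_example(data_path, size_path, index):
    """Return (uid, payload) for example `index`, using the .size offsets."""
    offsets = array("Q")  # unsigned 64-bit, native byte order
    with open(size_path, "rb") as f:
        offsets.frombytes(f.read())
    start, end = offsets[index], offsets[index + 1]
    with open(data_path, "rb") as f:
        f.seek(start)
        record = f.read(end - start)
    header_len = PRE_SEP_LEN + UID_LEN
    uid = struct.unpack("<I", record[PRE_SEP_LEN:header_len])[0]
    return uid, record[header_len:]


with tempfile.TemporaryDirectory() as tmp:
    data_path = os.path.join(tmp, "demo.test")
    size_path = data_path + ".size"

    # Build two toy records and their cumulative offsets.
    texts = [b"first", b"second"]
    records = [b"\xff\xff" + struct.pack("<I", i + 1) + t for i, t in enumerate(texts)]
    offsets = [0]
    for r in records:
        offsets.append(offsets[-1] + len(r))

    with open(data_path, "wb") as f:
        f.write(b"".join(records))
    with open(size_path, "wb") as f:
        f.write(array("Q", offsets).tobytes())

    uid, payload = read_example(data_path, size_path, 1)

print(uid, payload)
```

Seeking to the stored offset avoids scanning the data file for separators, which matters when the serialized dataset is many gigabytes.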

Related Pages

Implements Principle

Principle:Google_research_Deduplicate_text_datasets_Dataset_Serialization_TFDS

Requires Environment

Environment:Google_research_Deduplicate_text_datasets_Python_TFDS_Environment
