Implementation:Google research Deduplicate text datasets Load Dataset TFDS

From Leeroopedia
Knowledge Sources
Domains Data_Processing, NLP, Text_Deduplication
Last Updated 2026-02-14 21:00 GMT

Overview

A concrete tool, provided by the deduplicate-text-datasets repository, for loading TensorFlow Datasets (TFDS) datasets and serializing them into flat binary files for suffix array deduplication.

Description

The load_dataset.py script calls tensorflow_datasets.load() to load a named dataset, then iterates over it in batches of 2^16 (65,536) examples and appends each example to a flat binary file. Each example is preceded by a unique separator: the prefix \xff\xff followed by a 4-byte UID. With --tokenize, the script first encodes the text with a GPT-2 or T5 tokenizer and stores the token IDs as uint16 values (2 bytes per token). Tokenization is parallelized with Python multiprocessing.
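The record layout described above can be sketched with the standard library alone. This is a minimal illustration, not the script's code: the helper names make_record and pack_token_ids are hypothetical, and the little-endian byte order of both the UID and the token IDs is an assumption that the description does not state.

```python
import struct

PRE_SEP = b"\xff\xff"  # default separator prefix named in the description
POST_SEP = b""         # default separator suffix (empty)

def make_record(uid, payload, pre_sep=PRE_SEP, post_sep=POST_SEP):
    """Return one serialized example: separator (prefix + UID + suffix), then payload."""
    # Assumption: the UID is an unsigned 32-bit little-endian integer.
    return pre_sep + struct.pack("<I", uid) + post_sep + payload

def pack_token_ids(token_ids):
    """Pack token IDs as consecutive uint16 values (2 bytes per token)."""
    # Assumption: little-endian byte order.
    return struct.pack(f"<{len(token_ids)}H", *token_ids)

# Untokenized mode: the payload is the raw UTF-8 text.
text_record = make_record(1, "hello".encode("utf8"))

# Tokenized mode: the payload is the packed token IDs (arbitrary IDs here).
token_record = make_record(2, pack_token_ids([15496, 995]))
```

A uint16 holds values up to 65,535, so this packing only works for vocabularies that fit in that range; struct.pack raises struct.error for a larger ID.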

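Two output files are produced: the flat binary data file and a .size file containing the cumulative byte offsets of the example boundaries as uint64 values. The offsets start at 0, so a dataset of N examples yields N + 1 entries, and the last entry equals the size of the data file.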
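The relationship between the two files can be illustrated with a small in-memory sketch. The function names below are hypothetical, and the code assumes the uint64 offsets are stored little-endian, beginning with 0 and ending with the total length. The real files would be read from disk instead of built in memory.

```python
import struct

def pack_offsets(offsets):
    """Encode a list of cumulative offsets as consecutive uint64 values."""
    # Assumption: little-endian byte order.
    return struct.pack(f"<{len(offsets)}Q", *offsets)

def unpack_offsets(raw):
    """Decode the bytes of a .size file into a list of integers."""
    count = len(raw) // 8  # 8 bytes per uint64 entry
    return list(struct.unpack(f"<{count}Q", raw[:count * 8]))

def get_record(data, offsets, index):
    """Slice record `index` (separator included) out of the flat data."""
    return data[offsets[index]:offsets[index + 1]]

# Two toy records standing in for the flat binary file.
records = [b"\xff\xff\x01\x00\x00\x00hello", b"\xff\xff\x02\x00\x00\x00world!"]
data = b"".join(records)

# Cumulative offsets: one entry per boundary, starting at 0.
offsets = [0]
for record in records:
    offsets.append(offsets[-1] + len(record))

size_bytes = pack_offsets(offsets)
restored = unpack_offsets(size_bytes)
second = get_record(data, restored, 1)
```

Because the offsets are cumulative, any single example can be recovered with one slice, without scanning the data file for separators.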

Usage

Use this script as the first step in any TFDS-based deduplication pipeline (Wiki40B, cross-dataset, or suffix array querying). For HuggingFace Hub datasets, use load_dataset_hf.py instead.

Code Reference

Source Location

Signature

python3 scripts/load_dataset.py \
    --data_dir <tfds_data_directory> \
    --save_dir <output_directory> \
    --name <dataset_name> \
    --split <split_name> \
    [--tokenize] \
    [--tokenizer gpt2|t5] \
    [--pre_sep <bytes>] \
    [--post_sep <bytes>]

Import

# CLI script, not importable as a library.
python3 scripts/load_dataset.py --data_dir <dir> --save_dir <dir> --name <name> --split <split>

I/O Contract

Inputs

Name | Type | Required | Description
--data_dir | str | Yes | TFDS data directory (passed to the data_dir parameter of tfds.load)
--save_dir | str | Yes | Output directory for serialized files
--name | str | Yes | Dataset name (e.g., "wiki40b")
--split | str | Yes | Split name (e.g., "test", "train")
--tokenize | flag | No | If set, tokenize text to uint16 token IDs
--tokenizer | str | No | Tokenizer choice: "gpt2" (default) or "t5"
--pre_sep | bytes | No | Example separator prefix (default: \xff\xff)
--post_sep | bytes | No | Example separator suffix (default: empty)

Outputs

Name | Type | Description
<save_dir>/<name>.<split> | Binary file | Flat binary file with separator-delimited examples
<save_dir>/<name>.<split>.size | Binary file | uint64 array of cumulative byte offsets, one per example boundary

Usage Examples

Serialize Wiki40B Test Split

# Load Wiki40B test split from TFDS and serialize to binary
python3 scripts/load_dataset.py \
    --data_dir /data/tensorflow_datasets \
    --save_dir /data/suffix_arrays \
    --name wiki40b \
    --split test

# Output files:
# /data/suffix_arrays/wiki40b.test       (flat binary)
# /data/suffix_arrays/wiki40b.test.size  (offset array)

Serialize with Tokenization

# Serialize with GPT-2 tokenization for token-level deduplication
python3 scripts/load_dataset.py \
    --data_dir /data/tensorflow_datasets \
    --save_dir /data/suffix_arrays \
    --name wiki40b \
    --split train \
    --tokenize \
    --tokenizer gpt2
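
To inspect a record written in tokenized mode, the packed bytes can be turned back into token IDs. The following is a sketch under stated assumptions: the default separator (2-byte prefix, 4-byte UID, empty suffix) is in use, and both the UID and the uint16 token IDs are little-endian. The function name decode_token_record is hypothetical.

```python
import struct

PRE_SEP_LEN = 2                    # length of the default \xff\xff prefix
HEADER_LEN = PRE_SEP_LEN + 4       # prefix + 4-byte UID, empty suffix assumed

def decode_token_record(record):
    """Split one tokenized record into its UID and list of token IDs."""
    # Assumption: UID and token IDs are stored little-endian.
    (uid,) = struct.unpack("<I", record[PRE_SEP_LEN:HEADER_LEN])
    body = record[HEADER_LEN:]
    if len(body) % 2 != 0:
        raise ValueError("token payload is not a whole number of uint16 values")
    token_ids = list(struct.unpack(f"<{len(body) // 2}H", body))
    return uid, token_ids

# Toy record: separator, UID 7, then two arbitrary token IDs.
record = b"\xff\xff" + struct.pack("<I", 7) + struct.pack("<2H", 15496, 995)
uid, token_ids = decode_token_record(record)
```

The resulting IDs could then be passed to the matching tokenizer's decode method to recover text, provided the same tokenizer that produced them is used.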

Verify Output

import os

import numpy as np

# Read the .size file: a uint64 array of cumulative byte offsets
size_path = "/data/suffix_arrays/wiki40b.test.size"
sizes = np.fromfile(size_path, dtype=np.uint64)

# The offsets start at 0, so there is one more entry than there are examples
print(f"Number of examples: {len(sizes) - 1}")
print(f"Total bytes: {sizes[-1]}")

# The last offset must equal the size of the data file
data_path = "/data/suffix_arrays/wiki40b.test"
assert os.path.getsize(data_path) == int(sizes[-1]), "Size mismatch"
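
A stricter check is to confirm that every offset lands on a separator. The sketch below works on in-memory bytes and a plain list of offsets. The function name find_bad_boundaries is hypothetical, and the default \xff\xff prefix is assumed.

```python
def find_bad_boundaries(data, offsets, pre_sep=b"\xff\xff"):
    """Return the indices of records that do not begin with pre_sep."""
    bad = []
    # The final offset marks the end of the data, not the start of a record.
    for index, start in enumerate(offsets[:-1]):
        if data[start:start + len(pre_sep)] != pre_sep:
            bad.append(index)
    return bad

# Toy data: two records of 8 and 7 bytes.
data = b"\xff\xff\x01\x00\x00\x00ab" + b"\xff\xff\x02\x00\x00\x00c"

good = find_bad_boundaries(data, [0, 8, 15])     # offsets match the records
shifted = find_bad_boundaries(data, [0, 7, 15])  # second offset is off by one
```

For the real files, data could be the contents of the data file and offsets the result of sizes.tolist(). For very large datasets, a memory-mapped file would avoid loading everything at once.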

Related Pages

Implements Principle

Requires Environment
