Workflow:Google research Deduplicate text datasets Wiki40B TFDS deduplication

Knowledge Sources	Deduplicate Text Datasets Deduplicating Training Data Makes Language Models Better TensorFlow Datasets
Domains	Data_Engineering, NLP, Deduplication
Last Updated	2026-02-14 21:00 GMT

Overview

End-to-end process for deduplicating a TensorFlow Datasets (TFDS) dataset such as Wiki40B, from loading through suffix array construction to producing a clean deduplicated TFDS output.

Description

This workflow implements the canonical deduplication pipeline described in the paper. It loads a dataset from TensorFlow Datasets, serializes all examples into a single flat binary file with separator tokens and unique IDs, builds a suffix array, identifies all repeated substrings, collects the duplicate byte ranges, and then maps those byte ranges back to individual examples to produce a new deduplicated TFDS dataset. The separator tokens prevent false cross-example matches in the suffix array.

Key characteristics:

Full pipeline from TFDS input to TFDS output
Separator-based serialization prevents cross-example suffix array matches
Size file tracks byte offsets for mapping back to individual examples
Output is a valid TFDS dataset that can be loaded normally
Optional GPT-2 or T5 tokenization to reduce dataset size by ~2x
Supports iterative deduplication (run twice to catch newly-created duplicates)

Usage

Execute this workflow when you have a TensorFlow Datasets dataset (particularly Wiki40B) and need to produce a deduplicated version as a valid TFDS dataset. This is the primary workflow demonstrated in the paper and is suitable for datasets that are loaded via the TFDS API. For HuggingFace datasets, substitute the loading step with the HuggingFace loader. For arbitrary single files, use the Single File Deduplication workflow instead.

Execution Steps

Step 1: Load and serialize dataset

Load the dataset from TensorFlow Datasets and serialize all text examples into a single flat binary file. Each example is prefixed with a two-byte separator marker (0xFF 0xFF) followed by a 4-byte unique identifier to prevent the suffix array from finding spurious cross-example matches. A companion size file records the byte offset where each example begins.

Key considerations:

The separator/UID scheme ensures the suffix array never matches across example boundaries
Optional tokenization (GPT-2 or T5) can halve the file size, but the length threshold must be doubled to compensate
The size file is essential for the final step to map byte ranges back to individual examples
Uses multiprocessing for parallel tokenization when enabled

Step 2: Build suffix array

Construct a suffix array for the serialized binary file using parallel chunked construction. The file is split into overlapping chunks, suffix arrays are built independently for each chunk, and the partial arrays are merged into a single complete suffix array.

Key considerations:

The number of parallel jobs scales automatically with file size
Chunks overlap by 100,000 bytes to handle matches spanning chunk boundaries
For large files, increase the open file limit with ulimit before running
Variable-width pointers minimize disk usage based on file size

Step 3: Find self-similar duplicates

Scan the suffix array to identify all substrings of at least the configured length threshold that appear more than once anywhere in the dataset. The algorithm streams through the suffix array in parallel, comparing adjacent entries to find clusters of identical substrings.

Key considerations:

The paper uses a length threshold of 100 bytes (50 tokens) for untokenized text
Output is written to a cache directory as dups (byte offsets) and sizes (cluster sizes) files
The number of threads should match available CPU cores for optimal performance

Step 4: Collect duplicate ranges

Merge the individual duplicate pointers into consolidated byte ranges [a, b) representing contiguous regions where every sub-window of threshold length is duplicated. Overlapping and adjacent regions are combined.

Key considerations:

Output is a text file of (start_byte, end_byte) pairs, one per line
This step reduces millions of individual duplicate pointers to a manageable list of byte ranges

Step 5: Apply deduplication to TFDS dataset

Map the byte ranges back to individual dataset examples using the size file from Step 1. For each affected example, excise the duplicate byte ranges from its text content. Rebuild the dataset as a valid TFDS dataset with the same schema as the original, ready for normal use.

Key considerations:

The size file maps global byte offsets back to per-example local offsets
A 6-byte adjustment accounts for the separator/UID prefix per example
The output TFDS dataset preserves the original schema (text, version_id, wikidata_id, etc.)
Currently only Wiki40B output formatting is implemented; other datasets require custom finish scripts
Running the full pipeline again on the output typically reduces remaining duplicates to near zero

Execution Diagram

GitHub URL

Workflow Repository