Workflow:Google research Deduplicate text datasets Single file deduplication

Knowledge Sources	Deduplicate Text Datasets Deduplicating Training Data Makes Language Models Better
Domains	Data_Engineering, NLP, Deduplication
Last Updated	2026-02-14 21:00 GMT

Overview

End-to-end process for removing exact substring duplicates from a single flat binary file using suffix array construction and parallel duplicate detection.

Description

This workflow provides the simplest path for deduplicating a single file. Given a raw binary file containing text data, it constructs a suffix array to efficiently locate all repeated substrings above a configurable length threshold. The duplicate byte ranges are then collected and excised from the original file, producing a clean deduplicated output.

Key characteristics:

Operates on a single pre-existing binary file (no dataset loading step)
Uses parallel suffix array construction to handle files larger than available RAM
Identifies all repeated substrings above a configurable byte-length threshold
Produces a new file with duplicate byte ranges removed
Can be run iteratively to catch newly-created duplicates from prior passes

Usage

Execute this workflow when you have a single flat binary file (text or tokenized) and need to remove all exact substring duplicates above a given length. This is appropriate when you already have your data serialized to a single file and do not need dataset-specific loading or output formatting. Typical use cases include cleaning a pre-tokenized training corpus or deduplicating a raw text dump.

Execution Steps

Step 1: Build suffix array

Construct a suffix array for the input file using parallel chunked construction. The file is split into overlapping chunks, a suffix array is built independently for each chunk, and the partial suffix arrays are merged into a single complete suffix array covering the entire file.

Key considerations:

The number of parallel jobs scales with file size (1 job for files under 10MB, up to 100 jobs for files over 10GB)
Chunks overlap by 100,000 bytes (HACKSIZE) to handle cross-boundary matches
Variable-width pointers are used to minimize disk usage (4 bytes for files under 4GB, 5 bytes for files under 1TB, etc.)
If the process fails due to too many open file handles, increase the limit with ulimit

Step 2: Find self-similar duplicates

Scan the suffix array to identify all substrings of at least the specified length threshold that appear more than once in the file. This step streams through the suffix array in parallel, comparing adjacent entries to find clusters of identical substrings.

Key considerations:

The length threshold is dataset-dependent (the paper uses 50 tokens, which equals 100 bytes for untokenized text)
Output consists of dups files (byte offsets of duplicate occurrences) and sizes files (cluster sizes) written to a cache directory
All pointers use the minimum byte width needed to address the dataset
Execution is embarrassingly parallel across suffix array ranges

Step 3: Collect duplicate ranges

Merge the individual length-threshold duplicate pointers into consolidated byte ranges [a, b) that should be removed from the file. Overlapping and adjacent duplicate regions are merged into contiguous ranges.

Key considerations:

Each byte in an output range is a member of at least one length-threshold duplicate match
The full substring containing a range may not itself be fully duplicated, but every sub-window of threshold length within the range is
Output is a list of (start_byte, end_byte) pairs written to stdout

Step 4: Remove duplicate byte ranges

Read the original file and produce a new output file with all identified duplicate byte ranges excised. The script iterates through the removal list, copying non-duplicate regions and skipping over duplicate ranges.

Key considerations:

The removal slightly breaks text flow at excision boundaries (e.g., "Alice wanted to go to the store" may become "Alice wantedstore")
In practice this does not harm language model training because the removed fraction is small
The process can be repeated: running the full pipeline on the deduplicated output typically reduces remaining duplicates by over 100,000x

Execution Diagram

GitHub URL

Workflow Repository