Workflow:Google research Deduplicate text datasets Single file deduplication
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP, Deduplication |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
End-to-end process for removing exact substring duplicates from a single flat binary file using suffix array construction and parallel duplicate detection.
Description
This workflow provides the simplest path for deduplicating a single file. Given a raw binary file containing text data, it constructs a suffix array to efficiently locate all repeated substrings above a configurable length threshold. The duplicate byte ranges are then collected and excised from the original file, producing a clean deduplicated output.
Key characteristics:
- Operates on a single pre-existing binary file (no dataset loading step)
- Uses parallel suffix array construction to handle files larger than available RAM
- Identifies all repeated substrings above a configurable byte-length threshold
- Produces a new file with duplicate byte ranges removed
- Can be run iteratively to catch newly-created duplicates from prior passes
Usage
Execute this workflow when you have a single flat binary file (text or tokenized) and need to remove all exact substring duplicates above a given length. This is appropriate when you already have your data serialized to a single file and do not need dataset-specific loading or output formatting. Typical use cases include cleaning a pre-tokenized training corpus or deduplicating a raw text dump.
Execution Steps
Step 1: Build suffix array
Construct a suffix array for the input file using parallel chunked construction. The file is split into overlapping chunks, a suffix array is built independently for each chunk, and the partial suffix arrays are merged into a single complete suffix array covering the entire file.
Key considerations:
- The number of parallel jobs scales with file size (1 job for files under 10MB, up to 100 jobs for files over 10GB)
- Chunks overlap by 100,000 bytes (HACKSIZE) to handle cross-boundary matches
- Variable-width pointers are used to minimize disk usage (4 bytes for files under 4GB, 5 bytes for files under 1TB, etc.)
- If the process fails due to too many open file handles, increase the limit with ulimit
Step 2: Find self-similar duplicates
Scan the suffix array to identify all substrings of at least the specified length threshold that appear more than once in the file. This step streams through the suffix array in parallel, comparing adjacent entries to find clusters of identical substrings.
Key considerations:
- The length threshold is dataset-dependent (the paper uses 50 tokens, which equals 100 bytes for untokenized text)
- Output consists of dups files (byte offsets of duplicate occurrences) and sizes files (cluster sizes) written to a cache directory
- All pointers use the minimum byte width needed to address the dataset
- Execution is embarrassingly parallel across suffix array ranges
Step 3: Collect duplicate ranges
Merge the individual length-threshold duplicate pointers into consolidated byte ranges [a, b) that should be removed from the file. Overlapping and adjacent duplicate regions are merged into contiguous ranges.
Key considerations:
- Each byte in an output range is a member of at least one length-threshold duplicate match
- The full substring containing a range may not itself be fully duplicated, but every sub-window of threshold length within the range is
- Output is a list of (start_byte, end_byte) pairs written to stdout
Step 4: Remove duplicate byte ranges
Read the original file and produce a new output file with all identified duplicate byte ranges excised. The script iterates through the removal list, copying non-duplicate regions and skipping over duplicate ranges.
Key considerations:
- The removal slightly breaks text flow at excision boundaries (e.g., "Alice wanted to go to the store" may become "Alice wantedstore")
- In practice this does not harm language model training because the removed fraction is small
- The process can be repeated: running the full pipeline on the deduplicated output typically reduces remaining duplicates by over 100,000x