Workflow:Google research Deduplicate text datasets Cross dataset deduplication
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP, Deduplication |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
End-to-end process for finding and removing exact substring duplicates shared between two different datasets, for example to detect train/test overlap or cross-corpus contamination.
Description
This workflow identifies every substring above a length threshold that appears in both of two distinct datasets. Unlike self-similar deduplication, which finds repeats within a single file, cross-dataset deduplication walks both suffix arrays simultaneously in a single linear pass to find shared content. This is essential for detecting data contamination, such as test set leakage into training data or overlap between independently collected corpora.
Key characteristics:
- Compares two separate datasets by walking their suffix arrays in parallel
- Requires both datasets and their suffix arrays to fit in memory simultaneously
- Produces duplicate locations for both datasets (what in dataset A also appears in dataset B, and vice versa)
- Linear time complexity O(len(dataset1) + len(dataset2)) after suffix array construction
- More efficient than per-query lookup when comparing large amounts of text
Usage
Run this workflow when you need to find content shared between two different datasets. Common use cases include detecting train/test set overlap (data contamination), finding duplicates between independently collected corpora, and auditing whether a new dataset contains content already present in existing training data. Both datasets must have pre-built suffix arrays, and the machine must have enough RAM to hold both datasets and their suffix arrays in memory at once.
Execution Steps
Step 1: Load and serialize both datasets
Load each dataset and serialize it into a flat binary file with separator tokens and unique IDs. This can use either the TFDS loader or the HuggingFace loader depending on the dataset source. Both datasets must be serialized independently, each producing its own binary file and size file.
Key considerations:
- Both datasets must use the same serialization format (same separator scheme)
- If using tokenization, both datasets must use the same tokenizer
- The HuggingFace loader supports loading from local files (text, JSON, CSV) or from the HuggingFace Hub
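The serialization idea in this step can be sketched as follows. This is a minimal illustration, not the loaders' actual code: the function name `serialize_documents`, the two-byte `\xff\xff` separator, and the 4-byte little-endian ID layout are assumptions chosen for the example. Whatever scheme is used, it must be identical for both datasets.

```python
import struct

# Assumed separator bytes for this sketch; the real loaders may differ.
SEPARATOR = b"\xff\xff"


def serialize_documents(docs: list[bytes]) -> tuple[bytes, list[int]]:
    """Concatenate documents into one flat byte string.

    Each document is preceded by SEPARATOR plus a unique 4-byte ID, so
    that no two separators are identical and a match can never span a
    document boundary by accident. Returns the flat bytes and the
    cumulative byte offsets (the content of a hypothetical size file).
    """
    flat = bytearray()
    offsets = [0]
    for uid, doc in enumerate(docs):
        flat += SEPARATOR + struct.pack("<I", uid)  # separator + unique ID
        flat += doc
        offsets.append(len(flat))
    return bytes(flat), offsets


data, offsets = serialize_documents([b"ab", b"cde"])
```

The offsets list lets a later step map a byte position in the flat file back to the document that contains it.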
Step 2: Build suffix arrays for both datasets
Construct suffix arrays for each dataset independently using the parallel chunked construction process. Each dataset gets its own suffix array file.
Key considerations:
- Both suffix arrays must be fully built before the cross-comparison step
- The two suffix array builds can run in parallel on different cores if resources allow
- Variable-width pointers are determined independently per dataset based on its size
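To illustrate why the pointer width depends on dataset size, the sketch below computes the smallest whole number of bytes that can address every position in a file of a given length. The helper name `pointer_width` is hypothetical, and the exact rounding rule used by the real suffix array builder is an assumption here.

```python
def pointer_width(dataset_len: int) -> int:
    """Smallest number of bytes able to hold any offset in [0, dataset_len)."""
    if dataset_len < 1:
        raise ValueError("dataset must contain at least one byte")
    largest_offset = dataset_len - 1
    bits_needed = max(1, largest_offset.bit_length())
    return (bits_needed + 7) // 8  # round up to whole bytes


small = pointer_width(256)            # offsets 0..255 fit in one byte
medium = pointer_width(257)           # offset 256 needs a second byte
large = pointer_width(350 * 10**9)    # a 350 GB file
```

Because each dataset is sized independently, two suffix arrays built for the same comparison may use different pointer widths, and any reader of those files has to derive the width from the corresponding dataset's length.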
Step 3: Find across-similar duplicates
Perform a linear walk of both suffix arrays simultaneously to find all substrings above the length threshold that appear in both datasets. The algorithm merges the two sorted suffix arrays, identifying clusters of entries that span both datasets.
Key considerations:
- Both datasets must fit entirely in memory for this step
- For very large datasets (e.g., C4 at 350 GB), this requires machines with substantial RAM
- Output is written to a cache directory with dups and sizes files, separately tagging duplicates found in each dataset
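- This is more efficient than per-query lookup once the query text exceeds approximately len(dataset)/log(len(dataset)) bytes: per-query lookup performs one binary search, costing O(log(len(dataset))), for every position in the query, whereas the linear walk makes a single pass over both suffix arrays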
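The linear walk can be sketched in pure Python as below. This is an illustrative toy under stated assumptions, not the tool's implementation: the suffix arrays are built naively by sorting, slices are copied on every comparison, and the names `build_suffix_array`, `common_prefix_len`, and `positions_shared_with` are hypothetical. It relies on the property that, in sorted order, the longest match for a suffix of one dataset is always one of its two neighbors among the other dataset's suffixes.

```python
def build_suffix_array(data: bytes) -> list[int]:
    """Naive suffix array: start offsets sorted by the suffix they begin."""
    return sorted(range(len(data)), key=lambda i: data[i:])


def common_prefix_len(x: bytes, i: int, y: bytes, j: int, limit: int) -> int:
    """Length of the common prefix of x[i:] and y[j:], capped at limit."""
    n = 0
    while n < limit and i + n < len(x) and j + n < len(y) and x[i + n] == y[j + n]:
        n += 1
    return n


def positions_shared_with(data_a, sa_a, data_b, sa_b, threshold):
    """Offsets in data_a that start a substring of at least `threshold`
    bytes which also occurs somewhere in data_b."""
    hits = []
    j = 0  # index of the first suffix of B that is >= the current suffix of A
    for pos in sa_a:
        # Both arrays are sorted, so j only ever moves forward.
        while j < len(sa_b) and data_b[sa_b[j]:] < data_a[pos:]:
            j += 1
        # Only the two sorted neighbors in B can hold the longest match.
        for k in (j - 1, j):
            if 0 <= k < len(sa_b) and common_prefix_len(
                data_a, pos, data_b, sa_b[k], threshold
            ) >= threshold:
                hits.append(pos)
                break
    return sorted(hits)


train = b"the quick brown fox"
test = b"a quick brown dog"
sa_train, sa_test = build_suffix_array(train), build_suffix_array(test)
dups_in_train = positions_shared_with(train, sa_train, test, sa_test, 10)
dups_in_test = positions_shared_with(test, sa_test, train, sa_train, 10)
```

Running the function once in each direction mirrors the way the step tags duplicates separately for each dataset: here the shared 13-byte span `" quick brown "` yields start offsets 3 to 6 in `train` and 1 to 4 in `test` at a threshold of 10 bytes.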
Step 4: Collect duplicate ranges
Merge the duplicate pointers for either or both datasets into consolidated byte ranges. This step can be run targeting either dataset to produce removal ranges specific to that dataset.
Key considerations:
- You can choose to remove duplicates from one or both datasets
- Typically, duplicates are removed from the training set to prevent test-set contamination
- The collect step is run once per dataset you want to clean
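Conceptually, collecting means turning many overlapping duplicate start positions into a short list of non-overlapping byte ranges. The sketch below assumes each duplicate pointer marks a match of exactly the threshold length; the function name `collect_ranges` and this input shape are illustrative assumptions rather than the tool's actual file format.

```python
def collect_ranges(starts: list[int], length: int) -> list[tuple[int, int]]:
    """Merge [start, start + length) intervals into disjoint byte ranges."""
    merged: list[list[int]] = []
    for start in sorted(starts):
        end = start + length
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous range: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


# Four overlapping 10-byte matches collapse into one 13-byte range.
ranges = collect_ranges([3, 4, 5, 6, 40], 10)
```

Running the same merge on the pointers recorded for the other dataset gives that dataset's removal ranges, which is why the collect step is repeated per dataset.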
Step 5: Remove duplicate byte ranges
Apply the collected byte ranges to the target dataset(s) to produce deduplicated output. Use the appropriate finish script based on the dataset format (single file or TFDS).
Key considerations:
- The choice of finish script depends on the original dataset format
- For single files, use the single-file finish script
- For TFDS datasets, use the Wiki40B finish script (or write a custom one for other TFDS schemas)
- After removal, you may want to re-run the cross-comparison to verify no significant overlap remains
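For a single flat file, the removal step amounts to copying everything outside the collected ranges. The following is a minimal sketch assuming the ranges are sorted, half-open `(start, end)` byte offsets; `remove_ranges` is a hypothetical helper, and the real finish scripts also have to deal with document boundaries and dataset-specific formats.

```python
def remove_ranges(data: bytes, ranges: list[tuple[int, int]]) -> bytes:
    """Return data with every [start, end) range cut out.

    Assumes ranges are sorted by start offset.
    """
    kept = bytearray()
    cursor = 0  # first byte not yet copied or skipped
    for start, end in ranges:
        kept += data[cursor:start]  # keep the gap before this range
        cursor = max(cursor, end)   # skip the duplicate span
    kept += data[cursor:]           # keep the tail after the last range
    return bytes(kept)


cleaned = remove_ranges(b"the quick brown fox", [(3, 16)])
```

Re-running the cross-comparison on the cleaned output is a simple way to confirm that the ranges were applied to the right file and that no match above the threshold remains.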