Workflow:Google research Deduplicate text datasets Cross dataset deduplication

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, NLP, Deduplication
Last Updated 2026-02-14 21:00 GMT

Overview

End-to-end process for finding and removing exact substring duplicates shared between two different datasets, for example to detect train/test overlap or cross-corpus contamination.

Description

This workflow identifies all substrings above a length threshold that appear in both of two distinct datasets. Unlike self-similar deduplication, which finds repeats within a single file, cross-dataset deduplication walks both suffix arrays in one simultaneous linear pass to find shared content. This is essential for detecting data contamination, such as test-set leakage into training data or overlap between independently collected corpora.

Key characteristics:

  • Compares two separate datasets by walking their suffix arrays in parallel
  • Requires both datasets and their suffix arrays to fit in memory simultaneously
  • Reports duplicate locations in both datasets: the spans of dataset A that also appear in dataset B, and vice versa
  • Linear time complexity O(len(dataset1) + len(dataset2)) after suffix array construction
  • More efficient than per-query lookup when comparing large amounts of text

Usage

Execute this workflow when you need to find shared content between two different datasets. Common use cases include detecting train/test set overlap (data contamination), finding duplicates between independently collected corpora, and auditing whether a new dataset contains content already present in existing training data. Both datasets must have pre-built suffix arrays, and the machine must have enough RAM to hold both datasets in memory.

Execution Steps

Step 1: Load and serialize both datasets

Load each dataset and serialize it into a flat binary file with separator tokens and unique IDs. This can use either the TFDS loader or the HuggingFace loader depending on the dataset source. Both datasets must be serialized independently, each producing its own binary file and size file.

Key considerations:

  • Both datasets must use the same serialization format (same separator scheme)
  • If using tokenization, both datasets must use the same tokenizer
  • The HuggingFace loader supports loading from local files (text, JSON, CSV) or from the HuggingFace Hub

Step 2: Build suffix arrays for both datasets

Construct suffix arrays for each dataset independently using the parallel chunked construction process. Each dataset gets its own suffix array file.

Key considerations:

  • Both suffix arrays must be fully built before the cross-comparison step
  • The two suffix array builds can run in parallel on different cores if resources allow
  • Variable-width pointers are determined independently per dataset based on its size
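To make the two ideas in this step concrete, here is a small sketch under stated assumptions. `naive_suffix_array` is a quadratic-cost stand-in for the parallel chunked construction and is only suitable for tiny inputs, and `pointer_width` assumes the pointer size is the smallest whole number of bytes that can address every offset in the dataset. Both function names are hypothetical.

```python
def pointer_width(num_bytes):
    """Smallest number of whole bytes able to address offsets 0..num_bytes-1.

    Assumed rule for illustration; the real tool may round differently.
    """
    if num_bytes < 1:
        raise ValueError("dataset must contain at least one byte")
    return max(1, ((num_bytes - 1).bit_length() + 7) // 8)

def naive_suffix_array(data):
    """Return the start offsets of all suffixes of `data`, sorted
    lexicographically. Slicing makes this far too slow for real corpora."""
    return sorted(range(len(data)), key=lambda i: data[i:])

# Each dataset gets its own suffix array and its own pointer width.
small = b"banana"
sa_small = naive_suffix_array(small)
width_small = pointer_width(len(small))
width_large = pointer_width(350 * 10**9)
```

Under this assumed rule a tiny file needs one byte per pointer while a 350 GB corpus needs five, which is why the width is chosen per dataset rather than fixed globally.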

Step 3: Find across-similar duplicates

Perform a linear walk of both suffix arrays simultaneously to find all substrings above the length threshold that appear in both datasets. The algorithm merges the two sorted suffix arrays and identifies clusters of adjacent entries that share a prefix of at least the threshold length and include suffixes from both datasets.

Key considerations:

  • Both datasets must fit entirely in memory for this step
  • For very large datasets (e.g., C4 at 350GB), this requires machines with substantial RAM
  • Output is written to a cache directory with dups and sizes files, separately tagging duplicates found in each dataset
  • This is more efficient than per-query lookup when the query text exceeds approximately len(dataset)/log(len(query)) bytes
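The merge-based walk can be sketched as below. This is an illustrative toy, not the tool's implementation: the function names are hypothetical, suffixes are compared by slicing (so the sketch is not truly linear time), and results are returned as lists instead of being written to dups and sizes files. It relies on the fact that a suffix's closest match in the other dataset is always its nearest neighbour from that dataset in sorted order, so each entry is checked against the previous and the next entry from the other side.

```python
def naive_suffix_array(data):
    """Toy suffix array builder, only suitable for tiny inputs."""
    return sorted(range(len(data)), key=lambda i: data[i:])

def common_prefix_len(x, y, limit):
    """Length of the shared prefix of x and y, capped at `limit`."""
    n = min(len(x), len(y), limit)
    for k in range(n):
        if x[k] != y[k]:
            return k
    return n

def across_similar(a, sa_a, b, sa_b, threshold):
    """Return (starts_in_a, starts_in_b): offsets whose suffix shares at
    least `threshold` leading bytes with some suffix of the other dataset."""
    # Merge the two sorted suffix arrays into one ordered sequence,
    # tagging each entry with the dataset it came from (0 = a, 1 = b).
    merged = []
    i = j = 0
    while i < len(sa_a) and j < len(sa_b):
        if a[sa_a[i]:] <= b[sa_b[j]:]:
            merged.append((0, sa_a[i]))
            i += 1
        else:
            merged.append((1, sa_b[j]))
            j += 1
    merged += [(0, p) for p in sa_a[i:]]
    merged += [(1, p) for p in sa_b[j:]]

    data = (a, b)
    dups = (set(), set())
    # A forward pass checks each entry against the previous entry from the
    # other dataset; a backward pass checks it against the next one.
    for order in (merged, merged[::-1]):
        last = [None, None]  # most recent suffix start seen per dataset
        for which, pos in order:
            other = last[1 - which]
            if other is not None:
                shared = common_prefix_len(
                    data[which][pos:], data[1 - which][other:], threshold
                )
                if shared >= threshold:
                    dups[which].add(pos)
            last[which] = pos
    return sorted(dups[0]), sorted(dups[1])

a = b"abcde_abcdf"
b = b"abcd"
in_a, in_b = across_similar(a, naive_suffix_array(a), b, naive_suffix_array(b), 4)
```

In this toy input the four-byte string `abcd` occurs twice in the first dataset and once in the second, so both occurrences in `a` are tagged from a single match in `b`.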

Step 4: Collect duplicate ranges

Merge the duplicate pointers found in Step 3 into consolidated, non-overlapping byte ranges. The step targets one dataset at a time and produces removal ranges specific to that dataset.

Key considerations:

  • You can choose to remove duplicates from one or both datasets
  • Typically, duplicates are removed from the training set to prevent test-set contamination
  • The collect step is run once per dataset you want to clean
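Consolidating pointers into ranges amounts to interval merging. The sketch below assumes each duplicate is given as a start offset plus a length and returns half-open `(start, end)` ranges; the `collect_ranges` name and the in-memory list inputs are illustrative assumptions, not the tool's interface.

```python
def collect_ranges(starts, sizes):
    """Merge (start, size) duplicate records into sorted, non-overlapping
    half-open byte ranges [start, end). Touching ranges are joined."""
    intervals = sorted((s, s + n) for s, n in zip(starts, sizes))
    merged = []
    for start, end in intervals:
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous range: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]

# Run once per dataset to be cleaned, using that dataset's duplicate records.
train_ranges = collect_ranges([10, 12, 40], [5, 6, 3])
```

Here the first two records overlap and collapse into one range, while the third stays separate.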

Step 5: Remove duplicate byte ranges

Apply the collected byte ranges to the target dataset(s) to produce deduplicated output. Use the appropriate finish script based on the dataset format (single file or TFDS).

Key considerations:

  • The choice of finish script depends on the original dataset format
  • For single files, use the single file finish script
  • For TFDS datasets, use the Wiki40B finish script (or write a custom one for other TFDS schemas)
  • After removal, you may want to re-run the cross-comparison to verify no significant overlap remains
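For the single-file case, applying the ranges reduces to copying everything outside them. The sketch below assumes the ranges are sorted, non-overlapping, and half-open, and it works on an in-memory byte string; the `remove_ranges` helper is hypothetical and does not reproduce the finish scripts, which also have to deal with the dataset's original format.

```python
def remove_ranges(data, ranges):
    """Return `data` with every half-open byte range [start, end) removed.

    Assumes `ranges` is sorted by start and contains no overlaps.
    """
    out = bytearray()
    cursor = 0
    for start, end in ranges:
        out += data[cursor:start]  # keep the bytes before this range
        cursor = end               # skip the duplicated bytes
    out += data[cursor:]           # keep whatever follows the last range
    return bytes(out)

cleaned = remove_ranges(b"xxhello worldyy", [(2, 13)])
```

Re-running the cross-comparison on the cleaned output is a cheap way to confirm that the targeted overlap is gone.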

GitHub URL

Workflow Repository