Principle:Google research Deduplicate text datasets Cross Dataset Duplicate Detection
| Knowledge Sources | |
|---|---|
| Domains | Text_Deduplication, String_Algorithms, NLP |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
An algorithm that identifies all exact substring matches shared between two different datasets, enabling detection of train/test overlap or cross-corpus contamination.
Description
Cross-dataset duplicate detection extends the self-similar approach to find shared substrings between two distinct datasets. Given two data files and their precomputed suffix arrays, the algorithm walks through both suffix arrays simultaneously in a merge-like fashion, computing longest common prefixes between suffixes drawn from different datasets. When a shared prefix exceeds the length threshold, it is recorded as a cross-dataset duplicate.
This technique is critical for detecting data contamination in machine learning, where training data may inadvertently contain test set examples. The original paper uses it to detect exact substring overlap between training and evaluation sets of large language models.
Unlike self-similar detection (which finds repeated substrings within one file), across-similar specifically identifies substrings that appear in both datasets. The output is bidirectional: it records duplicates found in dataset 1 that originate from dataset 2 (dups_S1_*) and vice versa (dups_S2_*).
Usage
Use this technique when you need to find shared content between two datasets, such as detecting train/test data leakage, measuring overlap between different corpora, or ensuring a new dataset does not contain content from an existing benchmark. Both datasets must be serialized to flat binary files with precomputed suffix arrays.
Theoretical Basis
The algorithm performs a coordinated walk over two suffix arrays:
# Abstract algorithm (NOT real implementation)
def find_across_similar(data1, sa1, data2, sa2, length_threshold):
duplicates_in_1 = [] # Substrings in data1 also found in data2
duplicates_in_2 = [] # Substrings in data2 also found in data1
# Merge-walk through both suffix arrays
i, j = 0, 0
while i < len(sa1) and j < len(sa2):
suffix1 = data1[sa1[i]:]
suffix2 = data2[sa2[j]:]
lcp = longest_common_prefix(suffix1, suffix2)
if lcp >= length_threshold:
duplicates_in_1.append((sa1[i], lcp))
duplicates_in_2.append((sa2[j], lcp))
if suffix1 < suffix2:
i += 1
else:
j += 1
return duplicates_in_1, duplicates_in_2
Bidirectional output: The algorithm produces two sets of duplicate markers. This is important because the user may want to clean only one dataset (e.g., remove test-set content found in the training set) while keeping the other intact.
Parallel partitioning: Like self-similar, the computation is divided across threads. Each thread handles a range of the combined suffix space and produces independent output files prefixed with S1 or S2 to indicate which dataset the duplicate was found in.