Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Google research Deduplicate text datasets Cross Dataset Duplicate Detection

From Leeroopedia
Knowledge Sources
Domains Text_Deduplication, String_Algorithms, NLP
Last Updated 2026-02-14 21:00 GMT

Overview

An algorithm that identifies all exact substring matches shared between two different datasets, enabling detection of train/test overlap or cross-corpus contamination.

Description

Cross-dataset duplicate detection extends the self-similar approach to find shared substrings between two distinct datasets. Given two data files and their precomputed suffix arrays, the algorithm walks through both suffix arrays simultaneously in a merge-like fashion, computing longest common prefixes between suffixes drawn from different datasets. When a shared prefix exceeds the length threshold, it is recorded as a cross-dataset duplicate.

This technique is critical for detecting data contamination in machine learning, where training data may inadvertently contain test set examples. The original paper uses it to detect exact substring overlap between training and evaluation sets of large language models.

Unlike self-similar detection (which finds repeated substrings within one file), across-similar specifically identifies substrings that appear in both datasets. The output is bidirectional: it records duplicates found in dataset 1 that originate from dataset 2 (dups_S1_*) and vice versa (dups_S2_*).

Usage

Use this technique when you need to find shared content between two datasets, such as detecting train/test data leakage, measuring overlap between different corpora, or ensuring a new dataset does not contain content from an existing benchmark. Both datasets must be serialized to flat binary files with precomputed suffix arrays.

Theoretical Basis

The algorithm performs a coordinated walk over two suffix arrays:

# Abstract algorithm (NOT real implementation)
def find_across_similar(data1, sa1, data2, sa2, length_threshold):
    duplicates_in_1 = []  # Substrings in data1 also found in data2
    duplicates_in_2 = []  # Substrings in data2 also found in data1

    # Merge-walk through both suffix arrays
    i, j = 0, 0
    while i < len(sa1) and j < len(sa2):
        suffix1 = data1[sa1[i]:]
        suffix2 = data2[sa2[j]:]

        lcp = longest_common_prefix(suffix1, suffix2)

        if lcp >= length_threshold:
            duplicates_in_1.append((sa1[i], lcp))
            duplicates_in_2.append((sa2[j], lcp))

        if suffix1 < suffix2:
            i += 1
        else:
            j += 1

    return duplicates_in_1, duplicates_in_2

Bidirectional output: The algorithm produces two sets of duplicate markers. This is important because the user may want to clean only one dataset (e.g., remove test-set content found in the training set) while keeping the other intact.

Parallel partitioning: Like self-similar, the computation is divided across threads. Each thread handles a range of the combined suffix space and produces independent output files prefixed with S1 or S2 to indicate which dataset the duplicate was found in.

Related Pages

Implemented By

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment