Principle:Google research Deduplicate text datasets Cross Dataset Duplicate Detection

Knowledge Sources	Deduplicating Training Data Makes Language Models Better deduplicate-text-datasets
Domains	Text_Deduplication, String_Algorithms, NLP
Last Updated	2026-02-14 21:00 GMT

Overview

An algorithm that identifies all exact substring matches shared between two different datasets, enabling detection of train/test overlap or cross-corpus contamination.

Description

Cross-dataset duplicate detection extends the self-similar approach to find shared substrings between two distinct datasets. Given two data files and their precomputed suffix arrays, the algorithm walks through both suffix arrays simultaneously in a merge-like fashion, computing longest common prefixes between suffixes drawn from different datasets. When a shared prefix exceeds the length threshold, it is recorded as a cross-dataset duplicate.

This technique is critical for detecting data contamination in machine learning, where training data may inadvertently contain test set examples. The original paper uses it to detect exact substring overlap between training and evaluation sets of large language models.

Unlike self-similar detection (which finds repeated substrings within one file), across-similar specifically identifies substrings that appear in both datasets. The output is bidirectional: it records duplicates found in dataset 1 that originate from dataset 2 (dups_S1_*) and vice versa (dups_S2_*).

Usage

Use this technique when you need to find shared content between two datasets, such as detecting train/test data leakage, measuring overlap between different corpora, or ensuring a new dataset does not contain content from an existing benchmark. Both datasets must be serialized to flat binary files with precomputed suffix arrays.

Theoretical Basis

The algorithm performs a coordinated walk over two suffix arrays:

# Abstract algorithm (NOT real implementation)
def find_across_similar(data1, sa1, data2, sa2, length_threshold):
    duplicates_in_1 = []  # Substrings in data1 also found in data2
    duplicates_in_2 = []  # Substrings in data2 also found in data1

    # Merge-walk through both suffix arrays
    i, j = 0, 0
    while i < len(sa1) and j < len(sa2):
        suffix1 = data1[sa1[i]:]
        suffix2 = data2[sa2[j]:]

        lcp = longest_common_prefix(suffix1, suffix2)

        if lcp >= length_threshold:
            duplicates_in_1.append((sa1[i], lcp))
            duplicates_in_2.append((sa2[j], lcp))

        if suffix1 < suffix2:
            i += 1
        else:
            j += 1

    return duplicates_in_1, duplicates_in_2

Bidirectional output: The algorithm produces two sets of duplicate markers. This is important because the user may want to clean only one dataset (e.g., remove test-set content found in the training set) while keeping the other intact.

Parallel partitioning: Like self-similar, the computation is divided across threads. Each thread handles a range of the combined suffix space and produces independent output files prefixed with S1 or S2 to indicate which dataset the duplicate was found in.

Related Pages

Implemented By

Implementation:Google_research_Deduplicate_text_datasets_Cmd_Across_Similar

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment