Principle:Google research Deduplicate text datasets Self Similar Duplicate Detection

Knowledge Sources	Deduplicating Training Data Makes Language Models Better deduplicate-text-datasets
Domains	Text_Deduplication, String_Algorithms, NLP
Last Updated	2026-02-14 21:00 GMT

Overview

An algorithm that identifies all repeated substrings within a single document that exceed a given length threshold, using a suffix array to efficiently find exact duplicate byte sequences.

Description

Self-similar duplicate detection solves the problem of finding all exact substring duplicates within a single large text corpus. Given a dataset and its precomputed suffix array, the algorithm walks through consecutive entries in the sorted suffix array, computing the longest common prefix (LCP) between adjacent suffixes. When this LCP exceeds the configured length threshold, the algorithm records both the position and length of the duplicate.

This approach is fundamentally different from document-level or n-gram deduplication. Rather than comparing entire documents or fixed-size windows, it finds exact substring matches of arbitrary length at byte granularity. This captures duplicates that span partial sentences, paragraphs, or any contiguous byte range.

The algorithm is designed for datasets that are too large to hold both the data and suffix array in memory simultaneously. The suffix array is streamed from disk using buffered memory-mapped I/O, while the data file is held in RAM. Computation is parallelized by partitioning the suffix array range across multiple threads, each producing independent output files.

Usage

Use this technique when you need to find all repeated substrings within a single dataset (e.g., finding duplicated passages within a training corpus). It is the standard deduplication mode for single-file workflows. For finding duplicates between two different datasets (e.g., train/test overlap), use cross-dataset duplicate detection instead.

Theoretical Basis

The algorithm exploits the fundamental property of suffix arrays: consecutive entries in the sorted array share the longest common prefix among all suffix pairs.

Key insight: If two suffixes S[i..] and S[j..] share a common prefix of length L, they will appear adjacent (or nearly adjacent) in the suffix array. By scanning the suffix array linearly and computing LCPs, all duplicates can be found in O(N) time.

# Abstract algorithm (NOT real implementation)
def find_self_similar(data, suffix_array, length_threshold):
    duplicates = []

    for i in range(1, len(suffix_array)):
        pos_a = suffix_array[i - 1]
        pos_b = suffix_array[i]

        # Compute longest common prefix
        lcp_length = longest_common_prefix(data, pos_a, pos_b)

        if lcp_length >= length_threshold:
            # Record both positions and the duplicate length
            duplicates.append((pos_a, pos_b, lcp_length))

    return duplicates

Streaming for large datasets: The suffix array may be hundreds of gigabytes (or terabytes for datasets like C4). The algorithm streams it from disk using TableStream with 1MB buffered reads, requiring only O(1) additional memory beyond the data file itself.

Parallel partitioning: The suffix array range is divided across N threads. Each thread processes its partition independently, producing separate output files (dups_* and sizes_*). Since each partition is a contiguous range of the sorted suffix array, no inter-thread communication is needed.

Related Pages

Implemented By

Implementation:Google_research_Deduplicate_text_datasets_Cmd_Self_Similar

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment