Principle:Google research Deduplicate text datasets Substring Occurrence Querying

Knowledge Sources	Deduplicating Training Data Makes Language Models Better deduplicate-text-datasets
Domains	String_Algorithms, Data_Structures, NLP
Last Updated	2026-02-14 21:00 GMT

Overview

A query technique that uses a precomputed suffix array to count (and optionally locate) all occurrences of a given substring in a large text dataset in O(M log N) time.

Description

Substring occurrence querying leverages the sorted property of suffix arrays to perform fast substring search via binary search. Given a query string of length M and a dataset of size N with a precomputed suffix array, the algorithm finds all positions where the query appears by performing two binary searches: one to find the leftmost matching suffix and one to find the rightmost. The count of occurrences is simply the difference between these two positions.

This technique enables ad-hoc analysis of large corpora without re-scanning the entire dataset. For example, researchers can quickly check how many times a specific sentence, phrase, or token sequence appears in a training set. With the optional --print-location flag, the exact byte position of each occurrence can also be reported.

The query can operate on raw text (UTF-8 encoded) or tokenized sequences (uint16 token IDs), enabling both character-level and token-level search.

Usage

Use this technique when you need to count or locate occurrences of a specific substring in a large dataset that already has a suffix array built. It is useful for data analysis, memorization detection, and verifying deduplication results. Do not use for bulk deduplication (use self-similar or across-similar instead).

Theoretical Basis

Binary search on a suffix array:

# Abstract algorithm (NOT real implementation)
def count_occurrences(data, suffix_array, query):
    """Count occurrences of query in data using binary search on suffix array."""
    n = len(suffix_array)

    # Find leftmost position where query could appear
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        suffix = data[suffix_array[mid]:]
        if suffix[:len(query)] < query:
            lo = mid + 1
        else:
            hi = mid
    left = lo

    # Find rightmost position where query could appear
    lo, hi = left, n
    while lo < hi:
        mid = (lo + hi) // 2
        suffix = data[suffix_array[mid]:]
        if suffix[:len(query)] <= query:
            lo = mid + 1
        else:
            hi = mid
    right = lo

    count = right - left
    positions = [suffix_array[i] for i in range(left, right)]
    return count, positions

Time complexity: O(M log N) where M is the query length and N is the dataset size. Each binary search step requires an O(M) string comparison, and there are O(log N) steps.

Memory modes:

RAM mode (default): Load the entire suffix array into memory for fastest queries.
Disk mode (--load-disk): Stream the suffix array from disk using memory-mapped I/O, trading speed for lower memory usage.

Related Pages

Implemented By

Implementation:Google_research_Deduplicate_text_datasets_Count_Occurrences

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment