Principle: Suffix Array Construction for Deduplicating Text Datasets (Google Research)
| Knowledge Sources | |
|---|---|
| Domains | Data_Structures, String_Algorithms, Text_Deduplication |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
A data structure construction algorithm that builds a sorted array of suffix pointers over a byte sequence, enabling substring queries in O(M log N) time (for a query of length M) and efficient duplicate detection.
Description
A suffix array for a sequence S is an array containing the starting positions of all suffixes of S, sorted in lexicographic order. Rather than storing entire suffixes, only integer pointers to the start of each suffix are stored, making the representation space-efficient at O(N) where N is the length of the input.
Suffix arrays solve the fundamental problem of efficient substring search across massive text corpora. Without a suffix array, finding all occurrences of a substring requires O(N*M) time (where M is query length). With a suffix array, the same query takes O(M log N) time via binary search. This efficiency is critical when deduplicating datasets containing hundreds of gigabytes of text.
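The O(M log N) query can be made concrete with a small sketch: build the array naively by sorting suffix slices (a toy O(N^2 log N) construction, fine for small inputs, not the linear-time method described below), then count occurrences of a query with two binary searches. Function names here are illustrative, not the repository's API.

```python
# Illustrative sketch (not the production implementation): count
# substring occurrences with two binary searches over a suffix array.
def build_naive_sa(text: bytes):
    # Toy O(N^2 log N) construction; real code uses a linear-time algorithm.
    return sorted(range(len(text)), key=lambda i: text[i:])

def count_occurrences(text: bytes, sa, query: bytes) -> int:
    def bound(strict: bool) -> int:
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[sa[mid]:sa[mid] + len(query)]
            # Each comparison touches at most M bytes -> O(M log N) total
            if prefix < query or (strict and prefix == query):
                lo = mid + 1
            else:
                hi = mid
        return lo
    return bound(strict=True) - bound(strict=False)

sa = build_naive_sa(b"banana")                   # [5, 3, 1, 0, 4, 2]
print(count_occurrences(b"banana", sa, b"ana"))  # prints 2
```

All suffixes sharing a prefix are contiguous in the sorted array, which is why two binary searches bracket every occurrence.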
The construction uses the SA-IS (Suffix Array Induced Sorting) algorithm, which builds the suffix array in O(N) linear time. The key insight of SA-IS is to classify suffixes as either S-type (smaller than the following suffix) or L-type (larger), identify the leftmost S-type suffixes (LMS suffixes), and use these to induce the positions of all other suffixes.
For datasets that exceed available memory, the construction is parallelized by splitting the input into overlapping chunks, building partial suffix arrays for each chunk independently, and then merging them into a single global suffix array using a multi-way merge with a priority queue.
Usage
Use this technique when you need to build an index over a large text corpus for substring-level deduplication or querying. It is the prerequisite step for all deduplication workflows in this repository (self-similar, across-similar) and for count-occurrence queries. The suffix array must be built once per dataset and can be reused for multiple deduplication or query passes.
Theoretical Basis
The suffix array A for a string S of length N is defined as the permutation of {0, 1, ..., N-1} satisfying:

S[A[0]..] < S[A[1]..] < ... < S[A[N-1]..]

Where S[j..] denotes the suffix of S starting at position j, and < is lexicographic comparison.
SA-IS Algorithm (high-level):
# Abstract algorithm description (NOT real implementation)
def build_suffix_array(text):
    # 1. Classify each suffix as S-type or L-type
    types = classify_suffixes(text)                # O(N)
    # 2. Find LMS (Leftmost S-type) suffixes
    lms_positions = find_lms_suffixes(types)       # O(N)
    # 3. Bucket-sort LMS suffixes, then induce L-type and S-type
    sa = induced_sort(text, types, lms_positions)  # O(N)
    # 4. If the LMS substrings are not all unique, rename each LMS
    #    substring to a single symbol and recurse on the reduced string
    lms_substrings = extract_lms_substrings(text, lms_positions)
    if not all_unique(lms_substrings):
        reduced = reduce_problem(lms_substrings)
        sa_reduced = build_suffix_array(reduced)   # Recurse
        sa = induce_from_reduced(sa_reduced)
    return sa
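Steps 1 and 2 above can be shown as a minimal runnable sketch. It uses the usual SA-IS convention that a virtual sentinel, smaller than every byte, terminates the text and is S-type by definition; the function names mirror the pseudocode but are illustrative.

```python
# Sketch of SA-IS steps 1-2 (illustrative; assumes a virtual sentinel
# at position N that compares smaller than every byte).
def classify_suffixes(text: bytes):
    n = len(text)
    is_s = [False] * (n + 1)
    is_s[n] = True              # empty sentinel suffix is S-type
    if n > 0:
        is_s[n - 1] = False     # last byte > sentinel, so L-type
    for i in range(n - 2, -1, -1):
        # S-type iff text[i:] < text[i+1:]
        is_s[i] = text[i] < text[i + 1] or (
            text[i] == text[i + 1] and is_s[i + 1])
    return is_s

def find_lms_suffixes(is_s):
    # LMS = S-type position whose left neighbour is L-type
    return [i for i in range(1, len(is_s)) if is_s[i] and not is_s[i - 1]]

types = classify_suffixes(b"banana")
print(find_lms_suffixes(types))  # prints [1, 3, 6]
```

The right-to-left scan works because a suffix's type depends only on its first byte and the type of the suffix one position to its right.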
Parallel construction for large files:
# Abstract parallel construction
def parallel_suffix_array(data_file, num_jobs, num_threads):
    chunk_size = len(data_file) // num_jobs
    overlap = 100000  # bytes of overlap to handle cross-boundary matches
    # Phase 1: Build partial suffix arrays in parallel
    partial_sa = [None] * num_jobs
    for i in range(num_jobs):  # each iteration runs as a separate job
        start = i * chunk_size
        end = min((i + 1) * chunk_size + overlap, len(data_file))
        partial_sa[i] = build_suffix_array(data_file[start:end])
    # Phase 2: Multi-way merge partial arrays into one global array
    global_sa = merge(partial_sa, num_threads)
    return global_sa
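Phase 2 can be sketched with Python's heapq. This toy version compares whole suffix slices (clear but memory-heavy; a real merger compares bounded prefixes on demand) and assumes each partial array holds global positions and has already been trimmed so that positions in an overlap region appear in exactly one chunk.

```python
import heapq

# Toy multi-way merge of partial suffix arrays (illustrative only).
# Assumes each partial array holds *global* positions, sorted by suffix,
# with overlap-region positions deduplicated to a single owning chunk.
def merge_partial_arrays(data: bytes, partials):
    heap = []
    for cid, sa in enumerate(partials):
        if sa:
            heapq.heappush(heap, (data[sa[0]:], cid, 0))
    merged = []
    while heap:
        _, cid, idx = heapq.heappop(heap)
        merged.append(partials[cid][idx])
        if idx + 1 < len(partials[cid]):
            nxt = partials[cid][idx + 1]
            heapq.heappush(heap, (data[nxt:], cid, idx + 1))
    return merged

data = b"banana"
# Chunk 0 owns positions 0-2, chunk 1 owns 3-5 (each sorted by suffix)
print(merge_partial_arrays(data, [[1, 0, 2], [5, 3, 4]]))  # [5, 3, 1, 0, 4, 2]
```

With K chunks the heap holds at most K entries, so the merge performs O(N log K) heap operations.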
Variable-width pointers: To save disk space, each suffix array entry uses only the minimum number of bytes needed to address the dataset: ceil(log2(file_size) / 8) bytes per entry. For example, a 1GB file uses 4-byte pointers, while a 1TB file uses 5-byte pointers.
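The width calculation and a round-trip pack/unpack can be sketched as follows; the little-endian fixed-width layout is an assumption for illustration, and the actual on-disk format may differ.

```python
import math

def pointer_width(file_size: int) -> int:
    # Minimum bytes needed to address any offset in the file
    return math.ceil(math.log2(file_size) / 8)

def pack_pointers(entries, width: int) -> bytes:
    # Assumed little-endian fixed-width layout (illustrative)
    return b"".join(e.to_bytes(width, "little") for e in entries)

def unpack_pointers(blob: bytes, width: int):
    return [int.from_bytes(blob[i:i + width], "little")
            for i in range(0, len(blob), width)]

print(pointer_width(2**30))  # 1 GB file -> prints 4
print(pointer_width(2**40))  # 1 TB file -> prints 5
```

Compared with fixed 8-byte pointers, 4-byte entries halve the index size for a 1 GB dataset.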