Principle: Suffix Array Construction for Deduplicating Text Datasets (Google Research)
| Knowledge Sources | |
|---|---|
| Domains | Data_Structures, String_Algorithms, Text_Deduplication |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
A data structure construction algorithm that builds a sorted array of suffix pointers over a byte sequence, enabling substring queries in O(M log N) time (for a query of length M) and efficient duplicate detection.
Description
A suffix array for a sequence S is an array containing the starting positions of all suffixes of S, sorted in lexicographic order. Rather than storing entire suffixes, only integer pointers to the start of each suffix are stored, making the representation space-efficient at O(N) where N is the length of the input.
Suffix arrays solve the fundamental problem of efficient substring search across massive text corpora. Without a suffix array, finding all occurrences of a substring requires O(N*M) time (where M is query length). With a suffix array, the same query takes O(M log N) time via binary search. This efficiency is critical when deduplicating datasets containing hundreds of gigabytes of text.
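The O(M log N) query can be made concrete with a small sketch: build the array naively by sorting suffix slices (a toy O(N^2 log N) construction, fine for small inputs, not the linear-time method described below), then count occurrences of a query with two binary searches. Function names here are illustrative, not the repository's API.

```python
# Illustrative sketch (not the production implementation): count
# substring occurrences with two binary searches over a suffix array.
def build_naive_sa(text: bytes):
    # Toy O(N^2 log N) construction; real code uses a linear-time algorithm.
    return sorted(range(len(text)), key=lambda i: text[i:])

def count_occurrences(text: bytes, sa, query: bytes) -> int:
    def bound(strict: bool) -> int:
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[sa[mid]:sa[mid] + len(query)]
            # Each comparison touches at most M bytes -> O(M log N) total
            if prefix < query or (strict and prefix == query):
                lo = mid + 1
            else:
                hi = mid
        return lo
    return bound(strict=True) - bound(strict=False)

sa = build_naive_sa(b"banana")                   # [5, 3, 1, 0, 4, 2]
print(count_occurrences(b"banana", sa, b"ana"))  # prints 2
```

All suffixes sharing a prefix are contiguous in the sorted array, which is why two binary searches bracket every occurrence.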
The construction uses the SA-IS (Suffix Array Induced Sorting) algorithm, which builds the suffix array in O(N) linear time. The key insight of SA-IS is to classify suffixes as either S-type (smaller than the following suffix) or L-type (larger), identify the leftmost S-type suffixes (LMS suffixes), and use these to induce the positions of all other suffixes.
For datasets that exceed available memory, the construction is parallelized by splitting the input into overlapping chunks, building partial suffix arrays for each chunk independently, and then merging them into a single global suffix array using a multi-way merge with a priority queue.
Usage
Use this technique when you need to build an index over a large text corpus for substring-level deduplication or querying. It is the prerequisite step for all deduplication workflows in this repository (self-similar, across-similar) and for count-occurrence queries. The suffix array must be built once per dataset and can be reused for multiple deduplication or query passes.
Theoretical Basis
The suffix array A for a string S of length N is defined as the permutation of {0, 1, ..., N-1} satisfying:

S[A[0]..] < S[A[1]..] < ... < S[A[N-1]..]

Where S[j..] denotes the suffix of S starting at position j, and < is lexicographic comparison.
SA-IS Algorithm (high-level):
# Abstract algorithm description (NOT real implementation)
def build_suffix_array(text):
    # 1. Classify each suffix as S-type or L-type
    types = classify_suffixes(text)                # O(N)
    # 2. Find LMS (Leftmost S-type) suffixes
    lms_positions = find_lms_suffixes(types)       # O(N)
    # 3. Bucket-sort LMS suffixes, then induce L-type and S-type
    sa = induced_sort(text, types, lms_positions)  # O(N)
    # 4. If the LMS substrings are not all unique, rename each LMS
    #    substring to a single symbol and recurse on the reduced string
    lms_substrings = extract_lms_substrings(text, lms_positions)
    if not all_unique(lms_substrings):
        reduced = reduce_problem(lms_substrings)
        sa_reduced = build_suffix_array(reduced)   # Recurse
        sa = induce_from_reduced(sa_reduced)
    return sa
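Steps 1 and 2 above can be shown as a minimal runnable sketch. It uses the usual SA-IS convention that a virtual sentinel, smaller than every byte, terminates the text and is S-type by definition; the function names mirror the pseudocode but are illustrative.

```python
# Sketch of SA-IS steps 1-2 (illustrative; assumes a virtual sentinel
# at position N that compares smaller than every byte).
def classify_suffixes(text: bytes):
    n = len(text)
    is_s = [False] * (n + 1)
    is_s[n] = True              # empty sentinel suffix is S-type
    if n > 0:
        is_s[n - 1] = False     # last byte > sentinel, so L-type
    for i in range(n - 2, -1, -1):
        # S-type iff text[i:] < text[i+1:]
        is_s[i] = text[i] < text[i + 1] or (
            text[i] == text[i + 1] and is_s[i + 1])
    return is_s

def find_lms_suffixes(is_s):
    # LMS = S-type position whose left neighbour is L-type
    return [i for i in range(1, len(is_s)) if is_s[i] and not is_s[i - 1]]

types = classify_suffixes(b"banana")
print(find_lms_suffixes(types))  # prints [1, 3, 6]
```

The right-to-left scan works because a suffix's type depends only on its first byte and the type of the suffix one position to its right.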
Parallel construction for large files:
# Abstract parallel construction
def parallel_suffix_array(data_file, num_jobs, num_threads):
    chunk_size = len(data_file) // num_jobs
    overlap = 100000  # bytes of overlap to handle cross-boundary matches
    # Phase 1: Build partial suffix arrays in parallel
    partial_sa = [None] * num_jobs
    for i in range(num_jobs):  # each iteration runs as a separate job
        start = i * chunk_size
        end = min((i + 1) * chunk_size + overlap, len(data_file))
        partial_sa[i] = build_suffix_array(data_file[start:end])
    # Phase 2: Multi-way merge partial arrays into one global array
    global_sa = merge(partial_sa, num_threads)
    return global_sa
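Phase 2 can be sketched with Python's heapq. This toy version compares whole suffix slices (clear but memory-heavy; a real merger compares bounded prefixes on demand) and assumes each partial array holds global positions and has already been trimmed so that positions in an overlap region appear in exactly one chunk.

```python
import heapq

# Toy multi-way merge of partial suffix arrays (illustrative only).
# Assumes each partial array holds *global* positions, sorted by suffix,
# with overlap-region positions deduplicated to a single owning chunk.
def merge_partial_arrays(data: bytes, partials):
    heap = []
    for cid, sa in enumerate(partials):
        if sa:
            heapq.heappush(heap, (data[sa[0]:], cid, 0))
    merged = []
    while heap:
        _, cid, idx = heapq.heappop(heap)
        merged.append(partials[cid][idx])
        if idx + 1 < len(partials[cid]):
            nxt = partials[cid][idx + 1]
            heapq.heappush(heap, (data[nxt:], cid, idx + 1))
    return merged

data = b"banana"
# Chunk 0 owns positions 0-2, chunk 1 owns 3-5 (each sorted by suffix)
print(merge_partial_arrays(data, [[1, 0, 2], [5, 3, 4]]))  # [5, 3, 1, 0, 4, 2]
```

With K chunks the heap holds at most K entries, so the merge performs O(N log K) heap operations.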
Variable-width pointers: To save disk space, each suffix array entry uses only the minimum number of bytes needed to address the dataset: ceil(log2(file_size) / 8) bytes per entry. For example, a 1GB file uses 4-byte pointers, while a 1TB file uses 5-byte pointers.
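The width calculation and a round-trip pack/unpack can be sketched as follows; the little-endian fixed-width layout is an assumption for illustration, and the actual on-disk format may differ.

```python
import math

def pointer_width(file_size: int) -> int:
    # Minimum bytes needed to address any offset in the file
    return math.ceil(math.log2(file_size) / 8)

def pack_pointers(entries, width: int) -> bytes:
    # Assumed little-endian fixed-width layout (illustrative)
    return b"".join(e.to_bytes(width, "little") for e in entries)

def unpack_pointers(blob: bytes, width: int):
    return [int.from_bytes(blob[i:i + width], "little")
            for i in range(0, len(blob), width)]

print(pointer_width(2**30))  # 1 GB file -> prints 4
print(pointer_width(2**40))  # 1 TB file -> prints 5
```

Compared with fixed 8-byte pointers, 4-byte entries halve the index size for a 1 GB dataset.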