Principle:ChenghaoMou Text dedup Suffix Array Construction
| Knowledge Sources | |
|---|---|
| Domains | Data_Structures, Deduplication, NLP |
| Last Updated | 2026-02-14 21:00 GMT |
Overview
A data structure approach that builds suffix arrays over concatenated corpus text to identify exact duplicate substrings exceeding a byte-length threshold.
Description
Suffix Array Construction is the foundational step of substring-level deduplication as proposed by Lee et al. (2021). Unlike document-level methods (MinHash, SimHash, Bloom Filter) that work on whole documents, this approach operates at the substring level. The process: (1) concatenate all documents into a single byte stream, tracking per-document byte offsets, (2) build a suffix array over the entire byte stream using an external tool, (3) use the suffix array to find self-similar regions exceeding a length threshold, and (4) collect all duplicate byte ranges.
This technique was used to deduplicate C4, RealNews, and other large training corpora for language models, showing measurable improvements in downstream model quality.
Usage
Use this principle when substring-level exact deduplication is needed (e.g., removing boilerplate text, repeated paragraphs across documents). Requires the Google Research deduplicate-text-datasets external tool.
Theoretical Basis
A suffix array SA for a string S of length n is a sorted array of all suffixes:
Self-similar regions are found by comparing adjacent suffixes in the sorted array. If two suffixes share a common prefix of length >= k, the corresponding byte range is marked as a duplicate.
# Abstract suffix array deduplication (NOT real implementation)
# Step 1: Concatenate all documents
byte_stream = b"".join(doc.encode() for doc in documents)
offsets = track_byte_boundaries(documents)
# Step 2: Build suffix array (external tool)
suffix_array = build_suffix_array(byte_stream)
# Step 3: Find self-similar regions
duplicates = find_self_similar(suffix_array, length_threshold=k)
# Step 4: Collect duplicate byte ranges
duplicate_ranges = collect(duplicates)