Jump to content

Connect Leeroopedia MCP: Equip your AI agents to search best practices, build plans, verify code, diagnose failures, and look up hyperparameter defaults.

Principle:ChenghaoMou Text dedup Suffix Array Construction

From Leeroopedia
Knowledge Sources
Domains Data_Structures, Deduplication, NLP
Last Updated 2026-02-14 21:00 GMT

Overview

A data structure approach that builds suffix arrays over concatenated corpus text to identify exact duplicate substrings exceeding a byte-length threshold.

Description

Suffix Array Construction is the foundational step of substring-level deduplication as proposed by Lee et al. (2021). Unlike document-level methods (MinHash, SimHash, Bloom Filter) that work on whole documents, this approach operates at the substring level. The process: (1) concatenate all documents into a single byte stream, tracking per-document byte offsets, (2) build a suffix array over the entire byte stream using an external tool, (3) use the suffix array to find self-similar regions exceeding a length threshold, and (4) collect all duplicate byte ranges.

This technique was used to deduplicate C4, RealNews, and other large training corpora for language models, showing measurable improvements in downstream model quality.

Usage

Use this principle when substring-level exact deduplication is needed (e.g., removing boilerplate text, repeated paragraphs across documents). Requires the Google Research deduplicate-text-datasets external tool.

Theoretical Basis

A suffix array SA for a string S of length n is a sorted array of all suffixes: SA[i]=j such that S[j..n] is the i-th suffix in sorted order

Self-similar regions are found by comparing adjacent suffixes in the sorted array. If two suffixes share a common prefix of length >= k, the corresponding byte range is marked as a duplicate.

# Abstract suffix array deduplication (NOT real implementation)
# Step 1: Concatenate all documents
byte_stream = b"".join(doc.encode() for doc in documents)
offsets = track_byte_boundaries(documents)

# Step 2: Build suffix array (external tool)
suffix_array = build_suffix_array(byte_stream)

# Step 3: Find self-similar regions
duplicates = find_self_similar(suffix_array, length_threshold=k)

# Step 4: Collect duplicate byte ranges
duplicate_ranges = collect(duplicates)

Related Pages

Implemented By

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment