Principle:ChenghaoMou Text dedup Suffix Array Construction

Knowledge Sources	Deduplicating Training Data Makes Language Models Better text-dedup deduplicate-text-datasets
Domains	Data_Structures, Deduplication, NLP
Last Updated	2026-02-14 21:00 GMT

Overview

A data structure approach that builds suffix arrays over concatenated corpus text to identify exact duplicate substrings exceeding a byte-length threshold.

Description

Suffix Array Construction is the foundational step of substring-level deduplication as proposed by Lee et al. (2021). Unlike document-level methods (MinHash, SimHash, Bloom Filter) that work on whole documents, this approach operates at the substring level. The process: (1) concatenate all documents into a single byte stream, tracking per-document byte offsets, (2) build a suffix array over the entire byte stream using an external tool, (3) use the suffix array to find self-similar regions exceeding a length threshold, and (4) collect all duplicate byte ranges.

This technique was used to deduplicate C4, RealNews, and other large training corpora for language models, showing measurable improvements in downstream model quality.

Usage

Use this principle when substring-level exact deduplication is needed (e.g., removing boilerplate text, repeated paragraphs across documents). Requires the Google Research deduplicate-text-datasets external tool.

Theoretical Basis

A suffix array SA for a string S of length n is a sorted array of all suffixes: $S A [i] = j such that S [j . . n] is the i-th suffix in sorted order$

Self-similar regions are found by comparing adjacent suffixes in the sorted array. If two suffixes share a common prefix of length >= k, the corresponding byte range is marked as a duplicate.

# Abstract suffix array deduplication (NOT real implementation)
# Step 1: Concatenate all documents
byte_stream = b"".join(doc.encode() for doc in documents)
offsets = track_byte_boundaries(documents)

# Step 2: Build suffix array (external tool)
suffix_array = build_suffix_array(byte_stream)

# Step 3: Find self-similar regions
duplicates = find_self_similar(suffix_array, length_threshold=k)

# Step 4: Collect duplicate byte ranges
duplicate_ranges = collect(duplicates)

Related Pages

Implemented By

Implementation:ChenghaoMou_Text_dedup_SA_Run_Command

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment