
Workflow:ChenghaoMou Text dedup Suffix Array Deduplication

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, NLP, Deduplication
Last Updated 2026-02-14 21:00 GMT

Overview

End-to-end process for substring-level exact deduplication using suffix arrays, delegating to Google Research's deduplicate-text-datasets external tool.

Description

This workflow performs substring-level exact deduplication, removing repeated substrings that appear across documents rather than removing entire duplicate documents. It uses suffix arrays to efficiently find all repeated substrings above a configurable length threshold. The actual suffix array construction and self-similarity computation are delegated to Google Research's deduplicate-text-datasets tool (a Python/Rust hybrid). The text-dedup wrapper handles data preparation, byte offset tracking, result restoration, and substring merging/removal. This is the only algorithm in the library that modifies document content rather than removing entire documents.

Goal: A cleaned dataset where repeated substrings (above a length threshold) are removed from individual documents, with empty documents filtered out.

Scope: From raw text data through suffix array construction, self-similarity detection, substring restoration, and cleaned output.

Strategy: Concatenates all documents into a single byte stream, builds a suffix array, finds repeated substrings using self-similarity, then maps byte ranges back to document boundaries for targeted removal.

Usage

Execute this workflow when you need to remove repeated boilerplate text, templates, or common passages that appear across multiple documents in a corpus. Unlike document-level deduplication, this approach preserves unique content within documents while stripping shared substrings. It is particularly useful for cleaning web-crawled data that contains repeated headers, footers, navigation text, or legal boilerplate. Requires the external deduplicate-text-datasets repository to be cloned and built.

Execution Steps

Step 1: Configuration Loading

Parse the TOML configuration file into a typed Config object. The suffix array configuration specifies the path to the external Google Research repository, the length threshold for duplicate substrings, the merge strategy for overlapping duplicates, and a cache directory for intermediate suffix array data. Output settings automatically disable cluster saving since this algorithm does not produce cluster assignments.

Key considerations:

  • The google_repo_path must point to a cloned and built deduplicate-text-datasets repository
  • Length threshold controls the minimum substring length considered as a duplicate
  • Merge strategy (e.g., "longest") determines how overlapping duplicate regions are combined
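To make the configuration surface concrete, here is a sketch of what such a TOML file might look like. All key and table names below are hypothetical illustrations of the settings described above; consult the text-dedup documentation for the exact schema.

```toml
# Hypothetical config sketch -- key names are illustrative, not authoritative.
[suffix_array]
google_repo_path = "third_party/deduplicate-text-datasets"  # cloned and built
length_threshold = 100        # minimum duplicate substring length, in bytes
merge_strategy = "longest"    # how overlapping duplicate ranges are combined
cache_dir = ".cache/suffix_array"

[output]
save_clusters = false  # forced off: this algorithm produces no cluster assignments
```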

Step 2: Data Loading and Text Serialization

Load the dataset and serialize all document texts into a single contiguous byte stream. Track byte offsets (start/end slices) for each document to enable mapping detected duplicate ranges back to their source documents. Temporary directories are created for the suffix array cache, output files, and intermediate data.

Key considerations:

  • All documents are concatenated into one file for suffix array construction
  • Byte offsets are tracked per document for later restoration
  • Temporary directories are created and cleaned up within the Google repo path
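The serialization step can be sketched as follows. This is a minimal illustration assuming documents are plain strings; the wrapper's actual field names and file layout may differ.

```python
# Sketch of Step 2: concatenate documents into one byte stream and record
# the (start, end) byte offsets of each document for later restoration.
from typing import List, Tuple


def serialize(docs: List[str], path: str) -> List[Tuple[int, int]]:
    """Write all documents to a single file; return per-document byte offsets."""
    offsets = []
    pos = 0
    with open(path, "wb") as f:
        for doc in docs:
            data = doc.encode("utf-8")
            f.write(data)
            offsets.append((pos, pos + len(data)))  # half-open [start, end)
            pos += len(data)
    return offsets
```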

Step 3: Suffix Array Construction

Invoke the external tool's Python script to build a suffix array from the concatenated text file. The suffix array is a sorted array of all suffixes of the input text, enabling efficient identification of repeated substrings. This step shells out to the external repository's scripts.

Key considerations:

  • Delegates to an external Python script via subprocess
  • The suffix array is stored in the cache directory
  • Memory requirements scale with the total text size
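The shell-out in Step 3 might look like the sketch below. The script name follows the deduplicate-text-datasets README (`scripts/make_suffix_array.py`); verify it against your checkout of that repository.

```python
# Sketch of Step 3: invoke the external repository's suffix array builder.
import subprocess
import sys


def build_suffix_array(google_repo_path: str, data_file: str) -> None:
    """Build a suffix array for `data_file` using the external tool."""
    subprocess.run(
        [sys.executable, "scripts/make_suffix_array.py", data_file],
        cwd=google_repo_path,  # the script expects to run from the repo root
        check=True,            # surface failures from the external tool
    )
```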

Step 4: Self-Similarity Detection

Run the external tool's Rust-based self-similarity and collection commands. The self-similarity step identifies all byte ranges where text is repeated above the length threshold. The collect step aggregates these ranges into a final output file listing all duplicate byte spans. Multi-threading is supported for the self-similarity computation.

Key considerations:

  • Uses Rust (cargo run) for performance-critical computation
  • Two sub-steps: self-similar detection and result collection
  • The num_proc parameter controls thread count for parallelism
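The two Rust sub-steps could be driven as shown below. The subcommand and flag names mirror those documented in the deduplicate-text-datasets README (`self-similar`, `collect`), but treat them as assumptions and check them against the version you have built.

```python
# Sketch of Step 4: run the Rust self-similarity pass, then collect the
# detected duplicate byte spans into a single output.
import subprocess


def find_self_similar(repo: str, data_file: str, cache_dir: str,
                      threshold: int, num_proc: int) -> None:
    base = ["cargo", "run", "--release"]
    # Pass 1: mark every byte range repeated for at least `threshold` bytes.
    subprocess.run(base + ["self-similar",
                           "--data-file", data_file,
                           "--length-threshold", str(threshold),
                           "--cache-dir", cache_dir,
                           "--num-threads", str(num_proc)],
                   cwd=repo, check=True)
    # Pass 2: aggregate the marked ranges into a final list of duplicate spans.
    subprocess.run(base + ["collect",
                           "--data-file", data_file,
                           "--cache-dir", cache_dir,
                           "--length-threshold", str(threshold)],
                   cwd=repo, check=True)
```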

Step 5: Duplicate Range Restoration and Merging

Map the detected duplicate byte ranges back to individual documents using the tracked byte offsets. Overlapping or adjacent duplicate ranges within each document are merged according to the configured merge strategy. This produces a per-document list of byte slices to remove.

Key considerations:

  • Byte ranges from the global file must be translated to document-local offsets
  • The merge strategy handles overlapping duplicate regions
  • The restoration logic handles edge cases at document boundaries
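The restoration and merge logic can be sketched as below. Function names are hypothetical; the sketch assumes half-open byte ranges and interprets the "longest" strategy as merging overlapping ranges into maximal spans, which may differ from the library's exact behavior.

```python
# Sketch of Step 5: translate global duplicate byte ranges to document-local
# offsets, then merge overlapping ranges within each document.
from typing import List, Tuple

Range = Tuple[int, int]  # half-open [start, end) byte range


def to_local(global_ranges: List[Range], doc: Range) -> List[Range]:
    """Clip global ranges to one document and shift to document-local offsets."""
    start, end = doc
    out = []
    for lo, hi in global_ranges:
        lo, hi = max(lo, start), min(hi, end)  # clip at document boundaries
        if lo < hi:
            out.append((lo - start, hi - start))
    return out


def merge_longest(ranges: List[Range]) -> List[Range]:
    """Merge overlapping or adjacent ranges into maximal spans."""
    merged: List[Range] = []
    for lo, hi in sorted(ranges):
        if merged and lo <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged
```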

Step 6: Substring Removal and Output

Remove the identified duplicate substrings from each document's text. Filter out any documents that become empty after removal. Save the cleaned dataset to disk. Optionally clean up all temporary files and cache directories.

Key considerations:

  • Documents may become empty after substring removal and are filtered out
  • Size reduction is reported in megabytes (before/after)
  • No cluster metadata is produced
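The final removal pass might look like this minimal sketch. It assumes each document's ranges are byte offsets (not character offsets), sorted, and non-overlapping, as the merge step above would produce; function names are illustrative.

```python
# Sketch of Step 6: cut duplicate byte slices out of each document and drop
# any documents left empty afterward.
from typing import Iterable, List, Tuple


def remove_slices(text: str, ranges: List[Tuple[int, int]]) -> str:
    data = text.encode("utf-8")  # ranges address bytes, not characters
    kept, pos = [], 0
    for lo, hi in ranges:
        kept.append(data[pos:lo])  # keep everything before this duplicate
        pos = hi                   # skip the duplicate span itself
    kept.append(data[pos:])
    return b"".join(kept).decode("utf-8", errors="ignore")


def clean(docs: Iterable[str], per_doc_ranges) -> List[str]:
    cleaned = (remove_slices(d, r) for d, r in zip(docs, per_doc_ranges))
    return [d for d in cleaned if d]  # filter out now-empty documents
```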
