Principle:Huggingface Datatrove Exact Substring Deduplication

Knowledge Sources	Huggingface_Datatrove
Domains	Data Deduplication, NLP
Last Updated	2026-02-14 17:00 GMT

Overview

Exact Substring Deduplication is the principle of identifying and removing duplicated text passages within documents using suffix arrays, enabling sub-document-level deduplication that catches shared boilerplate and copied paragraphs.

Description

While whole-document deduplication removes entire documents that are identical, many documents share partial content such as boilerplate headers, copyright notices, or copied paragraphs. Exact substring deduplication addresses this by operating at the byte level to find and remove any substring that appears more than once across the corpus, regardless of document boundaries.

This technique was introduced by Lee et al. in "Deduplicating Training Data Makes Language Models Better" (2021) and has become a standard step in preparing high-quality training data for large language models. The approach builds a suffix array over the entire tokenized corpus to efficiently identify all repeated substrings, then removes them while preserving unique content.

Usage

Apply exact substring deduplication when preparing training data for language models and when finer-grained deduplication than whole-document matching is needed. It is particularly effective for web-crawled data where boilerplate text and copied passages are pervasive.

Theoretical Basis

The algorithm relies on suffix arrays, a data structure that provides a sorted ordering of all suffixes of a string. Given a corpus concatenated into a single sequence of tokens:

Suffix Array Construction: All suffixes of the concatenated sequence are sorted lexicographically. This is performed by an external Rust tool for efficiency.

Duplicate Range Identification: By traversing the sorted suffix array, adjacent suffixes that share long common prefixes indicate duplicate substrings. The tool outputs byte ranges marking the locations of duplicated content.

Range Categories: When mapping byte ranges back to individual documents, four cases arise depending on how a duplicate range overlaps with a document boundary:
- Left: Range starts before the document and ends within it
- Centre: Range is entirely within the document
- Right: Range starts within the document and extends beyond it
- Outside: Range entirely encompasses the document

Document Reconstruction: After identifying which byte ranges within each document are duplicated, the corresponding text is decoded and removed. Documents falling below a minimum word count threshold after removal are discarded entirely.

Separator Tokens: Unique 12-byte separators (containing rank ID and document ID) are prepended to each document before concatenation, preventing false matches across document boundaries at the join points.

Related Pages

Implementation:Huggingface_Datatrove_ExactSubstringDedup

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment