Implementation:Huggingface Datatrove ExactSubstringDedup
| Knowledge Sources | |
|---|---|
| Domains | Data Deduplication, NLP |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
The exact substring deduplication module implements a three-stage pipeline (ESDatasetToSequence, ESMergeSequences, ESRangeRemover) that uses suffix arrays to remove substrings occurring more than once in a dataset, following the methodology of "Deduplicating Training Data Makes Language Models Better" (arXiv:2107.06499).
Description
Unlike whole-document deduplication, exact substring deduplication identifies and removes duplicated passages that appear within documents. It works at the granularity of token spans rather than whole documents, so it catches shared boilerplate text, copied paragraphs, and repeated passages that would survive document-level dedup.
ESDatasetToSequence (Stage 1) tokenizes each document using a HuggingFace tokenizer, prepends a unique 12-byte separator (containing rank and doc ID markers), and writes the token bytes as a binary sequence file. It also writes a companion sizes file recording the byte length of each document's sequence.
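The Stage 1 byte layout can be sketched with the standard library alone. This is a minimal illustration, not the module's code: the helper names `pack_document` and `pack_size` are hypothetical, and the exact separator layout (two `0xFFFF` markers, each followed by a 32-bit integer), the 16-bit token width, and the 64-bit size field are assumptions made for the example. The token IDs are arbitrary sample values.

```python
import struct

# Assumed layout: 2 marker bytes + 4-byte doc ID + 2 marker bytes + 4-byte rank.
SEPARATOR_BYTES = 12


def pack_document(token_ids, rank, doc_id):
    """Return separator + token bytes for one document (hypothetical helper)."""
    separator = (
        b"\xff\xff" + struct.pack("<I", doc_id)
        + b"\xff\xff" + struct.pack("<I", rank)
    )
    # Assumption: every token ID fits in an unsigned 16-bit integer.
    tokens = struct.pack(f"<{len(token_ids)}H", *token_ids)
    return separator + tokens


def pack_size(sequence):
    """Encode a sequence's byte length as an unsigned 64-bit integer."""
    return struct.pack("<Q", len(sequence))


# Two sample token IDs for document 7, written by rank 0.
seq = pack_document([15496, 995], rank=0, doc_id=7)
size_record = pack_size(seq)
```

Because the separator embeds the rank and document ID, no two documents share the same separator bytes, which keeps a duplicate match from running across a document boundary.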
ESMergeSequences (Stage 2) concatenates all per-rank sequence files into a single large binary sequence and records cumulative byte offsets per file. This merged sequence is then processed by an external Rust tool (deduplicate-text-datasets) which builds a suffix array and identifies duplicate byte ranges.
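The merge step amounts to concatenation plus a running total of file lengths. The sketch below works on in-memory byte strings for brevity; the function name `merge_sequences` is hypothetical, and storing a leading zero in the offsets list is an assumption of this example, not a statement about the on-disk format.

```python
from itertools import accumulate


def merge_sequences(per_rank_sequences):
    """Concatenate per-rank byte strings; return (merged, cumulative_offsets)."""
    merged = b"".join(per_rank_sequences)
    # offsets[i] is where file i starts; the final entry is the total length.
    offsets = [0, *accumulate(len(s) for s in per_rank_sequences)]
    return merged, offsets


# Three stand-in "sequence files" of 4, 2, and 6 bytes.
merged, offsets = merge_sequences([b"aaaa", b"bb", b"cccccc"])
```

With these offsets, the bytes of file `i` are `merged[offsets[i]:offsets[i + 1]]`, which is what lets Stage 3 translate a global byte position back to the file it came from.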
ESRangeRemover (Stage 3) reads the duplicate byte ranges produced by the external tool, maps them back to individual documents using the byte offsets, decodes the duplicated substrings, removes them from each document's text, and drops documents that fall below a minimum word count after removal.
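The range-mapping logic in Stage 3 can be illustrated as follows. This is a simplified sketch under stated assumptions: it ignores separator bytes and token alignment, cuts raw bytes instead of decoding tokens back to text, and uses a whitespace split as a stand-in for language-aware word tokenization. All three helper names are hypothetical.

```python
from bisect import bisect_right


def split_ranges_by_doc(doc_starts, total_len, dup_ranges):
    """Clip global [start, end) duplicate ranges to per-document local ranges."""
    per_doc = {}
    for start, end in dup_ranges:
        # Index of the last document that starts at or before `start`.
        doc = bisect_right(doc_starts, start) - 1
        while doc < len(doc_starts) and doc_starts[doc] < end:
            is_last = doc + 1 == len(doc_starts)
            doc_end = total_len if is_last else doc_starts[doc + 1]
            lo = max(start, doc_starts[doc]) - doc_starts[doc]
            hi = min(end, doc_end) - doc_starts[doc]
            if lo < hi:
                per_doc.setdefault(doc, []).append((lo, hi))
            doc += 1
    return per_doc


def remove_ranges(data, ranges):
    """Return `data` with the given local [start, end) ranges cut out."""
    kept, cursor = [], 0
    for lo, hi in sorted(ranges):
        kept.append(data[cursor:lo])
        cursor = max(cursor, hi)
    kept.append(data[cursor:])
    return b"".join(kept)


def keep_document(text, min_doc_words=50):
    """Whitespace word count as a stand-in for a real word tokenizer."""
    return len(text.split()) >= min_doc_words


# Three documents starting at bytes 0, 10, and 25 of a 40-byte sequence.
per_doc = split_ranges_by_doc([0, 10, 25], 40, [(3, 6), (8, 12), (30, 40)])
```

In this example the range `(8, 12)` straddles the boundary at byte 10, so it is split into a tail cut on document 0 and a head cut on document 1.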
Usage
Use this module when duplicated passages inside otherwise distinct documents must be removed, particularly when preparing datasets for language model training. The external deduplicate-text-datasets Rust tool must be run between stages 2 and 3.
Code Reference
Source Location
- Repository: Huggingface_Datatrove
- File: src/datatrove/pipeline/dedup/exact_substrings.py
- Lines: 1-343
Signature
class ESDatasetToSequence(PipelineStepWithTokenizer):
    def __init__(
        self,
        output_folder: DataFolderLike,
        tokenizer_name_or_path: str = "gpt2",
    ):

class ESMergeSequences(PipelineStep):
    def __init__(
        self,
        data_folder: DataFolderLike,
        tasks_stage_1: int,
        bytes_per_batch: int = int(500e6),
    ):

class ESRangeRemover(PipelineStepWithTokenizer):
    def __init__(
        self,
        sequence_folder: DataFolderLike,
        tokenizer_name_or_path: str = "gpt2",
        min_doc_words: int = 50,
        language: str = Languages.english,
    ):
Import
from datatrove.pipeline.dedup.exact_substrings import (
    ESDatasetToSequence,
    ESMergeSequences,
    ESRangeRemover,
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| output_folder | DataFolderLike | Yes (Stage 1) | Folder where tokenized sequences are saved |
| tokenizer_name_or_path | str | No | HuggingFace tokenizer name or path (default: "gpt2") |
| data_folder | DataFolderLike | Yes (Stage 2) | Folder containing sequence files from Stage 1 |
| tasks_stage_1 | int | Yes (Stage 2) | Number of tasks used in Stage 1 |
| bytes_per_batch | int | No | Number of bytes read per batch when merging sequence files in Stage 2 (default: int(500e6)) |
| sequence_folder | DataFolderLike | Yes (Stage 3) | Folder containing sequences and byte range files |
| min_doc_words | int | No | Minimum words to keep a document after removal (default: 50) |
| language | str | No | Language for word tokenization (default: English) |
Outputs
| Name | Type | Description |
|---|---|---|
| Sequence files | Binary | Tokenized document sequences with separators (Stage 1) |
| Size files | Binary | Byte lengths of each document's sequence (Stage 1) |
| Big sequence | Binary | Single concatenated sequence of all documents (Stage 2) |
| Byte offsets | Binary | Cumulative byte offsets per file (Stage 2) |
| Filtered documents | DocumentsPipeline | Documents with duplicate substrings removed (Stage 3) |
Usage Examples
Basic Usage
from datatrove.pipeline.dedup.exact_substrings import (
    ESDatasetToSequence,
    ESMergeSequences,
    ESRangeRemover,
)

# Stage 1: Convert documents to token sequences
stage1 = ESDatasetToSequence(
    output_folder="/data/dedup/sequences",
    tokenizer_name_or_path="gpt2",
)

# Stage 2: Merge all sequences into one
stage2 = ESMergeSequences(
    data_folder="/data/dedup/sequences",
    tasks_stage_1=100,
)

# (External step: run deduplicate-text-datasets Rust tool)

# Stage 3: Remove duplicate substrings
stage3 = ESRangeRemover(
    sequence_folder="/data/dedup/sequences",
    tokenizer_name_or_path="gpt2",
    min_doc_words=50,
)