Implementation:Huggingface Datatrove ExactSubstringDedup

Knowledge Sources	Huggingface_Datatrove
Domains	Data Deduplication, NLP
Last Updated	2026-02-14 17:00 GMT

Overview

The exact substring deduplication module implements a three-stage pipeline (ESDatasetToSequence, ESMergeSequences, ESRangeRemover) that removes substrings duplicated across a dataset using a suffix array, following the methodology of "Deduplicating Training Data Makes Language Models Better" (arXiv:2107.06499).

Description

Unlike whole-document deduplication, exact substring deduplication identifies and removes duplicated passages inside documents, even when the documents that contain them differ elsewhere. Because it matches at the token level, it catches shared boilerplate, copied paragraphs, and repeated passages that survive document-level deduplication.
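
To make the distinction concrete, the toy comparison below hashes two documents that share a boilerplate passage and then looks for their longest common substring. It uses Python's difflib for the substring search, which is a brute-force stand-in and not the suffix-array method this module relies on. Both documents are invented for illustration.

```python
import hashlib
from difflib import SequenceMatcher

# Two hypothetical documents that differ overall but share a boilerplate passage.
boilerplate = "Subscribe to our newsletter for weekly updates. "
doc_a = boilerplate + "Today we review a new laptop."
doc_b = boilerplate + "Our guide to hiking in the Alps."

# Document-level dedup compares whole-document fingerprints, which differ here.
same_fingerprint = (
    hashlib.sha1(doc_a.encode()).hexdigest() == hashlib.sha1(doc_b.encode()).hexdigest()
)

# A substring-level comparison still finds the shared passage.
matcher = SequenceMatcher(None, doc_a, doc_b, autojunk=False)
match = matcher.find_longest_match(0, len(doc_a), 0, len(doc_b))
shared = doc_a[match.a : match.a + match.size]
```

Here same_fingerprint is False, while shared recovers the boilerplate passage that a document-level method would leave in both documents.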

ESDatasetToSequence (Stage 1) tokenizes each document using a HuggingFace tokenizer, prepends a unique 12-byte separator (containing rank and doc ID markers), and writes the token bytes as a binary sequence file. It also writes a companion sizes file recording the byte length of each document's sequence.
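
A minimal sketch of the Stage 1 byte layout, assuming token IDs are stored as little-endian 16-bit integers and that the 12-byte separator consists of two 0xff marker bytes followed by the 4-byte document ID, then two more marker bytes followed by the 4-byte rank. The function names and token IDs are hypothetical, and the actual implementation may differ in detail.

```python
import struct

SEPARATOR_BYTES = 12  # 2 marker bytes + 4-byte doc ID + 2 marker bytes + 4-byte rank


def encode_doc(token_ids: list[int], rank: int, doc_id: int) -> bytes:
    """Prefix a document's tokens with a separator unique to (rank, doc_id)."""
    # Assumed layout: 0xff 0xff markers ahead of each little-endian uint32 field.
    separator = (
        b"\xff\xff" + struct.pack("<I", doc_id) + b"\xff\xff" + struct.pack("<I", rank)
    )
    # Assumed encoding: every token ID takes 2 bytes, so IDs must be below 65536.
    body = struct.pack(f"<{len(token_ids)}H", *token_ids)
    return separator + body


def pack_sizes(doc_lens: list[int]) -> bytes:
    """Serialise per-document byte lengths as unsigned 64-bit integers."""
    return struct.pack(f"<{len(doc_lens)}Q", *doc_lens)


docs = [[15496, 995], [40, 588, 3290]]  # hypothetical token IDs for two documents
encoded = [encode_doc(ids, rank=0, doc_id=i) for i, ids in enumerate(docs)]
sequence = b"".join(encoded)  # what the per-rank sequence file would hold
sizes = pack_sizes([len(b) for b in encoded])  # what the companion sizes file would hold
```

Because no two documents share the same (rank, doc ID) pair, no two separators are identical, so a duplicate match should not be able to contain a complete separator and thereby cover two whole documents.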

ESMergeSequences (Stage 2) concatenates all per-rank sequence files into a single large binary sequence and records cumulative byte offsets per file. An external Rust tool (deduplicate-text-datasets) then builds a suffix array over the merged sequence and reports the byte ranges that are duplicated.
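
The merge step can be sketched with in-memory buffers standing in for the per-rank files. The helper name and the chunk_size parameter are hypothetical; chunk_size plays the role that a per-batch read size would play when copying large files.

```python
import io
from itertools import accumulate


def merge_sequences(
    per_rank_sequences: list[bytes], chunk_size: int = 1 << 20
) -> tuple[bytes, list[int]]:
    """Concatenate per-rank sequences and return cumulative byte offsets."""
    merged = io.BytesIO()
    sizes = []
    for seq in per_rank_sequences:
        reader = io.BytesIO(seq)
        written = 0
        # Copy in bounded chunks so a large file never sits in memory all at once.
        while chunk := reader.read(chunk_size):
            merged.write(chunk)
            written += len(chunk)
        sizes.append(written)
    # offsets[i] is where file i starts; offsets[-1] is the total merged length.
    offsets = [0, *accumulate(sizes)]
    return merged.getvalue(), offsets


big_sequence, offsets = merge_sequences([b"a" * 16, b"b" * 18, b"c" * 5], chunk_size=8)
```

With the three toy inputs above, offsets is [0, 16, 34, 39], so the bytes of the second file occupy big_sequence[16:34].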

ESRangeRemover (Stage 3) reads the duplicate byte ranges produced by the external tool, maps them back to individual documents using the byte offsets, decodes the duplicated substrings, removes them from each document's text, and drops documents that fall below a minimum word count after removal.
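
The mapping and removal logic can be sketched as follows, assuming 2-byte token IDs, a 12-byte separator at the start of each document, and duplicate ranges that each begin inside a single document. Whitespace splitting stands in for the language-aware word tokenizer, and the toy vocabulary and helper names are hypothetical.

```python
import struct
from bisect import bisect_right
from itertools import accumulate

SEPARATOR_BYTES = 12
TOKEN_BYTES = 2  # assumes uint16 token IDs


def local_ranges(dup_ranges, doc_sizes):
    """Map global (start, end) byte ranges onto per-document token-byte ranges."""
    starts = list(accumulate(doc_sizes, initial=0))[:-1]
    per_doc: dict[int, list[tuple[int, int]]] = {}
    for start, end in dup_ranges:
        idx = bisect_right(starts, start) - 1  # last document starting at or before start
        payload_len = doc_sizes[idx] - SEPARATOR_BYTES
        # Shift into the document's token payload and clip to its bounds.
        lo = max(start - starts[idx] - SEPARATOR_BYTES, 0)
        hi = min(end - starts[idx] - SEPARATOR_BYTES, payload_len)
        # Snap inward to whole tokens so decoding never splits a token ID.
        lo += lo % TOKEN_BYTES
        hi -= hi % TOKEN_BYTES
        if hi > lo:
            per_doc.setdefault(idx, []).append((lo, hi))
    return per_doc


def remove_duplicates(text, payload, ranges, decode, min_doc_words=50):
    """Strip decoded duplicate spans from text; return None if too few words remain."""
    for lo, hi in ranges:
        n_tokens = (hi - lo) // TOKEN_BYTES
        token_ids = struct.unpack(f"<{n_tokens}H", payload[lo:hi])
        text = text.replace(decode(token_ids), "")
    return text if len(text.split()) >= min_doc_words else None


vocab = ["the", "quick", "brown", "fox", "jumps"]  # hypothetical toy vocabulary


def decode(ids):
    return " ".join(vocab[i] for i in ids)


# Two documents: 5 tokens and 3 tokens, each preceded by a separator.
doc_sizes = [SEPARATOR_BYTES + 5 * TOKEN_BYTES, SEPARATOR_BYTES + 3 * TOKEN_BYTES]
payload_0 = struct.pack("<5H", 0, 1, 2, 3, 4)
# Global bytes 14..18 cover tokens 1 and 2 ("quick brown") of document 0.
per_doc = local_ranges([(14, 18)], doc_sizes)
cleaned = remove_duplicates(
    "the quick brown fox jumps", payload_0, per_doc[0], decode, min_doc_words=2
)
```

Removing the decoded string with a plain text replacement is simple but approximate: it depends on the tokenizer decoding a token span back to exactly the characters that appear in the document.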

Usage

Use this module when duplicated passages, not only duplicated documents, must be removed, for example when preparing datasets for language model training. The external deduplicate-text-datasets Rust tool must be run between Stages 2 and 3 to produce the duplicate byte ranges that Stage 3 consumes.

Code Reference

Source Location

Repository: Huggingface_Datatrove
File: src/datatrove/pipeline/dedup/exact_substrings.py
Lines: 1-343

Signature

class ESDatasetToSequence(PipelineStepWithTokenizer):
    def __init__(
        self,
        output_folder: DataFolderLike,
        tokenizer_name_or_path: str = "gpt2",
    ):
        ...

class ESMergeSequences(PipelineStep):
    def __init__(
        self,
        data_folder: DataFolderLike,
        tasks_stage_1: int,
        bytes_per_batch: int = int(500e6),
    ):
        ...

class ESRangeRemover(PipelineStepWithTokenizer):
    def __init__(
        self,
        sequence_folder: DataFolderLike,
        tokenizer_name_or_path: str = "gpt2",
        min_doc_words: int = 50,
        language: str = Languages.english,
    ):
        ...

Import

from datatrove.pipeline.dedup.exact_substrings import (
    ESDatasetToSequence,
    ESMergeSequences,
    ESRangeRemover,
)

I/O Contract

Inputs

Name	Type	Required	Description
output_folder	DataFolderLike	Yes (Stage 1)	Folder where tokenized sequences are saved
tokenizer_name_or_path	str	No (Stages 1 and 3)	HuggingFace tokenizer name or path (default: "gpt2")
data_folder	DataFolderLike	Yes (Stage 2)	Folder containing sequence files from Stage 1
tasks_stage_1	int	Yes (Stage 2)	Number of tasks used in Stage 1
bytes_per_batch	int	No (Stage 2)	Number of bytes read per batch while merging sequence files (default: int(500e6))
sequence_folder	DataFolderLike	Yes (Stage 3)	Folder containing sequences and byte range files
min_doc_words	int	No (Stage 3)	Minimum number of words a document must retain after removal to be kept (default: 50)
language	str	No (Stage 3)	Language for word tokenization (default: Languages.english)

Outputs

Name	Type	Description
Sequence files	Binary	Tokenized document sequences with separators (Stage 1)
Size files	Binary	Byte lengths of each document's sequence (Stage 1)
Big sequence	Binary	Single concatenated sequence of all documents (Stage 2)
Byte offsets	Binary	Cumulative byte offsets per file (Stage 2)
Filtered documents	DocumentsPipeline	Documents with duplicate substrings removed (Stage 3)

Usage Examples

Basic Usage

from datatrove.pipeline.dedup.exact_substrings import (
    ESDatasetToSequence,
    ESMergeSequences,
    ESRangeRemover,
)

# Stage 1: Convert documents to token sequences
stage1 = ESDatasetToSequence(
    output_folder="/data/dedup/sequences",
    tokenizer_name_or_path="gpt2",
)

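# Stage 2: Merge all sequences into one.
# tasks_stage_1 must equal the number of tasks that ran Stage 1.
stage2 = ESMergeSequences(
    data_folder="/data/dedup/sequences",
    tasks_stage_1=100,
)

# External step: run the deduplicate-text-datasets Rust tool on the merged
# sequence to produce the duplicate byte ranges that Stage 3 reads.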

# Stage 3: Remove duplicate substrings
stage3 = ESRangeRemover(
    sequence_folder="/data/dedup/sequences",
    tokenizer_name_or_path="gpt2",
    min_doc_words=50,
)
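
The stages only do work once they are placed in pipelines and run by an executor. The sketch below shows one way to wire them with datatrove's LocalPipelineExecutor, JsonlReader, and JsonlWriter. The helper name, folder arguments, and task count are hypothetical. It assumes that Stage 3 must read the same documents, in the same order and with the same number of tasks, as Stage 1, so that byte ranges line up with documents.

```python
def build_executors(input_folder: str, sequence_folder: str, output_folder: str, tasks: int = 4):
    """Return (stage1, stage2, stage3) executors; run them in that order."""
    # Imported lazily so this sketch can be loaded without datatrove installed.
    from datatrove.executor.local import LocalPipelineExecutor
    from datatrove.pipeline.dedup.exact_substrings import (
        ESDatasetToSequence,
        ESMergeSequences,
        ESRangeRemover,
    )
    from datatrove.pipeline.readers import JsonlReader
    from datatrove.pipeline.writers.jsonl import JsonlWriter

    # Stage 1: each task writes one sequence file and one sizes file.
    stage1 = LocalPipelineExecutor(
        pipeline=[JsonlReader(input_folder), ESDatasetToSequence(output_folder=sequence_folder)],
        tasks=tasks,
        workers=tasks,
    )
    # Stage 2: a single task merges every Stage 1 file into one sequence.
    stage2 = LocalPipelineExecutor(
        pipeline=[ESMergeSequences(data_folder=sequence_folder, tasks_stage_1=tasks)],
        tasks=1,
        workers=1,
    )
    # Stage 3: re-read the original documents and strip the duplicate ranges.
    stage3 = LocalPipelineExecutor(
        pipeline=[
            JsonlReader(input_folder),
            ESRangeRemover(sequence_folder=sequence_folder),
            JsonlWriter(output_folder),
        ],
        tasks=tasks,
        workers=tasks,
    )
    return stage1, stage2, stage3
```

A typical run would call stage1.run() and stage2.run(), then run the external Rust tool on the merged sequence, and only then call stage3.run().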

Related Pages

Principle:Huggingface_Datatrove_Exact_Substring_Deduplication
