Implementation:Huggingface Datatrove ExactSubstringDedup

Domains: Data Deduplication, NLP
Last Updated: 2026-02-14 17:00 GMT

Overview

The exact substring deduplication module implements a three-stage pipeline (ESDatasetToSequence, ESMergeSequences, ESRangeRemover) that uses suffix arrays to remove substrings duplicated across a dataset, following the methodology from "Deduplicating Training Data Makes Language Models Better" (arXiv:2107.06499).

Description

Unlike whole-document deduplication, exact substring deduplication identifies and removes duplicated passages inside documents, even when the documents as a whole differ. It works at a finer granularity than document-level methods, catching shared boilerplate text, copied paragraphs, and repeated passages that would survive document-level deduplication.

ESDatasetToSequence (Stage 1) tokenizes each document using a HuggingFace tokenizer, prepends a unique 12-byte separator (containing rank and doc ID markers), and writes the token bytes as a binary sequence file. It also writes a companion sizes file recording the byte length of each document's sequence.
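The Stage 1 byte layout can be sketched with the standard library alone. This is a minimal illustration, not the module's code: the separator layout (two `0xFFFF` markers, each followed by a little-endian 32-bit integer), the `uint16` token encoding, and the helper names `make_separator` and `encode_document` are assumptions chosen to match the 12-byte size stated above. The token IDs are hypothetical and stand in for real tokenizer output.

```python
import struct

SEPARATOR_BYTES = 12  # 2 + 4 + 2 + 4 bytes, matching the separator size described above


def make_separator(rank: int, doc_id: int) -> bytes:
    """Build a 12-byte separator (this exact layout is an assumption)."""
    marker = b"\xff\xff"
    return marker + struct.pack("<I", doc_id) + marker + struct.pack("<I", rank)


def encode_document(token_ids: list[int], rank: int, doc_id: int) -> bytes:
    """Prepend the separator to the token IDs, stored as little-endian uint16."""
    body = struct.pack(f"<{len(token_ids)}H", *token_ids)
    return make_separator(rank, doc_id) + body


# Hypothetical token IDs for two short documents processed by rank 0.
docs = [[15496, 995], [15496, 995, 0]]
encoded = [encode_document(ids, rank=0, doc_id=i) for i, ids in enumerate(docs)]

# A companion sizes file would record one byte length per document.
sizes = [len(b) for b in encoded]
```

Storing tokens as 16-bit integers only works for vocabularies of at most 65,536 entries, which is one reason the encoding is labeled an assumption here.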

ESMergeSequences (Stage 2) concatenates all per-rank sequence files into a single large binary sequence and records the cumulative byte offset at which each file ends. The merged sequence is then processed by an external Rust tool (deduplicate-text-datasets), which builds a suffix array and identifies duplicate byte ranges.
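The merge step amounts to a batched copy with a running total. The sketch below uses in-memory buffers in place of the per-rank files; `merge_sequences` is a hypothetical helper, and the tiny `bytes_per_batch` value is chosen only to exercise the batching loop.

```python
import io


def merge_sequences(sequence_files, bytes_per_batch=4):
    """Concatenate file-like objects; return (merged bytes, cumulative offsets)."""
    merged = io.BytesIO()
    offsets = [0]
    for f in sequence_files:
        written = 0
        while True:
            chunk = f.read(bytes_per_batch)
            if not chunk:
                break
            merged.write(chunk)
            written += len(chunk)
        # Each entry is the byte position where this file ends in the merged sequence.
        offsets.append(offsets[-1] + written)
    return merged.getvalue(), offsets


files = [io.BytesIO(b"aaaaaa"), io.BytesIO(b"bbb"), io.BytesIO(b"")]
big_sequence, byte_offsets = merge_sequences(files)
```

Reading in fixed-size batches keeps memory use bounded regardless of how large each per-rank file is.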

ESRangeRemover (Stage 3) reads the duplicate byte ranges produced by the external tool, maps them back to individual documents using the byte offsets, decodes the duplicated substrings, removes them from each document's text, and drops documents that fall below a minimum word count after removal.
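Mapping global byte ranges back to documents is a search over cumulative offsets. The following sketch shows one way to do it with `bisect`; the helper names `assign_ranges` and `remove_ranges` are hypothetical, and the sketch ignores the separator bytes and token decoding that the real stage has to account for.

```python
from bisect import bisect_right


def assign_ranges(doc_sizes, dup_ranges):
    """Map global (start, end) byte ranges to per-document local ranges.

    doc_sizes: byte length of each document's sequence, in order.
    dup_ranges: sorted global ranges with exclusive end.
    A range that crosses a document boundary is split at that boundary.
    """
    starts = [0]
    for size in doc_sizes:
        starts.append(starts[-1] + size)
    per_doc = {i: [] for i in range(len(doc_sizes))}
    for start, end in dup_ranges:
        i = bisect_right(starts, start) - 1  # document containing `start`
        while i < len(doc_sizes) and start < end:
            local_end = min(end, starts[i + 1])
            if local_end > start:
                per_doc[i].append((start - starts[i], local_end - starts[i]))
            start = local_end
            i += 1
    return per_doc


def remove_ranges(data: bytes, ranges):
    """Drop the given sorted, non-overlapping local ranges from `data`."""
    kept, cursor = [], 0
    for start, end in ranges:
        kept.append(data[cursor:start])
        cursor = end
    kept.append(data[cursor:])
    return b"".join(kept)


local = assign_ranges([16, 18], [(10, 20), (20, 30)])
cleaned = remove_ranges(b"abcdefgh", [(2, 4), (6, 7)])
```

After removal, the real stage re-counts words in the remaining text and drops documents shorter than `min_doc_words`.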

Usage

Use this module when training data needs passage-level deduplication rather than document-level deduplication, particularly when preparing datasets for language model training. The external deduplicate-text-datasets Rust tool must be run between Stages 2 and 3.

Code Reference

Signature

class ESDatasetToSequence(PipelineStepWithTokenizer):
    def __init__(
        self,
        output_folder: DataFolderLike,
        tokenizer_name_or_path: str = "gpt2",
    ):

class ESMergeSequences(PipelineStep):
    def __init__(
        self,
        data_folder: DataFolderLike,
        tasks_stage_1: int,
        bytes_per_batch: int = int(500e6),
    ):

class ESRangeRemover(PipelineStepWithTokenizer):
    def __init__(
        self,
        sequence_folder: DataFolderLike,
        tokenizer_name_or_path: str = "gpt2",
        min_doc_words: int = 50,
        language: str = Languages.english,
    ):

Import

from datatrove.pipeline.dedup.exact_substrings import (
    ESDatasetToSequence,
    ESMergeSequences,
    ESRangeRemover,
)

I/O Contract

Inputs

Name                    Type            Required       Description
output_folder           DataFolderLike  Yes (Stage 1)  Folder where tokenized sequences are saved
tokenizer_name_or_path  str             No             Hugging Face tokenizer name or path (default: "gpt2")
data_folder             DataFolderLike  Yes (Stage 2)  Folder containing sequence files from Stage 1
tasks_stage_1           int             Yes (Stage 2)  Number of tasks used in Stage 1
bytes_per_batch         int             No             Number of bytes read per batch while merging (default: int(500e6))
sequence_folder         DataFolderLike  Yes (Stage 3)  Folder containing sequences and byte range files
min_doc_words           int             No             Minimum words to keep a document after removal (default: 50)
language                str             No             Language for word tokenization (default: Languages.english)

Outputs

Name                Type               Description
Sequence files      Binary             Tokenized document sequences with separators (Stage 1)
Size files          Binary             Byte lengths of each document's sequence (Stage 1)
Big sequence        Binary             Single concatenated sequence of all documents (Stage 2)
Byte offsets        Binary             Cumulative byte offsets per file (Stage 2)
Filtered documents  DocumentsPipeline  Documents with duplicate substrings removed (Stage 3)

Usage Examples

Basic Usage

from datatrove.pipeline.dedup.exact_substrings import (
    ESDatasetToSequence,
    ESMergeSequences,
    ESRangeRemover,
)

# Stage 1: Convert documents to token sequences
stage1 = ESDatasetToSequence(
    output_folder="/data/dedup/sequences",
    tokenizer_name_or_path="gpt2",
)

# Stage 2: Merge all sequences into one
stage2 = ESMergeSequences(
    data_folder="/data/dedup/sequences",
    tasks_stage_1=100,
)

# (External step: run deduplicate-text-datasets Rust tool)

# Stage 3: Remove duplicate substrings
stage3 = ESRangeRemover(
    sequence_folder="/data/dedup/sequences",
    tokenizer_name_or_path="gpt2",
    min_doc_words=50,
)
