Implementation:Huggingface Datatrove ExactDedupSignature

Revision as of 13:01, 16 February 2026 by Admin (Auto-imported from implementations/Huggingface_Datatrove_ExactDedupSignature.md)
Knowledge Sources
Domains: Data Deduplication, Data Processing
Last Updated: 2026-02-14 17:00 GMT

Overview

The exact deduplication module implements a three-stage distributed pipeline (ExactDedupSignature, ExactFindDedups, ExactDedupFilter) that removes documents whose content is identical, as determined by hash matching. It also provides ExactDedupConfig for configuration and ExactDedupBuildIndex for cross-dataset deduplication.

Description

This module provides exact content deduplication in multiple stages. ExactDedupConfig is a dataclass that holds the configuration shared by all stages: a content_getter callable that extracts the content to hash from each document, an optional document_priority callable that determines which duplicate to keep, a HashConfig with the hash function settings, and an only_dedup_in_index flag (True by default).

ExactDedupSignature (Stage 1) is a PipelineStep that hashes each document's content using the configured hash function and stores the results as sorted binary signature files. Signatures are partitioned into finder_workers hash-range buckets so that Stage 2 can process the buckets in parallel. Each signature contains the hash value, a priority value (1 to 65535), and the document ID.
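The hashing and bucketing described above can be illustrated with a minimal, standard-library-only sketch. It is not Datatrove's implementation: the 64-bit truncated SHA-1 hash, the `>QHI` binary layout, the integer document IDs, and the helper names (`content_hash`, `bucket_for`, `pack_signature`, `build_buckets`) are all assumptions chosen for illustration.

```python
import hashlib
import struct
from typing import List, Sequence, Tuple, Union

HASH_BITS = 64  # assumed hash width for this sketch


def content_hash(content: Union[bytes, str]) -> int:
    """Hash content to an unsigned 64-bit integer (truncated SHA-1, an illustrative choice)."""
    data = content.encode("utf-8") if isinstance(content, str) else content
    return int.from_bytes(hashlib.sha1(data).digest()[: HASH_BITS // 8], "big")


def bucket_for(hash_value: int, finder_workers: int) -> int:
    """Map a hash to one of finder_workers contiguous, equally sized hash ranges."""
    return (hash_value * finder_workers) >> HASH_BITS


def pack_signature(hash_value: int, priority: int, doc_id: int) -> bytes:
    """Pack (hash, priority, doc_id) into 8 + 2 + 4 bytes (hypothetical layout)."""
    if not 1 <= priority <= 65535:
        raise ValueError("priority must be in 1..65535")
    return struct.pack(">QHI", hash_value, priority, doc_id)


def build_buckets(
    docs: Sequence[Tuple[str, int]], finder_workers: int
) -> List[List[Tuple[int, int, int]]]:
    """Hash (text, priority) pairs and return one sorted signature list per bucket."""
    buckets: List[List[Tuple[int, int, int]]] = [[] for _ in range(finder_workers)]
    for doc_id, (text, priority) in enumerate(docs):
        h = content_hash(text)
        buckets[bucket_for(h, finder_workers)].append((h, priority, doc_id))
    for bucket in buckets:
        # Sort by hash, then by descending priority, so the preferred copy comes first.
        bucket.sort(key=lambda s: (s[0], -s[1], s[2]))
    return buckets
```

Because identical content always yields the same hash, all copies of a document land in the same bucket, which is what lets each Stage 2 worker handle its bucket independently.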

ExactFindDedups (Stage 2) merges all sorted signature files using a heap-based priority queue, identifies documents that share the same hash (exact duplicates), and writes out the IDs of the duplicates to remove. Within each group of duplicates, the highest-priority document is retained. It can optionally deduplicate against a pre-built index for cross-dataset deduplication.
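As a rough illustration of the merge step, the sketch below uses Python's `heapq.merge` to combine several already-sorted runs of `(hash, priority, doc_id)` tuples and flag every document after the first one seen for each hash. The tuple layout, the sort key (hash ascending, priority descending), and the function name `find_duplicates` are assumptions for this example rather than the library's actual data format.

```python
import heapq
from typing import Dict, Hashable, Iterable, List, Sequence, Tuple

Signature = Tuple[int, int, Hashable]  # (hash, priority, doc_id), assumed layout


def find_duplicates(
    sorted_runs: Iterable[Sequence[Signature]],
) -> Tuple[List[Hashable], Dict[Hashable, int]]:
    """Merge sorted runs and return (doc_ids_to_remove, cluster_size_per_kept_doc).

    Each run must already be sorted by (hash, -priority), so that within a
    group of equal hashes the highest-priority document is seen first.
    """
    to_remove: List[Hashable] = []
    cluster_sizes: Dict[Hashable, int] = {}
    last_hash = None
    keeper = None
    merged = heapq.merge(*sorted_runs, key=lambda s: (s[0], -s[1]))
    for hash_value, _priority, doc_id in merged:
        if hash_value == last_hash:
            # Same hash as the previous signature: this is a lower-priority copy.
            to_remove.append(doc_id)
            cluster_sizes[keeper] += 1
        else:
            # First (highest-priority) document for a new hash: keep it.
            last_hash, keeper = hash_value, doc_id
            cluster_sizes[keeper] = 1
    return to_remove, cluster_sizes
```

`heapq.merge` only holds one pending item per run in memory, which is the property that makes a heap-based merge suitable for signature files too large to load at once.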

ExactDedupFilter (Stage 3) loads the duplicate document IDs from Stage 2, iterates through the document pipeline, drops flagged duplicates (optionally saving them via an exclusion writer), and annotates surviving documents with a duplicate_count metadata field.
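The filtering pass can be sketched as a simple generator. This is an illustration under stated assumptions, not the library's code: `Doc` is a hypothetical stand-in for Datatrove's document type, the exclusion writer is modelled as a plain list, and `duplicate_count` is assumed to mean the number of removed copies of a surviving document (the page does not define it precisely).

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Iterator, List, Optional, Set


@dataclass
class Doc:
    """Hypothetical stand-in for a pipeline document."""
    id: str
    text: str
    metadata: dict = field(default_factory=dict)


def filter_duplicates(
    docs: Iterable[Doc],
    duplicate_ids: Set[str],
    duplicate_counts: Optional[Dict[str, int]] = None,
    excluded: Optional[List[Doc]] = None,
) -> Iterator[Doc]:
    """Yield documents not flagged as duplicates, annotating each survivor."""
    duplicate_counts = duplicate_counts or {}
    for doc in docs:
        if doc.id in duplicate_ids:
            # Flagged in Stage 2: drop it, optionally keeping a copy for inspection.
            if excluded is not None:
                excluded.append(doc)
            continue
        # Assumed meaning: how many copies of this document were removed.
        doc.metadata["duplicate_count"] = duplicate_counts.get(doc.id, 0)
        yield doc
```

Keeping the duplicate IDs in a set makes each membership check constant time, so the pass stays linear in the number of documents.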

ExactDedupBuildIndex creates a hash-only index from signature files for cross-dataset deduplication.
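One way to picture a hash-only index is as a sorted list of unique hashes with the priority and document ID stripped away. The sketch below assumes signatures are `(hash, priority, doc_id)` tuples sorted by hash and that hashes are serialized as 8-byte big-endian integers; the function names and the on-disk layout are hypothetical and may differ from what ExactDedupBuildIndex actually writes.

```python
import heapq
import struct
from bisect import bisect_left
from typing import Iterable, List, Sequence, Tuple


def build_index(sorted_runs: Iterable[Sequence[Tuple[int, int, int]]]) -> List[int]:
    """Merge hash-sorted signature runs into a sorted list of unique hashes."""
    index: List[int] = []
    for sig in heapq.merge(*sorted_runs, key=lambda s: s[0]):
        hash_value = sig[0]
        if not index or index[-1] != hash_value:
            index.append(hash_value)
    return index


def serialize_index(index: Sequence[int]) -> bytes:
    """Serialize hashes as consecutive 8-byte big-endian integers (assumed layout)."""
    return b"".join(struct.pack(">Q", h) for h in index)


def in_index(index: Sequence[int], hash_value: int) -> bool:
    """Binary-search the sorted index for a hash."""
    i = bisect_left(index, hash_value)
    return i < len(index) and index[i] == hash_value
```

A later run on another dataset could then treat any document whose hash is found by `in_index` as already seen, which is the idea behind cross-dataset deduplication.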

Usage

Use this module when you need to remove documents with identical content from a large dataset. The three stages must be run in sequence, typically as separate pipeline executor tasks to allow distributed processing at each stage.

Code Reference

Source Location

Signature

@dataclass
class ExactDedupConfig:
    content_getter: Callable[[Document], bytes | str]
    document_priority: Callable[[Document], int] | None = None
    hash_config: HashConfig = field(default_factory=HashConfig)
    only_dedup_in_index: bool = True

class ExactDedupSignature(PipelineStep):
    def __init__(
        self,
        output_folder: DataFolderLike,
        config: ExactDedupConfig,
        finder_workers: int = 1,
    ): ...

class ExactFindDedups(PipelineStep):
    def __init__(
        self,
        data_folder: DataFolderLike,
        output_folder: DataFolderLike,
        config: ExactDedupConfig,
        save_cluster_size: bool = False,
        index_folder: DataFolderLike | None = None,
        lines_to_buffer: int = 5,
    ): ...

class ExactDedupFilter(PipelineStep):
    def __init__(
        self,
        data_folder: DataFolderLike,
        config: ExactDedupConfig,
        exclusion_writer: DiskWriter | None = None,
    ): ...

Import

from datatrove.pipeline.dedup.exact_dedup import (
    ExactDedupConfig,
    ExactDedupSignature,
    ExactFindDedups,
    ExactDedupFilter,
    ExactDedupBuildIndex,
)

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| output_folder | DataFolderLike | Yes (Stages 1, 2) | Folder where the stage writes its output: hash signatures in Stage 1, duplicate IDs in Stage 2 |
| config | ExactDedupConfig | Yes | Configuration with content_getter, priority function, and hash settings |
| finder_workers | int | No | Number of workers for Stage 2 (default: 1) |
| data_folder | DataFolderLike | Yes (Stages 2, 3) | Folder containing the output of the previous stage: signature files for Stage 2, duplicate files for Stage 3 |
| index_folder | DataFolderLike | No | Folder with a pre-built index for cross-dataset dedup |
| exclusion_writer | DiskWriter | No | Writer used to save excluded duplicate documents |

Outputs

| Name | Type | Description |
|---|---|---|
| Signature files | Binary | Sorted (hash, priority, doc_id) tuples partitioned by hash range |
| Duplicate files | Binary | Document IDs flagged as duplicates |
| Filtered documents | DocumentsPipeline | Documents with duplicates removed and duplicate_count metadata |

Usage Examples

Basic Usage

from datatrove.pipeline.dedup.exact_dedup import (
    ExactDedupConfig,
    ExactDedupSignature,
    ExactFindDedups,
    ExactDedupFilter,
)

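config = ExactDedupConfig(
    content_getter=lambda doc: doc.text,
    # Priorities must fall in 1-65535, so clamp the integer quality score into that range
    document_priority=lambda doc: max(1, min(65535, int(doc.metadata.get("quality_score", 1)))),
)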

# Stage 1: Generate signatures
stage1 = ExactDedupSignature(
    output_folder="s3://bucket/dedup/sigs",
    config=config,
    finder_workers=4,
)

# Stage 2: Find duplicates (intended to run with 4 workers, matching finder_workers above)
stage2 = ExactFindDedups(
    data_folder="s3://bucket/dedup/sigs",
    output_folder="s3://bucket/dedup/dups",
    config=config,
)

# Stage 3: Filter documents
stage3 = ExactDedupFilter(
    data_folder="s3://bucket/dedup/dups",
    config=config,
)
