
Principle:Huggingface Datatrove Duplicate Removal

From Leeroopedia
Last Updated 2026-02-14 00:00 GMT

Overview

This stage removes documents identified as duplicates from the document stream using pre-computed removal lists. It is the final stage of the MinHash deduplication pipeline: it reads the removal lists produced by the clustering stage and filters out every document marked for deletion, keeping exactly one representative per cluster.

Description

Duplicate removal is the fourth and final stage of the MinHash deduplication pipeline. It operates as a streaming filter over the original DocumentsPipeline, consulting per-rank removal lists to decide which documents to pass through and which to drop.

The process works as follows:

  1. Load removal list: For the current worker rank, load the .remove file containing a sorted sequence of document IDs (32-bit unsigned integers) that should be removed.
  2. Stream documents: Iterate through the input DocumentsPipeline, maintaining a pointer into the removal list.
  3. Filter by index: For each document, compare its sequential index against the next removal ID. If the index matches, the document is dropped (or optionally redirected to an exclusion writer for analysis). If it does not match, the document is forwarded downstream.
  4. Load optional metadata: If cluster IDs or cluster sizes were saved during the clustering stage, these are loaded from .clusters and .sizes files and attached to each document's metadata dictionary.
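Step 4 can be sketched as a zip over three parallel streams. The metadata key names below (minhash_cluster_id, minhash_cluster_size) are illustrative assumptions, not confirmed field names:

```python
def attach_cluster_metadata(documents, cluster_ids, cluster_sizes):
    """Attach per-document cluster info loaded from .clusters / .sizes files.

    `documents` is an iterable of dicts carrying a "metadata" dict; the
    metadata key names used here are hypothetical, for illustration only.
    """
    for doc, cid, size in zip(documents, cluster_ids, cluster_sizes):
        doc.setdefault("metadata", {})
        doc["metadata"]["minhash_cluster_id"] = cid
        doc["metadata"]["minhash_cluster_size"] = size
        yield doc
```

Because the streams are consumed in lockstep, this adds no memory overhead beyond the current document.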

The filter is stateless with respect to document content -- it makes decisions purely based on the sequential document index and the pre-computed removal list. This makes it efficient and idempotent.

If no .remove file exists for a given rank, all documents from that rank are passed through with a warning, under the assumption that no duplicates were found for that shard.
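As a minimal sketch, a .remove file of packed 32-bit unsigned integers could be read like this; the little-endian byte order is an assumption about the on-disk format, not something stated on this page:

```python
import struct

def read_removal_ids(path):
    """Yield document indices from a .remove file of packed uint32 values.

    Assumes little-endian ("<I") encoding -- verify against the format
    actually used by the signature/clustering stages.
    """
    with open(path, "rb") as f:
        while chunk := f.read(4):
            yield struct.unpack("<I", chunk)[0]
```

The generator streams the file four bytes at a time, matching the single look-ahead pointer the filter needs.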

Usage

This is the final stage of the 4-stage MinHash dedup pipeline, applied to the original document stream. Key properties:

  • Takes the same DocumentsPipeline that was used in the signature computation stage
  • Documents must be in the same order as when signatures were computed (the index-based filtering depends on sequential ordering)
  • Supports optional exclusion writer for redirecting removed documents to a separate output for inspection
  • Can load cluster IDs and cluster sizes as document metadata for downstream analysis
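In datatrove this stage is typically wired up with MinhashDedupFilter. The sketch below is a configuration example written from memory; the exact class locations and parameter names (input_folder, exclusion_writer) should be checked against the installed version:

```python
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.dedup import MinhashDedupFilter
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.writers.jsonl import JsonlWriter

# Stage 4 must read the SAME input, in the SAME order, as stage 1
stage4 = LocalPipelineExecutor(
    pipeline=[
        JsonlReader("input_data/"),
        MinhashDedupFilter(
            input_folder="minhash/remove_ids",         # .remove files from stage 3
            exclusion_writer=JsonlWriter("removed/"),  # optional: keep dropped docs
        ),
        JsonlWriter("deduped_output/"),
    ],
    tasks=4,  # one task per rank/shard, matching the earlier stages
)
stage4.run()
```

The number of tasks must match the sharding used when signatures were computed, so that each rank finds its own removal list.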

Theoretical Basis

The duplicate removal stage implements a sorted merge join between the document stream and the removal list:

def duplicate_removal(documents, removal_list):
    removals = iter(removal_list)
    next_removal = next(removals, None)   # None once the list is exhausted
    for idx, doc in enumerate(documents):
        if idx == next_removal:
            # drop doc (or write it to the exclusion writer)
            next_removal = next(removals, None)
        else:
            yield doc                     # pass through

This runs in O(N) time where N is the total number of documents, with O(1) memory overhead beyond the single look-ahead pointer into the removal list. The removal list is consumed sequentially because both the document stream and the removal IDs are in ascending index order.

Correctness guarantee: The clustering stage ensures that for each cluster of near-duplicate documents, exactly one member (the cluster root) is absent from all removal lists. All non-root members appear in their respective rank's removal list. Therefore, after filtering, exactly one copy of each near-duplicate group survives.

Exclusion writing: The optional exclusion writer implements the tee pattern -- removed documents are written to a separate output stream rather than being silently discarded. This enables post-hoc analysis of deduplication decisions (e.g., inspecting which documents were removed and why).
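The tee pattern can be illustrated with a pure-Python sketch, independent of datatrove's actual writer API:

```python
def filter_with_exclusion(documents, removal_ids, exclusion_sink):
    """Index-based merge-join filter that tees removed docs to a sink."""
    removals = iter(removal_ids)
    next_removal = next(removals, None)
    for idx, doc in enumerate(documents):
        if idx == next_removal:
            exclusion_sink.append(doc)          # tee: keep for inspection
            next_removal = next(removals, None)
        else:
            yield doc

removed = []
kept = list(filter_with_exclusion(["d0", "d1", "d2", "d3", "d4"], [1, 3], removed))
# kept == ["d0", "d2", "d4"]; removed == ["d1", "d3"]
```

Nothing is silently discarded: every input document ends up either downstream or in the exclusion sink.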
