
Principle:Huggingface Datatrove Duplicate Clustering

From Leeroopedia
Metadata
Last Updated 2026-02-14 00:00 GMT

Overview

Duplicate clustering groups candidate duplicate pairs into connected components using a union-find structure, identifying clusters of near-duplicate documents. Within each cluster, one document (the cluster root) is designated as the keeper and all others are marked for removal. This transitive closure step is essential for correct deduplication.

Description

Duplicate clustering takes the pairwise duplicate matches produced by the bucket matching stage and groups them into clusters using a union-find (disjoint set) data structure. The process works as follows:

  1. Read duplicate pairs: All .dups files from the bucket matching stage are read. Each record contains a (file_id1, doc_id1, file_id2, doc_id2) tuple representing a candidate duplicate pair.
  2. Build union-find: Each unique (file_id, doc_id) pair becomes a node. The union operation merges the sets of two matched documents. The implementation uses path compression and union by size for efficiency.
  3. Handle index matches: Pairs where one side is a sentinel value (0xFFFFFFFF, 0xFFFFFFFF) represent matches against a pre-existing index. The sentinel is always made the root of such unions, ensuring that any document matching an indexed entry will be marked for removal.
  4. Determine keepers vs. removals: After all pairs are processed, every node whose parent is not itself (i.e., it is not the root of its cluster) is marked for removal. The root of each cluster is the keeper.
  5. Write removal lists: For each rank (file), a .remove file is produced listing the document IDs to be removed.
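The five steps above can be sketched end to end in a few lines. This is an illustrative reimplementation, not datatrove's actual code: the keeper here is simply whichever node ends up as the root (order-dependent, since size-based attachment is omitted for brevity), and pairs are passed in-memory rather than read from .dups files.

```python
from collections import defaultdict

def cluster_pairs(pairs):
    """Union-find over (file_id, doc_id) nodes; returns per-file removal lists.

    `pairs` is an iterable of ((f1, d1), (f2, d2)) candidate duplicate pairs,
    standing in for the records read from the .dups files.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving flattens the tree
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb  # (union by size omitted in this sketch)

    for a, b in pairs:
        union(a, b)

    # Every non-root node is marked for removal; the root is the keeper.
    removals = defaultdict(list)
    for node in sorted(parent):
        if find(node) != node:
            removals[node[0]].append(node[1])
    return removals
```

Feeding it two pairs that share document (0, 2) yields one cluster of three documents, two of which land in the removal lists.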

The transitive closure property is critical: if document A matches document B, and document B matches document C, then A, B, and C form one cluster -- even if A and C were never directly compared. Without this step, deduplication would only remove exact band-level matches, missing transitive near-duplicates.

Usage

This is the third stage of the 4-stage MinHash dedup pipeline, run after bucket matching. Key properties:

  • Must run with world_size=1 because the union-find structure must see all duplicate pairs globally
  • Reads all .dups files from the bucket matching stage
  • Produces per-rank .remove files listing document IDs to filter
  • Optionally produces .clusters and .sizes metadata files for analysis
  • Supports ignore_index_matches to skip pairs involving the index sentinel
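Reading the .dups input can be sketched as fixed-width record parsing. The byte layout below (four little-endian uint32s per pair) is an assumption for illustration; consult the datatrove source for the exact on-disk format.

```python
import struct

# Assumed record layout: (file_id1, doc_id1, file_id2, doc_id2) as four
# little-endian uint32s. The real .dups format may differ.
RECORD = struct.Struct("<4I")

def read_dups(data: bytes):
    """Yield (file_id1, doc_id1, file_id2, doc_id2) tuples from a .dups buffer."""
    for offset in range(0, len(data), RECORD.size):
        yield RECORD.unpack_from(data, offset)
```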

Theoretical Basis

The clustering stage applies two fundamental algorithms:

Union-Find (Disjoint Set Union): The union-find data structure maintains a forest of trees, where each tree represents a cluster. Two optimizations ensure near-constant time per operation:

  • Path compression: During find(x), every node on the path from x to the root is updated to point directly to the root, flattening the tree.
  • Union by size: During union(a, b), the smaller tree is attached under the root of the larger tree, keeping tree depth logarithmic.
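Both optimizations can be shown together in a generic implementation (a textbook sketch, not datatrove's exact code):

```python
class DisjointSet:
    """Union-find with path compression and union by size."""

    def __init__(self):
        self.parent = {}
        self.size = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        self.size.setdefault(x, 1)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the walked path at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return ra
        # Union by size: attach the smaller tree under the larger root.
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return ra
```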

With both optimizations, each find and union operation runs in amortized O(alpha(n)) time, where alpha is the inverse Ackermann function -- effectively constant for all practical input sizes.

UNION-FIND CLUSTERING:
    For each duplicate pair (a, b) from all .dups files:
        root_a = FIND(a)     # with path compression
        root_b = FIND(b)
        if root_a != root_b:
            UNION(root_a, root_b)  # by size; sentinel always becomes root

    For each node in sorted order:
        if FIND(node) != node:
            EMIT node to .remove file for its rank

Transitive Closure: The union-find inherently computes the transitive closure of the "is duplicate of" relation. If A ~ B and B ~ C, then after union(A, B) and union(B, C), all three share the same root. This captures the full equivalence class without needing to enumerate all transitive pairs explicitly.
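A three-document toy run makes the equivalence-class behavior concrete. This uses a bare dict-based find with no optimizations, purely to show the transitivity:

```python
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        x = parent[x]
    return x

def union(a, b):
    ra, rb = find(a), find(b)
    if ra != rb:
        parent[ra] = rb

# A ~ B and B ~ C are the only pairs ever compared...
union("A", "B")
union("B", "C")

# ...yet A and C end up sharing a root, i.e. one cluster.
same_cluster = find("A") == find("C")
```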

Sentinel Priority: The implementation gives special treatment to the sentinel value (SENTINEL, SENTINEL) representing index matches. When unioning with a sentinel, the sentinel always becomes (or remains) the root. This ensures that any document matching an indexed entry is always marked for removal rather than kept.
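Sentinel priority amounts to one extra check inside union. The sketch below follows the rule stated above (the sentinel always becomes or remains the root); the exact code shape is an assumption, not datatrove's implementation:

```python
SENTINEL = (0xFFFFFFFF, 0xFFFFFFFF)

parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(a, b):
    ra, rb = find(a), find(b)
    if ra == rb:
        return
    # The sentinel always becomes (or remains) the root, so any document
    # that matched an indexed entry is a non-root and gets removed.
    if ra == SENTINEL:
        parent[rb] = ra
    else:
        parent[ra] = rb
```

Note that the second branch also handles the case where `rb` is the sentinel: attaching `ra` under it keeps the sentinel as root.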
