Principle: Huggingface Datatrove Duplicate Clustering
| Knowledge Sources | |
|---|---|
| Domains | |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Grouping candidate duplicate pairs into connected components using union-find to identify document clusters. Within each cluster, one document is designated as the keeper (the smallest ID) and all others are marked for removal. This transitive closure step is essential for correct deduplication.
Description
Duplicate clustering takes the pairwise duplicate matches produced by the bucket matching stage and groups them into clusters using a union-find (disjoint set) data structure. The process works as follows:
- Read duplicate pairs: All `.dups` files from the bucket matching stage are read. Each record contains a `(file_id1, doc_id1, file_id2, doc_id2)` tuple representing a candidate duplicate pair.
- Build union-find: Each unique `(file_id, doc_id)` pair becomes a node. The `union` operation merges the sets of two matched documents. The implementation uses path compression and union by size for efficiency.
- Handle index matches: Pairs where one side is the sentinel value `(0xFFFFFFFF, 0xFFFFFFFF)` represent matches against a pre-existing index. The sentinel is always made the root of such unions, ensuring that any document matching an indexed entry will be marked for removal.
- Determine keepers vs. removals: After all pairs are processed, every node whose parent is not itself (i.e., it is not the root of its cluster) is marked for removal. The root of each cluster is the keeper.
- Write removal lists: For each rank (file), a `.remove` file is produced listing the document IDs to be removed.
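The steps above can be sketched end to end in plain Python. This is an illustrative toy, not the Datatrove implementation: the dict-based union-find, the in-memory `pairs` list, and the `removals` dict are stand-ins for the real `.dups` readers and `.remove` writers.

```python
# Hypothetical sketch of the clustering stage: union-find over duplicate
# pairs, then per-rank removal lists. Not the actual Datatrove code.

SENTINEL = (0xFFFFFFFF, 0xFFFFFFFF)  # index-match marker

parent = {}  # node -> parent node
size = {}    # root -> tree size

def find(x):
    """Return the root of x's cluster, compressing the path as we go."""
    parent.setdefault(x, x)
    size.setdefault(x, 1)
    root = x
    while parent[root] != root:
        root = parent[root]
    while parent[x] != root:          # path compression
        parent[x], x = root, parent[x]
    return root

def union(a, b):
    """Merge the clusters of a and b (union by size; sentinel always wins)."""
    ra, rb = find(a), find(b)
    if ra == rb:
        return
    if rb == SENTINEL or (ra != SENTINEL and size[ra] < size[rb]):
        ra, rb = rb, ra               # ra becomes the root
    parent[rb] = ra
    size[ra] += size[rb]

# Candidate duplicate pairs of (file_id, doc_id) nodes, as if read from .dups
pairs = [((0, 1), (0, 2)), ((0, 2), (1, 5)), ((1, 7), SENTINEL)]
for a, b in pairs:
    union(a, b)

# Every non-root node is marked for removal, grouped by file (rank)
removals = {}
for node in sorted(parent):
    if node != SENTINEL and find(node) != node:
        removals.setdefault(node[0], []).append(node[1])
```

Note how the chain `(0, 1) ~ (0, 2) ~ (1, 5)` collapses into one cluster with `(0, 1)` as keeper, and the document paired with the sentinel ends up in a removal list because the sentinel holds the root.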
The transitive closure property is critical: if document A matches document B, and document B matches document C, then A, B, and C form one cluster -- even if A and C were never directly compared. Without this step, deduplication would only remove exact band-level matches, missing transitive near-duplicates.
Usage
This is the third stage of the 4-stage MinHash dedup pipeline, run after bucket matching. Key properties:
- Must run with `world_size=1`, because the union-find structure must see all duplicate pairs globally
- Reads all `.dups` files from the bucket matching stage
- Produces per-rank `.remove` files listing document IDs to filter
- Optionally produces `.clusters` and `.sizes` metadata files for analysis
- Supports `ignore_index_matches` to skip pairs involving the index sentinel
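As a rough sketch, the stage can be wired into a pipeline along these lines. The folder paths are placeholders, and the exact parameter set of `MinhashDedupCluster` should be checked against the installed datatrove version; treat this as an approximation rather than a verbatim recipe.

```python
# Hedged sketch of running the clustering stage with datatrove.
# Paths are illustrative; verify parameter names against your version.
from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.dedup import MinhashDedupCluster

stage3 = LocalPipelineExecutor(
    pipeline=[
        MinhashDedupCluster(
            input_folder="minhash/buckets",      # .dups files from stage 2
            output_folder="minhash/remove_ids",  # per-rank .remove files
            ignore_index_matches=False,          # keep sentinel pairs
        ),
    ],
    tasks=1,  # must be 1: the union-find needs a global view of all pairs
)

if __name__ == "__main__":
    stage3.run()
```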
Theoretical Basis
The clustering stage applies two fundamental algorithms:
Union-Find (Disjoint Set Union): The union-find data structure maintains a forest of trees, where each tree represents a cluster. Two optimizations ensure near-constant time per operation:
- Path compression: During `find(x)`, every node on the path from `x` to the root is updated to point directly to the root, flattening the tree.
- Union by size: During `union(a, b)`, the smaller tree is attached under the root of the larger tree, keeping tree depth logarithmic.
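Both optimizations are easy to observe on a small array-based implementation (illustrative code, not Datatrove's):

```python
# Minimal array-based union-find illustrating both optimizations.

def find(parent, x):
    """Return the root of x, compressing the visited path to point at it."""
    root = x
    while parent[root] != root:
        root = parent[root]
    while parent[x] != root:           # path compression
        parent[x], x = root, parent[x]
    return root

def union(parent, size, a, b):
    """Merge the clusters of a and b, attaching the smaller tree."""
    ra, rb = find(parent, a), find(parent, b)
    if ra == rb:
        return
    if size[ra] < size[rb]:            # union by size
        ra, rb = rb, ra
    parent[rb] = ra
    size[ra] += size[rb]

# Path compression: start from a hand-built chain 0 <- 1 <- 2 <- 3
parent = [0, 0, 1, 2]
find(parent, 3)                        # walks 3 -> 2 -> 1 -> 0 ...
assert parent == [0, 0, 0, 0]          # ... and flattens the whole path

# Union by size: merging a size-1 tree into a size-3 tree keeps the big root
parent, size = [0, 0, 0, 3], [3, 1, 1, 1]
union(parent, size, 3, 1)
assert find(parent, 3) == 0            # small tree attached under big root
```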
With both optimizations, each find and union operation runs in amortized O(α(n)) time, where α is the inverse Ackermann function -- effectively constant for all practical input sizes.
```
UNION-FIND CLUSTERING:
    For each duplicate pair (a, b) from all .dups files:
        root_a = FIND(a)           # with path compression
        root_b = FIND(b)
        if root_a != root_b:
            UNION(root_a, root_b)  # by size; sentinel always becomes root
    For each node in sorted order:
        if FIND(node) != node:
            EMIT node to .remove file for its rank
```
Transitive Closure: The union-find inherently computes the transitive closure of the "is duplicate of" relation. If A ~ B and B ~ C, then after union(A, B) and union(B, C), all three share the same root. This captures the full equivalence class without needing to enumerate all transitive pairs explicitly.
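A few lines of Python make the closure property concrete (toy code, not the library's):

```python
# Transitive closure via union-find: A~B and B~C put A, B, and C in one
# cluster even though the pair (A, C) is never compared directly.
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving, a compression variant
        x = parent[x]
    return x

def union(a, b):
    ra, rb = find(a), find(b)
    if ra != rb:
        parent[rb] = ra

union("A", "B")
union("B", "C")
assert find("A") == find("C")  # same cluster, no direct A-C comparison
```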
Sentinel Priority: The implementation gives special treatment to the sentinel value (SENTINEL, SENTINEL) representing index matches. When unioning with a sentinel, the sentinel always becomes (or remains) the root. This ensures that any document matching an indexed entry is always marked for removal rather than kept.
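The rule amounts to one extra branch in `union`: if either root is the sentinel, it wins. A small sketch (illustrative, not the actual implementation):

```python
# Sentinel priority: when one side of a pair is the index sentinel, the
# sentinel is forced to be the root, so the real document can never be
# a keeper and is always emitted to a .remove file.
SENTINEL = (0xFFFFFFFF, 0xFFFFFFFF)
parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # compression variant (path halving)
        x = parent[x]
    return x

def union(a, b):
    ra, rb = find(a), find(b)
    if ra == rb:
        return
    if rb == SENTINEL:     # sentinel always becomes (or remains) the root
        ra, rb = rb, ra
    parent[rb] = ra

doc = (0, 42)              # a document that matched an indexed entry
union(doc, SENTINEL)
assert find(doc) == SENTINEL  # doc is not its own root -> marked for removal
```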