Principle:NVIDIA NeMo Curator Fuzzy Duplicate Identification
| Attribute | Value |
|---|---|
| Domains | Data_Curation, Deduplication |
| Implemented By | NVIDIA_NeMo_Curator_Fuzzy_IdentifyDuplicatesStage |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
Fuzzy Duplicate Identification is a technique for selecting which documents to remove from each duplicate cluster, retaining one representative per group.
Description
Within each connected component (duplicate group), Fuzzy Duplicate Identification selects one document to keep and marks the rest for removal. The current implementation uses a first-encountered strategy via cuDF's duplicated(keep="first"), which retains the first document encountered in each group and flags every later document in that group as a duplicate. The result is deterministic for a given input order.
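The keep="first" semantics can be illustrated without a GPU. The following is a minimal pure-Python sketch of the boolean mask such a call produces over a column of group IDs. The function name `duplicated_keep_first` and the sample values are illustrative assumptions, not part of NeMo Curator or cuDF.

```python
from typing import Hashable, Iterable, List


def duplicated_keep_first(values: Iterable[Hashable]) -> List[bool]:
    """Return True for every value that already appeared earlier in the sequence."""
    seen = set()
    mask = []
    for value in values:
        mask.append(value in seen)  # first occurrence -> False, repeats -> True
        seen.add(value)
    return mask


# Hypothetical group IDs for five documents, in the order they are encountered.
group_ids = [7, 7, 3, 7, 3]
mask = duplicated_keep_first(group_ids)
```

Rows where the mask is True correspond to the documents that would be flagged for removal.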
The process works as follows:
- Shuffle by group ID — Documents are shuffled (repartitioned) so that all members of the same duplicate group are co-located on the same worker. This is essential for efficient within-group deduplication.
- Within-group selection — For each group, the first document (by _curator_dedup_id order) is retained and all others are marked as duplicates.
- Output — The stage outputs only the _curator_dedup_id values of documents that should be removed, enabling downstream stages to filter them from the original dataset.
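The three steps above can be sketched on a single machine. This is a simplified simulation, not the stage's actual code: workers are modeled as Python lists, the shuffle is a modulo on the group ID, and the record values are made up. The helper names `shuffle_by_group` and `ids_to_remove` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Each record is (document ID, duplicate group ID); values are illustrative only.
Record = Tuple[int, int]


def shuffle_by_group(records: List[Record], num_workers: int) -> Dict[int, List[Record]]:
    """Step 1: route every record to a worker chosen from its group ID."""
    partitions: Dict[int, List[Record]] = defaultdict(list)
    for doc_id, group_id in records:
        partitions[group_id % num_workers].append((doc_id, group_id))
    return partitions


def ids_to_remove(partition: List[Record]) -> List[int]:
    """Steps 2 and 3: keep the lowest document ID per group, emit the rest."""
    seen_groups = set()
    removed = []
    for doc_id, group_id in sorted(partition):  # tuples sort by document ID first
        if group_id in seen_groups:
            removed.append(doc_id)
        else:
            seen_groups.add(group_id)
    return removed


records = [(0, 10), (1, 11), (2, 10), (3, 12), (4, 11), (5, 10)]
partitions = shuffle_by_group(records, num_workers=2)
removal_ids = sorted(
    doc_id for part in partitions.values() for doc_id in ids_to_remove(part)
)
```

Because every member of a group ends up in one partition, each partition can be processed independently and the per-partition results simply concatenated.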
This stage extends ShuffleStage, which provides the infrastructure for repartitioning data by a key column across distributed workers.
Usage
Fuzzy Duplicate Identification is the sixth and final stage in the fuzzy deduplication pipeline, following Connected Component Analysis. It reads connected component Parquet files and produces a list of document IDs to remove.
from nemo_curator.stages.deduplication.fuzzy.identify_duplicates import IdentifyDuplicatesStage

stage = IdentifyDuplicatesStage(
    output_path="/output/duplicates_to_remove/",
    duplicate_group_field="_duplicate_group_id",
    document_id_field="_curator_dedup_id",
)
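Since the stage writes only the IDs to remove, a later step has to drop the matching rows from the original dataset. Below is a minimal sketch of such a filter on in-memory records, assuming the removal IDs have already been loaded from the stage's output. `filter_removed` and the sample records are hypothetical, and NeMo Curator's own removal step may work differently.

```python
from typing import Any, Dict, Iterable, List, Set


def filter_removed(
    documents: Iterable[Dict[str, Any]],
    removal_ids: Set[int],
    id_field: str = "_curator_dedup_id",
) -> List[Dict[str, Any]]:
    """Keep only documents whose ID is not in the removal set."""
    return [doc for doc in documents if doc[id_field] not in removal_ids]


# Illustrative records; in practice these would come from the original dataset.
documents = [
    {"_curator_dedup_id": 0, "text": "hello world"},
    {"_curator_dedup_id": 1, "text": "hello world!"},
    {"_curator_dedup_id": 2, "text": "something else"},
]
kept = filter_removed(documents, removal_ids={1})
kept_ids = [doc["_curator_dedup_id"] for doc in kept]
```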
Theoretical Basis
Given duplicate groups identified by Connected Component Analysis, the problem reduces to selecting one representative per group. The theoretical basis involves:
- Deterministic selection — The keep="first" strategy provides a deterministic, reproducible selection rule. Given the same input order, the same representative document is always retained.
- Shuffle-based co-location — Shuffling by group ID ensures all members of a group land on the same worker, so selection can run locally on each partition. Without this shuffle, a group could be split across partitions, and a partition-local duplicate check would keep one document per partition rather than one per group.
- Minimal output — Rather than outputting all documents with a keep/remove flag, the stage outputs only the IDs of documents to remove. This minimizes output size, since typically only a fraction of documents are duplicates.
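The role of the shuffle can be seen in a small experiment. The sketch below runs the same partition-local check on unshuffled and shuffled data. The partition layouts and the helper `local_duplicates` are assumptions made for illustration only.

```python
from typing import List, Tuple

Record = Tuple[int, int]  # (document ID, duplicate group ID), illustrative values


def local_duplicates(partition: List[Record]) -> List[int]:
    """Flag every document after the first one seen for its group in this partition."""
    seen = set()
    flagged = []
    for doc_id, group_id in partition:
        if group_id in seen:
            flagged.append(doc_id)
        seen.add(group_id)
    return flagged


# Before any shuffle, groups 10 and 11 are both split across two partitions.
unshuffled = [[(0, 10), (1, 11)], [(2, 10), (3, 11), (4, 10)]]
missed = sorted(d for part in unshuffled for d in local_duplicates(part))

# After shuffling by group ID, each group lives in exactly one partition.
shuffled = [[(0, 10), (2, 10), (4, 10)], [(1, 11), (3, 11)]]
complete = sorted(d for part in shuffled for d in local_duplicates(part))
```

On the unshuffled layout only document 4 is flagged, so documents 2 and 3 survive as undetected duplicates. On the shuffled layout all three extra copies are flagged.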
Selection strategies:
The current implementation uses a simple first-encountered strategy, which is:
- Fast — No need to read document content or compute additional metrics.
- Deterministic — Reproducible results given stable input ordering.
- Content-agnostic — Does not consider document quality, length, or other features when choosing the representative.
Alternative strategies (not currently implemented) could include:
- Longest document — Retain the document with the most content.
- Highest quality score — Retain the document with the highest quality metric from a classifier.
- Most recent — Retain the most recently crawled or published version.
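As a rough illustration of how such alternatives might look, the sketch below selects the representative by an arbitrary scoring function instead of by encounter order. None of this exists in the current implementation: the function `ids_to_remove_by`, the record fields, and the scores are all hypothetical.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

Doc = Dict[str, Any]


def ids_to_remove_by(docs: List[Doc], score: Callable[[Doc], float]) -> List[int]:
    """Within each group keep the highest-scoring document; return the other IDs."""
    groups: Dict[int, List[Doc]] = defaultdict(list)
    for doc in docs:
        groups[doc["group"]].append(doc)
    removed = []
    for members in groups.values():
        # Ties go to the lower document ID so the choice stays deterministic.
        keeper = max(members, key=lambda d: (score(d), -d["id"]))
        removed.extend(d["id"] for d in members if d["id"] != keeper["id"])
    return sorted(removed)


docs = [
    {"id": 0, "group": 10, "text": "short", "quality": 0.9},
    {"id": 1, "group": 10, "text": "a much longer document", "quality": 0.4},
    {"id": 2, "group": 11, "text": "only member", "quality": 0.5},
]
remove_if_longest_kept = ids_to_remove_by(docs, score=lambda d: len(d["text"]))
remove_if_best_quality_kept = ids_to_remove_by(docs, score=lambda d: d["quality"])
```

Unlike the first-encountered rule, these strategies need access to document content or metadata at selection time, which is the cost the current content-agnostic approach avoids.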
Related Pages
- Implementation:NVIDIA_NeMo_Curator_Fuzzy_IdentifyDuplicatesStage
- NVIDIA_NeMo_Curator_Connected_Component_Analysis — The preceding stage that identifies duplicate groups
- NVIDIA_NeMo_Curator_Text_Deduplication — The parent concept covering all deduplication techniques
- NVIDIA_NeMo_Curator_FuzzyDeduplicationWorkflow — The parent workflow that orchestrates all deduplication stages