Principle:NVIDIA NeMo Curator Fuzzy Duplicate Identification

From Leeroopedia
Principle Metadata
Attribute Value
Domains Data_Curation, Deduplication
Implemented By NVIDIA_NeMo_Curator_Fuzzy_IdentifyDuplicatesStage
Last Updated 2026-02-14 17:00 GMT

Overview

Fuzzy Duplicate Identification is a technique for selecting which documents to remove from each duplicate cluster, retaining one representative per group.

Description

Within each connected component (duplicate group), Fuzzy Duplicate Identification selects one document to keep and marks the rest for removal. The current implementation uses a first-encountered strategy via cuDF's duplicated(keep="first") method, which deterministically retains the first document encountered in each group and flags all subsequent documents as duplicates.

The process works as follows:

  1. Shuffle by group ID — Documents are shuffled (repartitioned) so that all members of the same duplicate group are co-located on the same worker. This is essential for efficient within-group deduplication.
  2. Within-group selection — For each group, the first document (by _curator_dedup_id order) is retained and all others are marked as duplicates.
  3. Output — The stage outputs only the _curator_dedup_id values of documents that should be removed, enabling downstream stages to filter them from the original dataset.
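Steps 2 and 3 can be sketched with pandas, whose duplicated API matches the cuDF call used by the stage (cuDF runs the same operation on GPU). The sample data and the standalone DataFrame are illustrative; the real stage operates on shuffled partitions inside the pipeline.

```python
import pandas as pd  # cuDF exposes the same duplicated() API on GPU

# Documents already shuffled so each group is co-located (step 1).
df = pd.DataFrame({
    "_curator_dedup_id": [101, 102, 103, 104, 105],
    "_duplicate_group_id": [7, 7, 7, 9, 9],
})

# Step 2: within each group, the first row is retained; the rest are flagged.
is_duplicate = df.duplicated(subset="_duplicate_group_id", keep="first")

# Step 3: emit only the IDs of documents to remove.
ids_to_remove = df.loc[is_duplicate, "_curator_dedup_id"]
print(ids_to_remove.tolist())  # [102, 103, 105]
```

Because keep="first" depends only on row order, the same input ordering always yields the same removal list.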

This stage extends ShuffleStage, which provides the infrastructure for repartitioning data by a key column across distributed workers.

Usage

Fuzzy Duplicate Identification is the sixth and final stage in the fuzzy deduplication pipeline, following Connected Component Analysis. It reads connected component Parquet files and produces a list of document IDs to remove.

from nemo_curator.stages.deduplication.fuzzy.identify_duplicates import IdentifyDuplicatesStage

stage = IdentifyDuplicatesStage(
    output_path="/output/duplicates_to_remove/",
    duplicate_group_field="_duplicate_group_id",
    document_id_field="_curator_dedup_id",
)

Theoretical Basis

Given duplicate groups identified by Connected Component Analysis, the problem reduces to selecting one representative per group. The theoretical basis involves:

  • Deterministic selection — The keep="first" strategy provides a deterministic, reproducible selection rule. Given the same input order, the same representative document is always retained.
  • Shuffle-based co-location — Shuffling by group ID ensures all group members are co-located on the same worker for efficient processing. Without this shuffle, within-group deduplication would require expensive cross-worker communication.
  • Minimal output — Rather than outputting all documents with a keep/remove flag, the stage outputs only the IDs of documents to remove. This minimizes output size, since typically only a fraction of documents are duplicates.
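The minimal-output design implies a simple anti-join downstream: filter the original dataset against the removal IDs. A minimal sketch, assuming an in-memory corpus and an already-loaded ID list (both hypothetical here):

```python
import pandas as pd

# Hypothetical corpus; in practice this is the original Parquet dataset.
corpus = pd.DataFrame({
    "_curator_dedup_id": [101, 102, 103, 104, 105],
    "text": ["doc a", "doc a copy", "doc a copy 2", "doc b", "doc b copy"],
})

# IDs emitted by IdentifyDuplicatesStage (loaded from its output path).
ids_to_remove = pd.Series([102, 103, 105])

# Anti-join: keep only documents not flagged for removal.
deduped = corpus[~corpus["_curator_dedup_id"].isin(ids_to_remove)]
print(deduped["_curator_dedup_id"].tolist())  # [101, 104]
```

Shipping only the removal IDs keeps this join cheap: the filter side is proportional to the duplicate count, not the corpus size.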

Selection strategies:

The current implementation uses a simple first-encountered strategy, which is:

  • Fast — No need to read document content or compute additional metrics.
  • Deterministic — Reproducible results given stable input ordering.
  • Content-agnostic — Does not consider document quality, length, or other features when choosing the representative.

Alternative strategies (not currently implemented) could include:

  • Longest document — Retain the document with the most content.
  • Highest quality score — Retain the document with the highest quality metric from a classifier.
  • Most recent — Retain the most recently crawled or published version.
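None of these alternatives is implemented in the stage, but the longest-document variant, for example, would replace duplicated(keep="first") with a per-group argmax. A hypothetical sketch using pandas:

```python
import pandas as pd

df = pd.DataFrame({
    "_curator_dedup_id": [101, 102, 103, 104, 105],
    "_duplicate_group_id": [7, 7, 7, 9, 9],
    "text": ["short", "the longest document", "medium doc", "b", "bb"],
})

# Longest-document strategy (not in NeMo Curator): per group, keep the
# row whose text has the most characters and flag all others.
lengths = df["text"].str.len()
keep_idx = lengths.groupby(df["_duplicate_group_id"]).idxmax()
ids_to_remove = df.loc[~df.index.isin(keep_idx), "_curator_dedup_id"]
print(ids_to_remove.tolist())  # [101, 103, 104]
```

Unlike keep="first", this requires reading document content, which is why the content-agnostic strategy is cheaper.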
