Principle:NVIDIA NeMo Curator Fuzzy Duplicate Identification

From Leeroopedia
Principle Metadata
Attribute Value
Domains Data_Curation, Deduplication
Implemented By NVIDIA_NeMo_Curator_Fuzzy_IdentifyDuplicatesStage
Last Updated 2026-02-14 17:00 GMT

Overview

Fuzzy Duplicate Identification is a technique for selecting which documents to remove from each duplicate cluster, retaining one representative per group.

Description

Within each connected component (duplicate group), Fuzzy Duplicate Identification selects one document to keep and marks the rest for removal. The current implementation uses a first-encountered strategy via cuDF's duplicated(keep="first") method, which deterministically retains the first document encountered in each group and flags all subsequent documents as duplicates.

The process works as follows:

  1. Shuffle by group ID — Documents are shuffled (repartitioned) so that all members of the same duplicate group are co-located on the same worker. This is essential for efficient within-group deduplication.
  2. Within-group selection — For each group, the first document (by _curator_dedup_id order) is retained and all others are marked as duplicates.
  3. Output — The stage outputs only the _curator_dedup_id values of documents that should be removed, enabling downstream stages to filter them from the original dataset.
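Steps 2 and 3 can be sketched with pandas, whose duplicated API matches the cuDF call used by the stage (cuDF runs the same operation on GPU). The sample data and the standalone DataFrame are illustrative; the real stage operates on shuffled partitions inside the pipeline.

```python
import pandas as pd  # cuDF exposes the same duplicated() API on GPU

# Documents already shuffled so each group is co-located (step 1).
df = pd.DataFrame({
    "_curator_dedup_id": [101, 102, 103, 104, 105],
    "_duplicate_group_id": [7, 7, 7, 9, 9],
})

# Step 2: within each group, the first row is retained; the rest are flagged.
is_duplicate = df.duplicated(subset="_duplicate_group_id", keep="first")

# Step 3: emit only the IDs of documents to remove.
ids_to_remove = df.loc[is_duplicate, "_curator_dedup_id"]
print(ids_to_remove.tolist())  # [102, 103, 105]
```

Because keep="first" depends only on row order, the same input ordering always yields the same removal list.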

This stage extends ShuffleStage, which provides the infrastructure for repartitioning data by a key column across distributed workers.

Usage

Fuzzy Duplicate Identification is the sixth and final stage in the fuzzy deduplication pipeline, following Connected Component Analysis. It reads connected component Parquet files and produces a list of document IDs to remove.

from nemo_curator.stages.deduplication.fuzzy.identify_duplicates import IdentifyDuplicatesStage

stage = IdentifyDuplicatesStage(
    output_path="/output/duplicates_to_remove/",
    duplicate_group_field="_duplicate_group_id",
    document_id_field="_curator_dedup_id",
)

Theoretical Basis

Given duplicate groups identified by Connected Component Analysis, the problem reduces to selecting one representative per group. The theoretical basis involves:

  • Deterministic selection — The keep="first" strategy provides a deterministic, reproducible selection rule. Given the same input order, the same representative document is always retained.
  • Shuffle-based co-location — Shuffling by group ID ensures all group members are co-located on the same worker for efficient processing. Without this shuffle, within-group deduplication would require expensive cross-worker communication.
  • Minimal output — Rather than outputting all documents with a keep/remove flag, the stage outputs only the IDs of documents to remove. This minimizes output size, since typically only a fraction of documents are duplicates.
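The minimal-output design implies a simple anti-join downstream: filter the original dataset against the removal IDs. A minimal sketch, assuming an in-memory corpus and an already-loaded ID list (both hypothetical here):

```python
import pandas as pd

# Hypothetical corpus; in practice this is the original Parquet dataset.
corpus = pd.DataFrame({
    "_curator_dedup_id": [101, 102, 103, 104, 105],
    "text": ["doc a", "doc a copy", "doc a copy 2", "doc b", "doc b copy"],
})

# IDs emitted by IdentifyDuplicatesStage (loaded from its output path).
ids_to_remove = pd.Series([102, 103, 105])

# Anti-join: keep only documents not flagged for removal.
deduped = corpus[~corpus["_curator_dedup_id"].isin(ids_to_remove)]
print(deduped["_curator_dedup_id"].tolist())  # [101, 104]
```

Shipping only the removal IDs keeps this join cheap: the filter side is proportional to the duplicate count, not the corpus size.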

Selection strategies:

The current implementation uses a simple first-encountered strategy, which is:

  • Fast — No need to read document content or compute additional metrics.
  • Deterministic — Reproducible results given stable input ordering.
  • Content-agnostic — Does not consider document quality, length, or other features when choosing the representative.

Alternative strategies (not currently implemented) could include:

  • Longest document — Retain the document with the most content.
  • Highest quality score — Retain the document with the highest quality metric from a classifier.
  • Most recent — Retain the most recently crawled or published version.
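None of these alternatives is implemented in the stage, but the longest-document variant, for example, would replace duplicated(keep="first") with a per-group argmax. A hypothetical sketch using pandas:

```python
import pandas as pd

df = pd.DataFrame({
    "_curator_dedup_id": [101, 102, 103, 104, 105],
    "_duplicate_group_id": [7, 7, 7, 9, 9],
    "text": ["short", "the longest document", "medium doc", "b", "bb"],
})

# Longest-document strategy (not in NeMo Curator): per group, keep the
# row whose text has the most characters and flag all others.
lengths = df["text"].str.len()
keep_idx = lengths.groupby(df["_duplicate_group_id"]).idxmax()
ids_to_remove = df.loc[~df.index.isin(keep_idx), "_curator_dedup_id"]
print(ids_to_remove.tolist())  # [101, 103, 104]
```

Unlike keep="first", this requires reading document content, which is why the content-agnostic strategy is cheaper.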
