Principle:NVIDIA NeMo Curator Image Deduplication

Metadata
Knowledge Sources	N/A
Domains	Data_Curation, Image_Processing, Deduplication
Last Updated	2026-02-14

Overview

Image Deduplication removes duplicate images from a curated dataset using pre-computed lists of duplicate IDs, so that the output contains only unique images.

Description

Image Deduplication in NeMo Curator takes Parquet files of pre-computed duplicate IDs, typically generated by an upstream semantic deduplication process, and removes the matching images from each batch. Rather than performing the computationally expensive duplicate detection inline, this stage only removes duplicates that have already been identified. The duplicate IDs are loaded from the Parquet files into a set, and each image in the batch is checked against that set using its unique image identifier. Images whose IDs appear in the set are dropped, producing a deduplicated batch.
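The removal step described above can be sketched in a few lines. This is a minimal illustration, not the NeMo Curator implementation: `ImageRecord`, `load_duplicate_ids`, `remove_duplicates`, and the `id` column name are hypothetical names chosen for the example, and the Parquet loader assumes pandas with a Parquet engine is installed.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterable


@dataclass
class ImageRecord:
    """Hypothetical stand-in for one image in a batch."""
    image_id: str
    path: str


def load_duplicate_ids(parquet_paths: Iterable[str], id_column: str = "id") -> set[str]:
    """Read one column from each Parquet file and merge the values into a set.

    The column name "id" is an assumption; use whatever the upstream step writes.
    """
    import pandas as pd  # imported lazily so the rest of the sketch runs without pandas

    duplicate_ids: set[str] = set()
    for path in parquet_paths:
        frame = pd.read_parquet(path, columns=[id_column])
        duplicate_ids.update(frame[id_column].astype(str))
    return duplicate_ids


def remove_duplicates(batch: list[ImageRecord], duplicate_ids: set[str]) -> list[ImageRecord]:
    """Keep only the images whose ID is not in the duplicate set."""
    return [image for image in batch if image.image_id not in duplicate_ids]


batch = [
    ImageRecord("img_001", "a.jpg"),
    ImageRecord("img_002", "b.jpg"),
    ImageRecord("img_003", "c.jpg"),
]
# In a real run this set would come from load_duplicate_ids([...]).
duplicate_ids = {"img_002"}

kept = remove_duplicates(batch, duplicate_ids)
print([image.image_id for image in kept])
```

Loading the IDs once and reusing the set for every batch keeps the per-image cost to a single hash lookup.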

Usage

Use Image Deduplication after semantic deduplication has produced duplicate ID lists in Parquet format. This stage is appropriate when duplicate detection and duplicate removal run as separate pipeline steps, which is common in large-scale image curation workflows where detection relies on specialized clustering or hashing algorithms. Apply this stage before the final export stage so that the output dataset contains only unique images.

Theoretical Basis

Image Deduplication operates on the principle of set membership filtering using pre-computed duplicate identifiers. The upstream semantic deduplication process (such as clustering in CLIP embedding space or perceptual hashing) produces lists of image IDs that have been identified as duplicates. The removal stage then performs a simple set membership test: for each image in a batch, it checks whether the image's unique identifier exists in the pre-loaded set of duplicate IDs. Separating detection from removal allows detection to use specialized algorithms (e.g., approximate nearest neighbor search, locality-sensitive hashing, or clustering), while removal remains a fast filtering operation. Storing the duplicate IDs in Parquet gives efficient columnar storage and fast loading of large ID lists.
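The separation of detection and removal can be illustrated with a toy pipeline. In this sketch the detection step is deliberately simplistic: it groups images by an exact SHA-256 hash of their bytes, standing in for the semantic methods named above, and it assumes one image per group is retained while the rest are reported as duplicates. All function names are hypothetical and the code is not drawn from NeMo Curator.

```python
from __future__ import annotations

import hashlib
from collections import defaultdict


def detect_duplicate_ids(images: dict[str, bytes]) -> set[str]:
    """Toy detection: group images by content hash and flag all but one per group.

    Exact hashing is a stand-in for semantic clustering; keeping the
    lexicographically smallest ID per group is an assumption of this sketch.
    """
    groups: dict[str, list[str]] = defaultdict(list)
    for image_id, payload in images.items():
        groups[hashlib.sha256(payload).hexdigest()].append(image_id)

    duplicates: set[str] = set()
    for ids in groups.values():
        duplicates.update(sorted(ids)[1:])  # keep the first ID, flag the rest
    return duplicates


def filter_batch(batch_ids: list[str], duplicate_ids: set[str]) -> list[str]:
    """Removal: a plain set membership test per image ID."""
    return [image_id for image_id in batch_ids if image_id not in duplicate_ids]


# Detection runs once over the whole dataset.
images = {"a": b"cat", "b": b"dog", "c": b"cat", "d": b"cat"}
duplicate_ids = detect_duplicate_ids(images)

# Removal runs independently on each batch, reusing the same set.
batches = [["a", "b"], ["c", "d"]]
deduplicated = [filter_batch(batch, duplicate_ids) for batch in batches]
print(duplicate_ids, deduplicated)
```

Because removal depends only on the ID set, the detection algorithm can be swapped without touching the filtering code.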

Related Pages

Implementation:NVIDIA_NeMo_Curator_ImageDuplicatesRemovalStage
