Implementation:NVIDIA NeMo Curator ImageDuplicatesRemovalStage

From Leeroopedia
Metadata
Knowledge Sources: N/A
Domains: Data_Curation, Image_Processing, Deduplication
Last Updated: 2026-02-14

Overview

ImageDuplicatesRemovalStage is a processing stage that removes duplicate images from batches by filtering out images whose IDs appear in pre-computed duplicate ID Parquet files.

Description

ImageDuplicatesRemovalStage is a dataclass-based processing stage that implements the ProcessingStage[ImageBatch, ImageBatch] interface. It loads pre-computed duplicate image IDs from Parquet files in a specified directory and removes from each input batch every image whose identifier appears in that duplicate set. The duplicate ID field name, the number of workers per node, and verbosity are configurable. The stage is designed to consume duplicate ID lists produced by upstream semantic deduplication processes such as embedding-based clustering or perceptual hashing.
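The load-then-filter behavior described above can be illustrated with a minimal sketch. This is not the library's implementation: ImageRecord, load_duplicate_ids, and drop_duplicates are hypothetical stand-ins, and reading the directory with pandas is an assumption about how the IDs might be loaded.

```python
from dataclasses import dataclass


@dataclass
class ImageRecord:
    """Hypothetical stand-in for an image object; only the ID matters here."""

    image_id: str


def load_duplicate_ids(removal_parquets_dir: str, duplicate_id_field: str = "id") -> set[str]:
    """Collect the IDs to remove from the Parquet files in a directory.

    Assumption: pandas with a Parquet engine such as pyarrow is installed.
    The import is local so the rest of the sketch runs without it.
    """
    import pandas as pd

    frame = pd.read_parquet(removal_parquets_dir, columns=[duplicate_id_field])
    return set(frame[duplicate_id_field].astype(str))


def drop_duplicates(images: list[ImageRecord], duplicate_ids: set[str]) -> list[ImageRecord]:
    """Keep only the images whose image_id is absent from the removal set."""
    return [image for image in images if image.image_id not in duplicate_ids]


# In-memory demonstration; the removal set is given directly instead of being
# read from disk.
batch = [ImageRecord("a"), ImageRecord("b"), ImageRecord("c")]
kept = drop_duplicates(batch, duplicate_ids={"b"})
print([image.image_id for image in kept])  # ['a', 'c']
```

Holding the IDs in a set keeps each membership check at constant time on average, which matters when the removal list is large relative to the batch size.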

Usage

Use ImageDuplicatesRemovalStage after duplicate detection has been performed and duplicate ID lists have been written to Parquet files. Set removal_parquets_dir to the directory containing the duplicate ID Parquet files. Configure duplicate_id_field if the ID column name in the Parquet files differs from the default "id".
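As a rough sketch of what the upstream step might hand over, the example below turns groups of duplicate image IDs into a removal list and shows how such a list could be written to Parquet. The grouping format, the keep-the-first-ID policy, and the helper names ids_to_remove and write_removal_parquet are all assumptions for illustration. The only requirement taken from this page is that the ID column name matches duplicate_id_field.

```python
def ids_to_remove(duplicate_groups: list[list[str]]) -> list[str]:
    """Keep the first ID of each duplicate group and mark the rest for removal.

    Assumption: the upstream step yields groups of mutually duplicate IDs.
    """
    removal: list[str] = []
    for group in duplicate_groups:
        removal.extend(group[1:])
    return removal


def write_removal_parquet(ids: list[str], path: str, id_field: str = "id") -> None:
    """Write the removal list as a single-column Parquet file.

    Assumption: pandas with a Parquet engine such as pyarrow is installed.
    The column name must match the stage's duplicate_id_field.
    """
    import pandas as pd

    pd.DataFrame({id_field: ids}).to_parquet(path, index=False)


# Hypothetical duplicate groups; the IDs are illustrative only.
groups = [["img_001", "img_007"], ["img_003", "img_004", "img_009"]]
removal_ids = ids_to_remove(groups)
print(removal_ids)  # ['img_007', 'img_004', 'img_009']
```

Calling write_removal_parquet(removal_ids, path) with a file path inside the directory passed as removal_parquets_dir would then produce a file with the default "id" column.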

Code Reference

Source Location

nemo_curator/stages/image/deduplication/removal.py, lines 27-104.

Signature

@dataclass
class ImageDuplicatesRemovalStage(ProcessingStage[ImageBatch, ImageBatch]):
    removal_parquets_dir: str
    duplicate_id_field: str = "id"
    verbose: bool = False
    num_workers_per_node: int | None = None
    name: str = "image_dedup_filter"

Import

from nemo_curator.stages.image.deduplication.removal import ImageDuplicatesRemovalStage

I/O Contract

Direction | Type | Description
Input | ImageBatch | An ImageBatch containing ImageObject instances with image_id fields.
Output | ImageBatch | A filtered ImageBatch containing only images whose image_id does not appear in the pre-computed removal set.

Usage Examples

from nemo_curator.stages.image.deduplication.removal import ImageDuplicatesRemovalStage

# Create deduplication removal stage
dedup_removal = ImageDuplicatesRemovalStage(
    removal_parquets_dir="/path/to/duplicate_ids/",
    duplicate_id_field="id",
    verbose=True,
)

# Create deduplication removal stage with custom settings
dedup_removal_custom = ImageDuplicatesRemovalStage(
    removal_parquets_dir="/data/dedup_results/duplicates/",
    duplicate_id_field="image_hash",
    num_workers_per_node=4,
    verbose=False,
    name="custom_dedup_filter",
)
