Implementation:NVIDIA NeMo Curator ImageDuplicatesRemovalStage
| Metadata | |
|---|---|
| Knowledge Sources | N/A |
| Domains | Data_Curation, Image_Processing, Deduplication |
| Last Updated | 2026-02-14 |
Overview
ImageDuplicatesRemovalStage is a processing stage that removes duplicate images from batches by filtering out images whose IDs appear in pre-computed duplicate ID Parquet files.
Description
ImageDuplicatesRemovalStage is a dataclass-based processing stage that implements the ProcessingStage[ImageBatch, ImageBatch] interface. It loads pre-computed duplicate image IDs from Parquet files in a specified directory and drops every image in the input batch whose identifier appears in that duplicate set. The ID column name (duplicate_id_field), the number of workers per node (num_workers_per_node), and verbosity (verbose) are configurable. The stage is designed to consume duplicate ID lists produced by upstream deduplication processes, such as embedding-based semantic clustering or perceptual hashing.
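The filtering step described above can be sketched in plain Python. This is a minimal illustration, not the stage's actual implementation: `SimpleImage` and `filter_duplicates` are hypothetical stand-ins for ImageObject and the stage's processing logic, and the duplicate IDs are passed in as an in-memory set rather than read from Parquet files.

```python
from dataclasses import dataclass


@dataclass
class SimpleImage:
    """Hypothetical stand-in for ImageObject; only image_id matters here."""
    image_id: str


def filter_duplicates(images: list[SimpleImage], duplicate_ids: set[str]) -> list[SimpleImage]:
    """Keep only the images whose image_id is not in the duplicate set."""
    return [img for img in images if img.image_id not in duplicate_ids]


# Assumed example data: in the real stage the IDs would come from the
# duplicate_id_field column of the Parquet files in removal_parquets_dir.
batch = [SimpleImage("img_001"), SimpleImage("img_002"), SimpleImage("img_003")]
duplicates = {"img_002"}

kept = filter_duplicates(batch, duplicates)
kept_ids = [img.image_id for img in kept]
```

Holding the duplicate IDs in a set keeps each membership check constant-time on average, so the filter scales linearly with batch size.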
Usage
Use ImageDuplicatesRemovalStage after duplicate detection has been performed and duplicate ID lists have been written to Parquet files. Set removal_parquets_dir to the directory containing the duplicate ID Parquet files. Configure duplicate_id_field if the ID column name in the Parquet files differs from the default "id".
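Before constructing the stage, it can help to confirm that the directory actually holds Parquet files. The helper below is a hypothetical pre-flight check and is not part of NeMo Curator; it assumes the duplicate ID files use a `.parquet` extension and sit directly inside the directory.

```python
from pathlib import Path


def find_removal_parquets(removal_parquets_dir: str) -> list[Path]:
    """Return the .parquet files directly inside the directory, sorted by name.

    Hypothetical helper: raises FileNotFoundError if the directory is
    missing or contains no .parquet files.
    """
    directory = Path(removal_parquets_dir)
    if not directory.is_dir():
        raise FileNotFoundError(f"Not a directory: {directory}")
    files = sorted(directory.glob("*.parquet"))
    if not files:
        raise FileNotFoundError(f"No .parquet files found in {directory}")
    return files
```

Calling `find_removal_parquets("/path/to/duplicate_ids/")` before building the stage would surface a wrong path early instead of at processing time.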
Code Reference
Source Location
nemo_curator/stages/image/deduplication/removal.py, lines 27-104.
Signature
@dataclass
class ImageDuplicatesRemovalStage(ProcessingStage[ImageBatch, ImageBatch]):
    removal_parquets_dir: str
    duplicate_id_field: str = "id"
    verbose: bool = False
    num_workers_per_node: int | None = None
    name: str = "image_dedup_filter"
Import
from nemo_curator.stages.image.deduplication.removal import ImageDuplicatesRemovalStage
I/O Contract
| Direction | Type | Description |
|---|---|---|
| Input | ImageBatch | An ImageBatch containing ImageObject instances with image_id fields. |
| Output | ImageBatch | A filtered ImageBatch containing only images whose image_id does not appear in the pre-computed removal set. |
Usage Examples
from nemo_curator.stages.image.deduplication.removal import ImageDuplicatesRemovalStage

# Create a deduplication removal stage with the default ID column
dedup_removal = ImageDuplicatesRemovalStage(
    removal_parquets_dir="/path/to/duplicate_ids/",
    duplicate_id_field="id",
    verbose=True,
)

# Create a deduplication removal stage with custom settings
dedup_removal_custom = ImageDuplicatesRemovalStage(
    removal_parquets_dir="/data/dedup_results/duplicates/",
    duplicate_id_field="image_hash",
    num_workers_per_node=4,
    verbose=False,
    name="custom_dedup_filter",
)