Workflow:NVIDIA NeMo Curator Image Curation Pipeline

Knowledge Sources	NeMo Curator, NeMo Curator Docs
Domains	Data_Engineering, Computer_Vision, Generative_AI
Last Updated	2026-02-14 17:00 GMT

Overview

End-to-end process for curating high-quality image datasets from tar archives for training generative AI models, vision-language models, and multimodal foundation models using NeMo Curator's GPU-accelerated pipeline.

Description

This workflow outlines the standard procedure for processing large-scale image datasets packaged as tar archives through quality filtering, content filtering, embedding generation, semantic deduplication, and export. The pipeline leverages NVIDIA DALI for GPU-accelerated image decoding, CLIP ViT-L/14 for embedding generation and quality scoring, and semantic clustering for deduplication. Each stage operates on ImageBatch task objects that carry both image tensors and associated metadata. The pipeline supports distributed execution across multiple GPUs and nodes using Ray.
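The stage-by-stage flow described above can be sketched in plain Python. This is a minimal illustration only: the `ImageBatch` dataclass here is a hypothetical stand-in with image ids and a metadata dict, and `run_pipeline`, `add_score`, and `keep_high_scores` are invented for the example. None of this is NeMo Curator's actual API, and no GPU or Ray execution is modeled.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ImageBatch:
    """Hypothetical stand-in for a batch task: image ids plus per-image metadata."""
    image_ids: list
    metadata: dict = field(default_factory=dict)


# A stage is any callable that takes a batch and returns a (possibly smaller) batch.
Stage = Callable[[ImageBatch], ImageBatch]


def run_pipeline(batch: ImageBatch, stages: list) -> ImageBatch:
    """Apply each stage in order; each stage receives the previous stage's output."""
    for stage in stages:
        batch = stage(batch)
    return batch


def add_score(batch: ImageBatch) -> ImageBatch:
    # Annotating stage: attaches an illustrative score (the id length) to every image.
    for image_id in batch.image_ids:
        batch.metadata.setdefault(image_id, {})["score"] = len(image_id)
    return batch


def keep_high_scores(batch: ImageBatch) -> ImageBatch:
    # Filtering stage: keeps only images whose score is at least 3.
    kept = [i for i in batch.image_ids if batch.metadata[i]["score"] >= 3]
    return ImageBatch(kept, {i: batch.metadata[i] for i in kept})


result = run_pipeline(ImageBatch(["a", "abc", "abcd"]), [add_score, keep_high_scores])
print(result.image_ids)
```

The point of the design is that annotating stages (embedding, scoring) and filtering stages (aesthetic, NSFW, deduplication) share one interface, so they can be reordered or swapped without touching the driver loop.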

Usage

Execute this workflow when you have a collection of images (typically in tar archive format with associated metadata) and need to produce a filtered, deduplicated, high-quality image dataset for training text-to-image models, vision-language models, or multimodal foundation models.

Execution Steps

Step 1: Image Ingestion

Load images from tar archives into the pipeline using DALI-based GPU-accelerated image decoding. The FilePartitioningStage discovers and groups tar archive files into balanced partitions for parallel processing. The ImageReaderStage uses NVIDIA DALI to decode JPEG images from tar archives directly on the GPU, producing ImageBatch tasks containing decoded image tensors and associated metadata from accompanying JSON sidecar files.

Key considerations:

Input must be provided as tar archives containing JPEG images
DALI provides hardware-accelerated JPEG decoding on NVIDIA GPUs
JSON sidecar files within tar archives provide per-image metadata
File partitioning balances archive sizes across GPU workers
Image resolution and format are preserved during decoding
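To illustrate what balancing archive sizes across workers can look like, here is a minimal sketch of a greedy largest-first assignment. It is an assumption for illustration, not the algorithm FilePartitioningStage necessarily uses; `partition_by_size` and the shard names and sizes are hypothetical.

```python
import heapq


def partition_by_size(file_sizes, num_partitions):
    """Greedy largest-first assignment: each file goes to the currently lightest partition."""
    # Min-heap of (total bytes assigned so far, partition index).
    heap = [(0, idx) for idx in range(num_partitions)]
    heapq.heapify(heap)
    partitions = [[] for _ in range(num_partitions)]
    # Visit files from largest to smallest; ties are broken by name for determinism.
    for name, size in sorted(file_sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        total, idx = heapq.heappop(heap)
        partitions[idx].append(name)
        heapq.heappush(heap, (total + size, idx))
    return partitions


# Hypothetical archive names and sizes in bytes.
sizes = {
    "shard-0000.tar": 900,
    "shard-0001.tar": 500,
    "shard-0002.tar": 400,
    "shard-0003.tar": 100,
}
parts = partition_by_size(sizes, 2)
print(parts)
```

With these toy sizes the two partitions end up holding 1000 and 900 bytes, which is closer to even than assigning the archives in name order.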

Step 2: CLIP Embedding Generation

Generate dense vector embeddings for each image using the CLIP ViT-L/14 model. The ClipEmbedderStage processes decoded image tensors through the CLIP image encoder to produce fixed-dimensional embedding vectors. These embeddings are used by subsequent filtering, scoring, and deduplication stages. The embeddings capture semantic content and visual characteristics of each image.

Key considerations:

CLIP ViT-L/14 model requires GPU memory for inference
Embeddings are stored as part of the ImageBatch metadata
Batched inference amortizes per-call GPU overhead across multiple images
Embedding dimensionality is fixed by the CLIP model architecture (768-dimensional)
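Later stages compare and score these fixed-size vectors. As a minimal sketch of how such vectors are commonly compared, the snippet below L2-normalizes them and computes cosine similarity. The vectors are placeholders rather than real CLIP output, and whether the pipeline normalizes embeddings internally is not stated in this workflow, so treat that as an assumption.

```python
import math

EMBEDDING_DIM = 768  # dimensionality stated above for CLIP ViT-L/14


def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def cosine_similarity(a, b):
    """Cosine similarity, computed as the dot product of the unit-length vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


# Placeholder vectors standing in for encoder output (not real embeddings).
first = [1.0] * EMBEDDING_DIM
second = [2.0] * EMBEDDING_DIM  # same direction as `first`, different magnitude
third = [1.0 if i % 2 == 0 else -1.0 for i in range(EMBEDDING_DIM)]  # orthogonal to `first`

print(round(cosine_similarity(first, second), 6))
print(round(abs(cosine_similarity(first, third)), 6))
```

Cosine similarity ignores vector magnitude, which is why `first` and `second` count as identical in direction even though their raw values differ.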

Step 3: Aesthetic Quality Filtering

Score and filter images based on aesthetic quality using a trained MLP classifier on top of CLIP embeddings. The AestheticFilterStage applies a multi-layer perceptron that maps CLIP embeddings to aesthetic quality scores. Images scoring below a configurable threshold are removed from the dataset. This step removes low-quality, blurry, poorly composed, or visually unappealing images.

Key considerations:

The aesthetic scorer MLP operates on pre-computed CLIP embeddings
Quality threshold is configurable to balance dataset quality vs. size
The model is based on the LAION aesthetic predictor architecture
Scores typically range from 1 (low quality) to 10 (high quality)
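The score-then-threshold logic can be sketched as follows. This is a toy example under stated assumptions: the two-layer MLP, its hand-picked weights, and the 3-dimensional "embeddings" are all made up for illustration. A real aesthetic predictor uses trained weights over full CLIP embeddings, and `aesthetic_score` and `filter_by_score` are hypothetical names.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """Fully connected layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]


def aesthetic_score(embedding, params):
    """Two-layer MLP: hidden layer with ReLU, then a single scalar output."""
    hidden = relu(dense(embedding, params["w1"], params["b1"]))
    return dense(hidden, params["w2"], params["b2"])[0]


def filter_by_score(embeddings, params, threshold):
    """Score every image and keep those at or above the threshold."""
    scores = {key: aesthetic_score(vec, params) for key, vec in embeddings.items()}
    kept = [key for key, score in scores.items() if score >= threshold]
    return kept, scores


# Hand-picked illustrative weights (NOT trained) over 3-dimensional toy embeddings.
toy_params = {
    "w1": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    "b1": [0.0, 0.0],
    "w2": [[4.0, 2.0]],
    "b2": [1.0],
}
toy_embeddings = {"sharp": [1.0, 1.0, 0.0], "blurry": [0.25, 0.0, 1.0]}

kept, scores = filter_by_score(toy_embeddings, toy_params, threshold=5.0)
print(kept, scores)
```

Raising the threshold shrinks the kept set and lowering it grows the set, which is the quality-versus-size trade-off noted above.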

Step 4: NSFW Content Filtering

Detect and remove not-safe-for-work (NSFW) images using a trained classifier on CLIP embeddings. The NSFWFilterStage applies an NSFW detection model to classify images and removes those flagged as inappropriate content. This ensures the curated dataset is suitable for safe model training and deployment.

Key considerations:

NSFW classifier operates on pre-computed CLIP embeddings
Classification threshold is configurable for different safety requirements
The model produces a probability score for NSFW content
Both the aesthetic and NSFW filters share the same CLIP embeddings, avoiding redundant computation
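A minimal sketch of probability-based filtering over embeddings that were computed once and stored on each record. The single logistic layer is a stand-in assumption for whatever classifier the stage actually uses, and the weights, record layout, and function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def nsfw_probability(embedding, weights, bias):
    """Logistic score in [0, 1]; a stand-in for a trained NSFW classifier."""
    logit = sum(w * x for w, x in zip(weights, embedding)) + bias
    return sigmoid(logit)


def filter_nsfw(records, weights, bias, threshold):
    """Annotate each record with its score and keep those below the threshold."""
    kept = []
    for record in records:
        # The embedding is read from the record, not recomputed, so any other
        # embedding-based filter can reuse the same vector.
        probability = nsfw_probability(record["embedding"], weights, bias)
        record["nsfw_score"] = probability
        if probability < threshold:
            kept.append(record)
    return kept


# Toy 2-dimensional embeddings and illustrative (untrained) weights.
records = [
    {"id": "safe", "embedding": [0.0, 1.0]},
    {"id": "flagged", "embedding": [1.0, 0.0]},
]
kept = filter_nsfw(records, weights=[2.0, -2.0], bias=0.0, threshold=0.5)
print([r["id"] for r in kept])
```

A lower threshold makes the filter stricter, removing more borderline images at the cost of more false positives.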

Step 5: Semantic Deduplication

Remove near-duplicate images using embedding-based semantic deduplication. The DuplicateRemovalStage uses the pre-computed CLIP embeddings to identify and remove semantically similar images. This process clusters embeddings and computes pairwise similarity within clusters to find near-duplicates above a configurable similarity threshold.

Key considerations:

Deduplication operates on CLIP embeddings computed in Step 2
Similarity threshold controls the aggressiveness of duplicate removal
Near-duplicates are grouped within each cluster, and one representative per group is retained
Deduplication can substantially reduce dataset size while preserving visual diversity
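The within-cluster comparison can be sketched as below. This is a simplified greedy variant chosen for brevity, not necessarily how DuplicateRemovalStage selects representatives. The cluster assignments are assumed to be precomputed (for example by a clustering step over the embeddings), and all names and vectors are hypothetical.

```python
import math
from collections import defaultdict


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def deduplicate(embeddings, cluster_ids, threshold):
    """Return the ids to keep.

    Within each cluster, an image is dropped when its cosine similarity to an
    already kept image in the same cluster exceeds the threshold.
    """
    clusters = defaultdict(list)
    for image_id in embeddings:
        clusters[cluster_ids[image_id]].append(image_id)

    kept = []
    for members in clusters.values():
        representatives = []
        for image_id in members:
            vector = embeddings[image_id]
            if all(cosine(vector, embeddings[r]) <= threshold for r in representatives):
                representatives.append(image_id)
        kept.extend(representatives)
    return kept


# Toy 2-dimensional embeddings with precomputed (assumed) cluster assignments.
embeddings = {
    "cat_a": [1.0, 0.0],
    "cat_b": [0.99, 0.1],  # nearly identical direction to cat_a
    "dog": [0.0, 1.0],
    "car": [-1.0, 0.0],
}
cluster_ids = {"cat_a": 0, "cat_b": 0, "dog": 0, "car": 1}

kept = deduplicate(embeddings, cluster_ids, threshold=0.95)
print(kept)
```

Restricting comparisons to images in the same cluster avoids an all-pairs comparison over the whole dataset, which is what makes the approach practical at scale.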

Step 6: Export

Write the curated image dataset to output tar archives with associated Parquet metadata. The ImageWriterStage packages filtered images back into tar archives with configurable shard sizes and writes metadata (embeddings, quality scores, filter flags) to Parquet files. The output format is compatible with WebDataset-style training pipelines.

Key considerations:

Output tar archives contain filtered images and updated metadata
Parquet sidecar files store per-image metadata including embeddings and scores
Shard sizes are configurable to optimize for downstream data loading
Both local and cloud storage (via fsspec) are supported as output destinations
An image-to-document batch conversion stage enables interoperability with text pipelines
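A minimal sketch of sharded tar output with a configurable shard size, using only the Python standard library. The `write_shards` function, the record layout, and the shard naming scheme are assumptions for illustration, not ImageWriterStage's interface. Parquet output is omitted because it needs a third-party library, so per-image metadata is written here as a JSON member next to each image.

```python
import io
import json
import tarfile
import tempfile
from pathlib import Path


def write_shards(records, output_dir, images_per_shard):
    """Write records into tar shards of at most `images_per_shard` images each."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    shard_paths = []
    for start in range(0, len(records), images_per_shard):
        shard_path = output_dir / f"shard-{len(shard_paths):05d}.tar"
        with tarfile.open(shard_path, "w") as archive:
            for record in records[start:start + images_per_shard]:
                # WebDataset-style layout: files sharing a key form one sample.
                members = {
                    f"{record['key']}.jpg": record["image_bytes"],
                    f"{record['key']}.json": json.dumps(record["metadata"]).encode("utf-8"),
                }
                for name, payload in members.items():
                    info = tarfile.TarInfo(name=name)
                    info.size = len(payload)
                    archive.addfile(info, io.BytesIO(payload))
        shard_paths.append(shard_path)
    return shard_paths


# Placeholder bytes stand in for encoded JPEG data.
records = [
    {"key": f"{i:06d}", "image_bytes": b"\xff\xd8fake-jpeg", "metadata": {"aesthetic_score": 5.0 + i}}
    for i in range(5)
]

with tempfile.TemporaryDirectory() as tmp:
    shards = write_shards(records, tmp, images_per_shard=2)
    shard_names = [path.name for path in shards]
    with tarfile.open(shards[0]) as archive:
        first_members = archive.getnames()

print(shard_names, first_members)
```

Smaller shards give finer-grained parallelism and shuffling for downstream loaders, while larger shards reduce per-file overhead on object storage.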

GitHub URL

Workflow Repository