Workflow:NVIDIA NeMo Curator Image Curation Pipeline
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Computer_Vision, Generative_AI |
| Last Updated | 2026-02-14 17:00 GMT |
Overview
End-to-end process for curating high-quality image datasets from tar archives for training generative AI models, vision-language models, and multimodal foundation models using NeMo Curator's GPU-accelerated pipeline.
Description
This workflow describes the standard procedure for processing large-scale image datasets packaged as tar archives through embedding generation, quality filtering, content filtering, semantic deduplication, and export. The pipeline uses NVIDIA DALI for GPU-accelerated image decoding, CLIP ViT-L/14 for embedding generation, lightweight classifiers on top of those embeddings for aesthetic and NSFW scoring, and semantic clustering for deduplication. Each stage operates on ImageBatch task objects that carry both image tensors and associated metadata. The pipeline supports distributed execution across multiple GPUs and nodes using Ray.
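The stage-chaining idea can be illustrated with a minimal sketch. It does not use the NeMo Curator API: `ToyImageRecord`, `ToyImageBatch`, `score_by_width`, `drop_below`, and `run_pipeline` are hypothetical names invented here, and the width-based score is only a placeholder for a real quality signal.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ToyImageRecord:
    # Hypothetical stand-in for one image plus its metadata.
    image_id: str
    metadata: Dict[str, float] = field(default_factory=dict)


@dataclass
class ToyImageBatch:
    # Hypothetical stand-in for a task object passed between stages.
    records: List[ToyImageRecord]


Stage = Callable[[ToyImageBatch], ToyImageBatch]


def score_by_width(batch: ToyImageBatch) -> ToyImageBatch:
    # Placeholder scoring stage: annotate each record, drop nothing.
    for rec in batch.records:
        rec.metadata["score"] = rec.metadata["width"] / 1000.0
    return batch


def drop_below(min_score: float) -> Stage:
    # Build a filtering stage that keeps records at or above min_score.
    def stage(batch: ToyImageBatch) -> ToyImageBatch:
        kept = [r for r in batch.records if r.metadata["score"] >= min_score]
        return ToyImageBatch(records=kept)

    return stage


def run_pipeline(batch: ToyImageBatch, stages: List[Stage]) -> ToyImageBatch:
    # Each stage consumes the batch produced by the previous one.
    for stage in stages:
        batch = stage(batch)
    return batch


batch = ToyImageBatch(
    [
        ToyImageRecord("a", {"width": 1024}),
        ToyImageRecord("b", {"width": 256}),
        ToyImageRecord("c", {"width": 2048}),
    ]
)
result = run_pipeline(batch, [score_by_width, drop_below(0.5)])
kept_ids = [r.image_id for r in result.records]
```

Separating scoring stages (which annotate) from filtering stages (which drop records) mirrors how later steps reuse metadata computed earlier.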
Usage
Execute this workflow when you have a collection of images (typically in tar archive format with associated metadata) and need to produce a filtered, deduplicated, high-quality image dataset for training text-to-image models, vision-language models, or multimodal foundation models.
Execution Steps
Step 1: Image Ingestion
Load images from tar archives into the pipeline using DALI-based GPU-accelerated image decoding. The FilePartitioningStage discovers and groups tar archive files into balanced partitions for parallel processing. The ImageReaderStage uses NVIDIA DALI to decode JPEG images from tar archives directly on the GPU, producing ImageBatch tasks containing decoded image tensors and associated metadata from accompanying JSON sidecar files.
Key considerations:
- Input must be tar archives containing JPEG images
- DALI provides hardware-accelerated JPEG decoding on NVIDIA GPUs
- JSON sidecar files within tar archives provide per-image metadata
- File partitioning balances archive sizes across GPU workers
- Decoding preserves each image's resolution; the JPEG bytes become pixel tensors
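One common way to balance archive sizes across workers is a greedy "largest first" assignment. The sketch below assumes that approach for illustration only; it is not taken from FilePartitioningStage, and the function name `partition_by_size`, the file names, and the sizes are hypothetical.

```python
import heapq


def partition_by_size(file_sizes, num_partitions):
    """Greedy balancing: give the largest remaining file to the lightest partition."""
    # Min-heap of (total_bytes_assigned, partition_index).
    heap = [(0, idx) for idx in range(num_partitions)]
    heapq.heapify(heap)
    partitions = [[] for _ in range(num_partitions)]
    # Sort by size descending, then by name for deterministic output.
    for path, size in sorted(file_sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        total, idx = heapq.heappop(heap)
        partitions[idx].append(path)
        heapq.heappush(heap, (total + size, idx))
    return partitions


# Hypothetical archive sizes in megabytes.
sizes = {
    "shard-0000.tar": 900,
    "shard-0001.tar": 500,
    "shard-0002.tar": 400,
    "shard-0003.tar": 300,
    "shard-0004.tar": 100,
}
parts = partition_by_size(sizes, 2)
totals = [sum(sizes[p] for p in part) for part in parts]
```

With these sizes the two partitions end up holding 1200 MB and 1000 MB, which is closer than a naive round-robin split by file name would give.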
Step 2: CLIP Embedding Generation
Generate dense vector embeddings for each image using the CLIP ViT-L/14 model. The ClipEmbedderStage processes decoded image tensors through the CLIP image encoder to produce fixed-dimensional embedding vectors. These embeddings are used by subsequent filtering, scoring, and deduplication stages. The embeddings capture semantic content and visual characteristics of each image.
Key considerations:
- CLIP ViT-L/14 model requires GPU memory for inference
- Embeddings are stored as part of the ImageBatch metadata
- Batch processing amortizes per-call inference overhead (kernel launches, host-to-device transfers) across multiple images
- Embedding dimensionality is fixed by the CLIP model architecture (768-dimensional)
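The batching pattern can be sketched without a GPU. In the example below, `fake_clip_encode` is a hypothetical stand-in that returns deterministic pseudo-random 768-dimensional vectors instead of running CLIP, and the L2 normalization step is an assumption (it is customary before cosine-similarity comparisons, but the source does not say whether ClipEmbedderStage applies it).

```python
import math
import random

EMBED_DIM = 768  # output width of the CLIP ViT-L/14 image encoder


def fake_clip_encode(batch):
    """Hypothetical stand-in for a CLIP forward pass: one vector per image id."""
    vectors = []
    for image_id in batch:
        rng = random.Random(image_id)  # deterministic per image id
        vectors.append([rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)])
    return vectors


def l2_normalize(vec):
    # Scale to unit length so dot products equal cosine similarity.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def embed_images(image_ids, batch_size, encode=fake_clip_encode):
    """Run the encoder once per batch and collect normalized embeddings."""
    embeddings = {}
    calls = 0
    for start in range(0, len(image_ids), batch_size):
        batch = image_ids[start:start + batch_size]
        calls += 1
        for image_id, vec in zip(batch, encode(batch)):
            embeddings[image_id] = l2_normalize(vec)
    return embeddings, calls


image_ids = [f"img_{i:04d}" for i in range(10)]
embeddings, num_calls = embed_images(image_ids, batch_size=4)
```

Ten images with a batch size of 4 result in three encoder calls rather than ten, which is the overhead saving that batching provides.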
Step 3: Aesthetic Quality Filtering
Score and filter images based on aesthetic quality using a trained MLP on top of CLIP embeddings. The AestheticFilterStage applies a multi-layer perceptron that maps each CLIP embedding to a scalar aesthetic quality score. Images scoring below a configurable threshold are removed from the dataset. This step removes low-quality, blurry, poorly composed, or visually unappealing images.
Key considerations:
- The aesthetic scorer MLP operates on pre-computed CLIP embeddings
- Quality threshold is configurable to balance dataset quality vs. size
- The model is based on the LAION aesthetic predictor architecture
- Scores typically range from 1 (low quality) to 10 (high quality)
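The score-then-threshold logic can be sketched with a toy network. The two-layer MLP with ReLU below is an illustrative assumption, not the LAION predictor's actual layer layout, and the 3-dimensional embeddings and hand-picked weights are hypothetical values chosen so the arithmetic is easy to follow.

```python
def relu(x):
    return max(0.0, x)


def mlp_score(embedding, w1, b1, w2, b2):
    """Toy two-layer MLP: hidden = relu(W1 @ x + b1), score = w2 . hidden + b2."""
    hidden = [
        relu(sum(w * x for w, x in zip(row, embedding)) + bias)
        for row, bias in zip(w1, b1)
    ]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


def filter_by_aesthetic(embeddings, threshold, **weights):
    # Keep images whose score is at or above the threshold.
    scores = {k: mlp_score(v, **weights) for k, v in embeddings.items()}
    kept = [k for k in sorted(scores) if scores[k] >= threshold]
    return kept, scores


# Hypothetical toy weights and 3-dimensional "embeddings".
weights = {
    "w1": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    "b1": [0.0, 0.0],
    "w2": [4.0, 2.0],
    "b2": 1.0,
}
toy_embeddings = {
    "a": [1.0, 0.5, 0.0],
    "b": [0.2, 0.1, 0.9],
    "c": [-1.0, 2.0, 0.0],
}
kept, scores = filter_by_aesthetic(toy_embeddings, threshold=5.0, **weights)
```

Raising the threshold trades dataset size for quality: here a threshold of 5.0 keeps `a` and `c` and drops `b`.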
Step 4: NSFW Content Filtering
Detect and remove not-safe-for-work (NSFW) images using a trained classifier on CLIP embeddings. The NSFWFilterStage applies an NSFW detection model to classify images and removes those flagged as inappropriate content. This helps make the curated dataset suitable for safe model training and deployment.
Key considerations:
- NSFW classifier operates on pre-computed CLIP embeddings
- Classification threshold is configurable for different safety requirements
- The model produces a probability score for NSFW content
- Both the aesthetic and NSFW filters share the same CLIP embeddings, avoiding redundant computation
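A probability-based filter can be sketched as follows. The linear head with a sigmoid is a simplifying assumption rather than the actual NSFW model, and the weights, bias, 2-dimensional embeddings, and the name `filter_nsfw` are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def nsfw_probability(embedding, weights, bias):
    """Toy linear head followed by a sigmoid, giving a value in (0, 1)."""
    logit = sum(w * x for w, x in zip(weights, embedding)) + bias
    return sigmoid(logit)


def filter_nsfw(embeddings, weights, bias, threshold):
    # Remove images whose NSFW probability reaches the threshold.
    probs = {k: nsfw_probability(v, weights, bias) for k, v in embeddings.items()}
    kept = [k for k in sorted(probs) if probs[k] < threshold]
    return kept, probs


# Hypothetical head parameters and precomputed embeddings.
head_weights, head_bias = [2.0, -1.0], -1.0
shared_embeddings = {
    "x": [2.0, 0.0],  # logit 3.0
    "y": [0.0, 1.0],  # logit -2.0
    "z": [1.0, 0.0],  # logit 1.0
}
kept, probs = filter_nsfw(shared_embeddings, head_weights, head_bias, threshold=0.5)
```

A lower threshold removes more borderline images at the cost of more false positives; the same embedding dictionary could be passed to the aesthetic scorer without recomputation.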
Step 5: Semantic Deduplication
Remove near-duplicate images using embedding-based semantic deduplication. The DuplicateRemovalStage uses the pre-computed CLIP embeddings to identify and remove semantically similar images. This process clusters embeddings and computes pairwise similarity within clusters to find near-duplicates above a configurable similarity threshold.
Key considerations:
- Deduplication operates on CLIP embeddings computed in Step 2
- Similarity threshold controls the aggressiveness of duplicate removal
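- A connected-components algorithm groups duplicates and retains one representative per group
- Deduplication can substantially reduce dataset size while preserving visual diversity; the reduction depends on the threshold and on how redundant the source data is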
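The within-cluster step can be sketched with pairwise cosine similarity and a union-find structure for connected components. This is a sketch for a single cluster under stated assumptions: the function names, the 2-dimensional embeddings, and the rule of keeping the lexicographically smallest id as representative are all hypothetical choices.

```python
import math
from itertools import combinations


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def find(parent, x):
    # Follow parent links to the root, compressing the path on the way.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def deduplicate(cluster, threshold):
    """Return one representative id per connected group of near-duplicates."""
    ids = sorted(cluster)
    parent = {i: i for i in ids}
    for a, b in combinations(ids, 2):
        if cosine(cluster[a], cluster[b]) >= threshold:
            root_a, root_b = find(parent, a), find(parent, b)
            if root_a != root_b:
                # Smaller id becomes the root, so it ends up as representative.
                parent[max(root_a, root_b)] = min(root_a, root_b)
    return sorted({find(parent, i) for i in ids})


# Hypothetical embeddings belonging to one cluster.
cluster = {
    "cat_1": [1.0, 0.0],
    "cat_2": [0.99, 0.05],
    "dog_1": [0.0, 1.0],
    "dog_2": [0.02, 0.98],
    "car_1": [0.7, 0.7],
}
representatives = deduplicate(cluster, threshold=0.95)
```

Restricting the pairwise comparison to one cluster at a time keeps the cost quadratic in cluster size rather than in dataset size, which is why clustering precedes the similarity computation.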
Step 6: Export
Write the curated image dataset to output tar archives with associated Parquet metadata. The ImageWriterStage packages filtered images back into tar archives with configurable shard sizes and writes metadata (embeddings, quality scores, filter flags) to Parquet files. The output format is compatible with WebDataset-style training pipelines.
Key considerations:
- Output tar archives contain filtered images and updated metadata
- Parquet sidecar files store per-image metadata including embeddings and scores
- Shard sizes are configurable to optimize for downstream data loading
- Both local and cloud storage (via fsspec) are supported as output destinations
- An image-to-document batch conversion stage enables interoperability with text pipelines
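Sharded tar output can be sketched with the Python standard library. This is not ImageWriterStage itself: `write_shards`, the shard naming pattern, and the fake JPEG payloads are hypothetical, archives are built in memory rather than on disk, and the Parquet metadata is omitted because it would need a third-party library.

```python
import io
import tarfile


def write_shards(samples, samples_per_shard):
    """Pack (key, image_bytes) pairs into in-memory tar shards of bounded size."""
    shards = {}
    starts = range(0, len(samples), samples_per_shard)
    for shard_idx, start in enumerate(starts):
        buffer = io.BytesIO()
        with tarfile.open(fileobj=buffer, mode="w") as tar:
            for key, payload in samples[start:start + samples_per_shard]:
                info = tarfile.TarInfo(name=f"{key}.jpg")
                info.size = len(payload)  # required before addfile
                tar.addfile(info, io.BytesIO(payload))
        shards[f"images-{shard_idx:05d}.tar"] = buffer.getvalue()
    return shards


def list_members(tar_bytes):
    # Read a shard back to verify its contents.
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as tar:
        return tar.getnames()


# Hypothetical payloads; real input would be encoded JPEG bytes.
samples = [(f"{i:06d}", b"fake jpeg payload %d" % i) for i in range(5)]
shards = write_shards(samples, samples_per_shard=2)
first_members = list_members(shards["images-00000.tar"])
```

Five samples at two per shard yield three archives, the last one partially filled. Smaller shards give finer-grained shuffling downstream, while larger shards reduce per-file overhead.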