Principle:NVIDIA NeMo Curator Image Ingestion

Metadata
Knowledge Sources	N/A
Domains	Data_Curation, Image_Processing
Last Updated	2026-02-14

Overview

Image Ingestion is a technique for reading image datasets from WebDataset tar archives using GPU-accelerated DALI pipelines, enabling high-throughput data loading for large-scale image curation workflows.

Description

Image Ingestion in NeMo Curator uses NVIDIA's Data Loading Library (DALI) to read image data stored in WebDataset tar archives. DALI pipelines provide GPU-accelerated image decoding, which raises throughput compared with CPU-only decoding. The ingestion process reads tar shards containing image files, decodes them on either a CUDA or a CPU backend, and produces structured image batches for downstream stages such as embedding computation, filtering, and deduplication. Each decoded image is represented as a NumPy array of shape [H, W, C] (height, width, channels) in RGB channel order, accompanied by metadata that includes the image path and a unique image identifier.
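To make the output layout concrete, the following is a minimal sketch of one decoded record and of batching, using only the Python standard library. It is not the NeMo Curator data model: the `DecodedImage` class, its field names, the `make_batches` helper, and the example path are all hypothetical, and a flat `bytes` buffer stands in for the NumPy array so that the channel-last indexing is visible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecodedImage:
    """Hypothetical stand-in for one decoded image plus its metadata."""

    image_id: str    # unique identifier for the image
    image_path: str  # where the image came from
    height: int
    width: int
    pixels: bytes    # row-major, channel-last RGB: height * width * 3 bytes

    def pixel(self, y: int, x: int) -> tuple[int, int, int]:
        """Return the (R, G, B) triple at row y, column x."""
        # In [H, W, C] order the three channel values of a pixel are adjacent.
        offset = (y * self.width + x) * 3
        r, g, b = self.pixels[offset:offset + 3]
        return r, g, b


def make_batches(images: list, batch_size: int) -> list[list]:
    """Group records into consecutive lists of at most batch_size items."""
    return [images[i:i + batch_size] for i in range(0, len(images), batch_size)]


# A 1x2 image: one red pixel followed by one blue pixel (illustrative values).
example = DecodedImage(
    image_id="000001",
    image_path="shard-00000.tar/000001.jpg",  # hypothetical path
    height=1,
    width=2,
    pixels=bytes([255, 0, 0, 0, 0, 255]),
)
```

With a real NumPy array `img` of shape [H, W, C], the same lookup would be `img[y, x]`; the explicit offset arithmetic here only illustrates what channel-last ordering means.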

Usage

Use Image Ingestion as the first stage in any NeMo Curator image curation pipeline. It is appropriate when the source data is stored in WebDataset tar archives and high-throughput GPU-accelerated decoding is desired. This stage should be applied before any embedding, filtering, or deduplication stages that require decoded image data as input.

Theoretical Basis

Image Ingestion builds on the DALI pipeline architecture. DALI reads image data directly from WebDataset tar archives, which store images as sequential entries. This layout favors streaming I/O and avoids a separate file system operation for each image. DALI then decodes the images on the GPU using CUDA-accelerated JPEG and PNG decoders, which offloads the computationally intensive decompression work from the CPU. When GPU decoding is unavailable or not desired, decoding runs on the CPU instead. Together, sequential I/O from tar archives and GPU-accelerated decoding provide substantially higher throughput than traditional CPU-based image loading, which makes the approach suitable for processing millions of images in large-scale curation pipelines.
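The sequential access pattern described above can be sketched with Python's standard `tarfile` module. This is an illustration of streaming a tar shard front to back, not the DALI reader itself: the helper names, the set of image suffixes, and the fake payloads are assumptions made for the example, and the decode step is omitted.

```python
import io
import tarfile
from pathlib import PurePosixPath

# Assumption for this sketch: which suffixes count as image entries.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def build_demo_shard(entries: dict[str, bytes]) -> bytes:
    """Build a small in-memory tar archive so the example is self-contained."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in entries.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def stream_images(shard_bytes: bytes):
    """Yield (key, encoded_bytes) for each image entry, in archive order."""
    # Mode "r|" treats the archive as a forward-only stream: entries are
    # visited once, in order, with no seeking and no per-image file opens.
    with tarfile.open(fileobj=io.BytesIO(shard_bytes), mode="r|") as tar:
        for member in tar:
            path = PurePosixPath(member.name)
            if member.isfile() and path.suffix.lower() in IMAGE_SUFFIXES:
                yield path.stem, tar.extractfile(member).read()


# Payloads are placeholders, not valid JPEG data.
shard = build_demo_shard({
    "000001.jpg": b"fake-jpeg-bytes-1",
    "000001.json": b"{}",
    "000002.jpg": b"fake-jpeg-bytes-2",
})
records = list(stream_images(shard))
```

Here the non-image `.json` entry is skipped and the two image entries come back in the order they were written. In the actual stage, the encoded bytes would be handed to the decoder rather than collected in a list.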

Related Pages

Implementation:NVIDIA_NeMo_Curator_ImageReaderStage
