Implementation:NVIDIA NeMo Curator ImageReaderStage
| Metadata | |
|---|---|
| Knowledge Sources | N/A |
| Domains | Data_Curation, Image_Processing |
| Last Updated | 2026-02-14 |
Overview
ImageReaderStage is a processing stage that reads image datasets from WebDataset tar archives using GPU-accelerated DALI pipelines, producing structured ImageBatch objects containing decoded image data as NumPy arrays.
Description
ImageReaderStage is a dataclass-based processing stage that implements the ProcessingStage[FileGroupTask, ImageBatch] interface. It accepts file group tasks containing paths to tar archives and uses NVIDIA DALI to decode the images contained within those archives. The stage supports configurable batch sizes, threading, and GPU allocation per worker. Images are decoded into NumPy arrays in [H, W, C] RGB format, and each image is wrapped in an ImageObject containing the image path, a unique image identifier, and the decoded image data. The stage produces a list of ImageBatch objects as output, enabling downstream stages to process images in efficient batches.
Usage
Use ImageReaderStage as the entry point of an image curation pipeline when source data is stored in WebDataset tar archives. Configure the dali_batch_size parameter to control memory usage and throughput, and adjust num_threads and num_gpus_per_worker based on available hardware resources.
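As a rough aid for choosing dali_batch_size, the sketch below estimates the memory held by one decoded batch. It assumes 8-bit RGB images of a uniform size, whereas real shards usually mix sizes. The helper names are illustrative and not part of NeMo Curator, and reading num_gpus_per_worker=0.25 as "up to four workers share one GPU" is an assumption about how fractional GPU requests are scheduled.

```python
import math


def decoded_batch_bytes(batch_size: int, height: int, width: int, channels: int = 3) -> int:
    """Bytes needed for `batch_size` decoded uint8 images in [H, W, C] layout."""
    return batch_size * height * width * channels


def max_workers_per_gpu(num_gpus_per_worker: float) -> int:
    """Assumption: a fractional GPU request lets floor(1 / fraction) workers share a GPU."""
    # The small epsilon guards against floating-point error in 1 / fraction.
    return math.floor(1.0 / num_gpus_per_worker + 1e-9)


if __name__ == "__main__":
    size = decoded_batch_bytes(batch_size=100, height=1024, width=1024)
    print(f"{size / 2**20:.0f} MiB per decoded batch")        # 300 MiB
    print(f"{max_workers_per_gpu(0.25)} workers per GPU")     # 4
```

Halving dali_batch_size halves this estimate, which is why a smaller batch size is a reasonable first adjustment when working with large images.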
Code Reference
Source Location
nemo_curator/stages/image/io/image_reader.py, lines 28-152.
Signature
@dataclass
class ImageReaderStage(ProcessingStage[FileGroupTask, ImageBatch]):
    dali_batch_size: int = 100
    verbose: bool = True
    num_threads: int = 8
    num_gpus_per_worker: float = 0.25
    name: str = "image_reader"
Import
from nemo_curator.stages.image.io.image_reader import ImageReaderStage
I/O Contract
| Direction | Type | Description |
|---|---|---|
| Input | FileGroupTask | A task containing paths to tar files (WebDataset shards) to be read and decoded. |
| Output | list[ImageBatch] | A list of ImageBatch objects, each containing ImageObject instances with image_path, image_id, and image_data as NumPy arrays in [H, W, C] RGB format. |
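To make the output side of the contract concrete, here is a minimal sketch of how a downstream consumer might walk the returned list. ImageObjectSketch and ImageBatchSketch are hypothetical stand-ins that carry only the fields named in the table. The `data` attribute name, the sample path, and the use of nested lists in place of NumPy arrays are assumptions made to keep the example dependency-free.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ImageObjectSketch:
    """Hypothetical stand-in for ImageObject (fields taken from the I/O contract)."""
    image_path: str
    image_id: str
    image_data: Any = None  # real stage: NumPy array in [H, W, C] RGB layout


@dataclass
class ImageBatchSketch:
    """Hypothetical stand-in for ImageBatch; the `data` field name is an assumption."""
    data: list[ImageObjectSketch] = field(default_factory=list)


def image_shape(pixels: list) -> tuple[int, int, int]:
    """Return (H, W, C) for a nested-list image: rows -> pixels -> channels."""
    return len(pixels), len(pixels[0]), len(pixels[0][0])


def summarize(batches: list[ImageBatchSketch]) -> list[tuple[str, tuple[int, int, int]]]:
    """Collect (image_id, shape) for every image in every batch."""
    return [(img.image_id, image_shape(img.image_data))
            for batch in batches
            for img in batch.data]


# A 2x3 red RGB image as nested lists, standing in for a decoded NumPy array.
tiny = [[[255, 0, 0] for _ in range(3)] for _ in range(2)]
batches = [ImageBatchSketch(data=[ImageObjectSketch("shard-000.tar/cat.jpg", "cat", tiny)])]
print(summarize(batches))  # [('cat', (2, 3, 3))]
```

With real NumPy arrays the shape would come from `image_data.shape`. The channel count is last, so the third element is 3 for RGB.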
Usage Examples
from nemo_curator.stages.image.io.image_reader import ImageReaderStage

# Create reader stage with default settings
reader = ImageReaderStage(
    dali_batch_size=100,
    num_threads=8,
    num_gpus_per_worker=0.25,
)

# Create reader stage with custom batch size for large images
reader_large = ImageReaderStage(
    dali_batch_size=50,
    verbose=True,
    num_threads=4,
    num_gpus_per_worker=0.5,
    name="large_image_reader",
)