Implementation:NVIDIA NeMo Curator ImageBatch

Knowledge Sources	NVIDIA NeMo Curator
Domains	Data Curation, Image Processing, Pipeline Tasks
Last Updated	2026-02-14 00:00 GMT

Overview

The ImageBatch and ImageObject classes define the data structures for image processing tasks in the NeMo Curator pipeline, carrying images and their accumulated annotations through pipeline stages.

Description

This module defines two complementary dataclasses:

ImageObject represents a single image with its associated metadata and computed attributes:

image_path (str): Path to the image file on disk.
image_id (str): Unique identifier for the image.
metadata (dict[str, Any]): Arbitrary metadata dictionary.
image_data (np.ndarray or None): Raw pixel data as a numpy array in HWC RGB format (Height x Width x Channels).
embedding (np.ndarray or None): Image embedding vector as a numpy array, typically produced by stages like CLIP embedding.
aesthetic_score (float or None): Aesthetic quality score.
nsfw_score (float or None): NSFW probability score.

ImageBatch extends Task with a list of ImageObject instances as its data. It provides:

data (list[ImageObject]): The batch of image objects, defaulting to an empty list.
num_items property: Returns the number of images via len(self.data).
validate(): Currently a placeholder that always returns True (marked with a TODO for future implementation of image path existence checks).
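
The TODO on validate() mentions checking that image paths exist. The sketch below shows one way such a check might be written; it is not the library's implementation. The ImageObject here is a local stand-in that mirrors only the fields needed, and missing_image_paths is a hypothetical helper name.

```python
import os
import tempfile
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ImageObject:
    """Local stand-in mirroring a few fields; not the NeMo Curator class."""

    image_path: str = ""
    image_id: str = ""
    metadata: dict[str, Any] = field(default_factory=dict)


def missing_image_paths(images: list[ImageObject]) -> list[str]:
    """Hypothetical helper: return image_ids whose image_path is not a file."""
    return [img.image_id for img in images if not os.path.isfile(img.image_path)]


# Demonstrate with one real temporary file and one path that does not exist.
with tempfile.TemporaryDirectory() as tmp_dir:
    real_path = os.path.join(tmp_dir, "photo1.jpg")
    with open(real_path, "wb") as handle:
        handle.write(b"")  # empty placeholder; only existence matters here

    images = [
        ImageObject(image_path=real_path, image_id="img_001"),
        ImageObject(image_path=os.path.join(tmp_dir, "absent.jpg"), image_id="img_002"),
    ]
    missing = missing_image_paths(images)

print(missing)
```

A validate() built on a check like this would return True only when the list of missing ids is empty. Whether the eventual implementation takes this form is an assumption.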

ImageObject fields accumulate annotations as images pass through pipeline stages. For example, an embedding stage populates the embedding field, and classification stages populate aesthetic_score and nsfw_score.
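
The accumulation pattern can be illustrated with a minimal sketch. It assumes a local stand-in for ImageObject (array fields typed as Any so that numpy is not required) and two hypothetical functions, fake_embedding_stage and fake_aesthetic_stage, that write placeholder values. These are not real NeMo Curator stages.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class ImageObject:
    """Local stand-in for the library class; field names follow the page above."""

    image_path: str = ""
    image_id: str = ""
    metadata: dict[str, Any] = field(default_factory=dict)
    image_data: Any = None
    embedding: Any = None
    aesthetic_score: Optional[float] = None
    nsfw_score: Optional[float] = None


def fake_embedding_stage(images: list[ImageObject]) -> list[ImageObject]:
    """Hypothetical stage: fill embedding with a placeholder vector."""
    for img in images:
        img.embedding = [0.0, 0.0, 0.0, 0.0]
    return images


def fake_aesthetic_stage(images: list[ImageObject]) -> list[ImageObject]:
    """Hypothetical stage: fill aesthetic_score with a placeholder value."""
    for img in images:
        img.aesthetic_score = 0.5
    return images


images = [ImageObject(image_path="/data/images/photo1.jpg", image_id="img_001")]

# Freshly created objects carry no annotations yet.
before = (images[0].embedding, images[0].aesthetic_score, images[0].nsfw_score)

# Each stage fills in its own field and leaves the others untouched.
images = fake_aesthetic_stage(fake_embedding_stage(images))
after = (images[0].embedding, images[0].aesthetic_score, images[0].nsfw_score)

print(before)
print(after)
```

Fields that no stage has written stay None, which is why downstream code should check for None before reading a score.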

Usage

Use ImageBatch and ImageObject when building image curation workflows. ImageBatch is the task type consumed and produced by image processing stages such as CLIP embedding, aesthetic scoring, NSFW filtering, and deduplication.

Code Reference

Source Location

Repository: NeMo-Curator
File: nemo_curator/tasks/image.py
Lines: 1-69

Signature

@dataclass
class ImageObject:
    image_path: str = ""
    image_id: str = ""
    metadata: dict[str, Any] = field(default_factory=dict)
    image_data: np.ndarray | None = None
    embedding: np.ndarray | None = None
    aesthetic_score: float | None = None
    nsfw_score: float | None = None


@dataclass
class ImageBatch(Task):
    data: list[ImageObject] = field(default_factory=list)

    def validate(self) -> bool: ...

    @property
    def num_items(self) -> int: ...

Import

from nemo_curator.tasks.image import ImageBatch, ImageObject
# or
from nemo_curator.tasks import ImageBatch, ImageObject

I/O Contract

ImageObject Fields

Name	Type	Required	Description
image_path	str	No	Path to the image file on disk (default: empty string)
image_id	str	No	Unique identifier for the image (default: empty string)
metadata	dict[str, Any]	No	Arbitrary metadata dictionary (default: empty dict)
image_data	np.ndarray or None	No	Raw pixel data in HWC RGB format (default: None)
embedding	np.ndarray or None	No	Embedding vector produced by embedding stages (default: None)
aesthetic_score	float or None	No	Aesthetic quality score from classification stages (default: None)
nsfw_score	float or None	No	NSFW probability score from classification stages (default: None)

ImageBatch Inputs

Name	Type	Required	Description
data	list[ImageObject]	No	List of image objects (default: empty list)
task_id	str	Yes	Unique identifier for this task (inherited from Task)
dataset_name	str	Yes	Name of the dataset this task belongs to (inherited from Task)

ImageBatch Outputs

Name	Type	Description
data	list[ImageObject]	The batch of image objects with accumulated annotations
num_items	int	Number of images in the batch
validate()	bool	Currently always returns True (placeholder)

Usage Examples

Creating an ImageBatch

from nemo_curator.tasks.image import ImageBatch, ImageObject

# Create individual image objects
img1 = ImageObject(
    image_path="/data/images/photo1.jpg",
    image_id="img_001",
    metadata={"source": "flickr", "resolution": "1024x768"},
)
img2 = ImageObject(
    image_path="/data/images/photo2.jpg",
    image_id="img_002",
    metadata={"source": "flickr", "resolution": "800x600"},
)

# Create a batch
batch = ImageBatch(
    task_id="image_task_001",
    dataset_name="flickr_dataset",
    data=[img1, img2],
)

print(batch.num_items)  # 2

Accessing Image Annotations

# `batch` is the ImageBatch created in the previous example.
# After pipeline stages have populated scores and embeddings:
for img in batch.data:
    if img.aesthetic_score is not None:
        print(f"{img.image_id}: aesthetic={img.aesthetic_score:.2f}")
    if img.nsfw_score is not None:
        print(f"{img.image_id}: nsfw={img.nsfw_score:.2f}")
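
Once scores are populated, a common next step is to keep only images that pass score thresholds. The following is a minimal sketch under stated assumptions: ImageObject is a local stand-in, filter_images is a hypothetical helper, and the threshold values are arbitrary. It does not reproduce the behavior of the library's filter stages.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ImageObject:
    """Local stand-in with only the fields used by this sketch."""

    image_id: str = ""
    aesthetic_score: Optional[float] = None
    nsfw_score: Optional[float] = None


def filter_images(
    images: list[ImageObject],
    min_aesthetic: float = 0.5,  # arbitrary illustrative threshold
    max_nsfw: float = 0.5,  # arbitrary illustrative threshold
) -> list[ImageObject]:
    """Hypothetical helper: keep images that have both scores and pass both thresholds."""
    kept = []
    for img in images:
        # Images missing either score are dropped in this sketch.
        if img.aesthetic_score is None or img.nsfw_score is None:
            continue
        if img.aesthetic_score >= min_aesthetic and img.nsfw_score <= max_nsfw:
            kept.append(img)
    return kept


images = [
    ImageObject(image_id="img_001", aesthetic_score=0.8, nsfw_score=0.1),
    ImageObject(image_id="img_002", aesthetic_score=0.3, nsfw_score=0.1),
    ImageObject(image_id="img_003", aesthetic_score=0.9, nsfw_score=None),
]

kept_ids = [img.image_id for img in filter_images(images)]
print(kept_ids)
```

Dropping unscored images is a conservative choice made for this example. A workflow that runs filtering before all scoring stages have finished might prefer to keep them instead.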

Related Pages

Environment:NVIDIA_NeMo_Curator_Python_Linux_Base
NVIDIA_NeMo_Curator_Task_Base - Abstract base class that ImageBatch extends
NVIDIA_NeMo_Curator_DocumentBatch - Analogous task type for text documents
NVIDIA_NeMo_Curator_AudioBatch - Analogous task type for audio data
NVIDIA_NeMo_Curator_ImageEmbeddingStage - Stage that populates image embeddings
NVIDIA_NeMo_Curator_ImageAestheticFilterStage - Stage that uses aesthetic scores
NVIDIA_NeMo_Curator_ImageNSFWFilterStage - Stage that uses NSFW scores
