Implementation:NVIDIA NeMo Curator Image Convert Stage

Knowledge Sources	NVIDIA NeMo Curator
Domains	Image Processing, Data Conversion, Pipeline Integration
Last Updated	2026-02-14 00:00 GMT

Overview

ConvertImageBatchToDocumentBatchStage is a bridge stage that converts an ImageBatch into a DocumentBatch, enabling image pipeline outputs to flow into document-oriented downstream stages such as Parquet writers.

Description

ConvertImageBatchToDocumentBatchStage is a Python dataclass that extends ProcessingStage[ImageBatch, DocumentBatch]. It extracts specified fields from each ImageObject in an ImageBatch and builds a pandas DataFrame, which is then wrapped in a DocumentBatch.

When the fields parameter is non-empty, the stage iterates over the field names and, for each one, collects that attribute from every ImageObject using getattr, building a dictionary of lists. If no fields are specified, it extracts only image_id from each object. The resulting dictionary is converted into a pd.DataFrame and wrapped in a DocumentBatch that carries over dataset_name, _metadata, and _stage_perf from the input task and receives a new task_id derived from the original one.

Usage

Use this stage at the boundary between image processing pipelines and document processing pipelines. It is typically placed after image filtering or embedding stages when the downstream stages require tabular DocumentBatch data (e.g., writing image metadata to Parquet format).

Code Reference

Source Location

Repository: NeMo-Curator
File: nemo_curator/stages/image/io/convert.py
Lines: 1-53

Signature

@dataclass
class ConvertImageBatchToDocumentBatchStage(ProcessingStage[ImageBatch, DocumentBatch]):
    fields: list[str] = field(default_factory=list)
    name: str = "convert_image_batch_to_document_batch"

    def process(self, task: ImageBatch) -> DocumentBatch: ...

Import

from nemo_curator.stages.image.io.convert import ConvertImageBatchToDocumentBatchStage

I/O Contract

Inputs

Name	Type	Required	Description
fields	`list[str]`	No	List of `ImageObject` attribute names to extract. Defaults to empty list, which causes the stage to extract only `image_id`.
task	`ImageBatch`	Yes	The image batch containing a list of `ImageObject` instances to convert

Outputs

Name	Type	Description
result	`DocumentBatch`	A document batch containing a pandas DataFrame with columns corresponding to the extracted fields, preserving task metadata

Key Implementation Details

Field Extraction Logic

The process() method handles two cases for field extraction:

def process(self, task: ImageBatch) -> DocumentBatch:
    data = {}
    if self.fields:
        for field in self.fields:
            data[field] = [getattr(image_obj, field, None) for image_obj in task.data]
    else:
        # Default to image_id if no fields specified
        data["image_id"] = [image_obj.image_id for image_obj in task.data]
    df = pd.DataFrame(data)

    return DocumentBatch(
        task_id=f"{task.task_id}_{self.name}",
        dataset_name=task.dataset_name,
        data=df,
        _metadata=task._metadata,
        _stage_perf=task._stage_perf,
    )

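When a requested field does not exist on an ImageObject, getattr(image_obj, field, None) returns None instead of raising AttributeError, so the resulting DataFrame may contain null values for missing attributes. The default branch reads image_obj.image_id directly, with no fallback value.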
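The missing-attribute behavior can be illustrated with a minimal sketch that uses only the standard library. FakeImageObject and extract_fields are hypothetical stand-ins written for this example; they are not part of NeMo Curator, and the sketch stops at the dictionary of lists rather than building the DataFrame.

```python
from dataclasses import dataclass


@dataclass
class FakeImageObject:
    """Hypothetical stand-in for ImageObject; not the real NeMo Curator class."""

    image_id: str


def extract_fields(objects, fields):
    """Build the same dictionary of lists that process() builds when fields is non-empty."""
    return {name: [getattr(obj, name, None) for obj in objects] for name in fields}


scored = FakeImageObject(image_id="img_0")
scored.aesthetic_score = 0.87  # attribute attached later, as a scoring stage might do
unscored = FakeImageObject(image_id="img_1")  # never received a score

data = extract_fields(
    [scored, unscored],
    ["image_id", "aesthetic_score", "typo_field"],  # "typo_field" exists on no object
)
print(data)
```

Under these assumptions, a misspelled field name does not raise an error; it produces a column made up entirely of None values, so checking the output columns for unexpected all-null data may be worthwhile.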

Task ID Propagation

The output DocumentBatch receives a composite task_id formed by appending the stage name to the original task ID: f"{task.task_id}_{self.name}". This ensures traceability through multi-stage pipelines.
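As a small illustration of the composite ID, the sketch below applies the same f-string to a made-up input ID. derive_task_id is a hypothetical helper used only here, and "image_batch_0" is an invented example value; the stage name is the default shown in the signature.

```python
# Default stage name, as shown in the dataclass signature.
STAGE_NAME = "convert_image_batch_to_document_batch"


def derive_task_id(task_id: str, stage_name: str = STAGE_NAME) -> str:
    """Hypothetical helper mirroring f"{task.task_id}_{self.name}"."""
    return f"{task_id}_{stage_name}"


new_id = derive_task_id("image_batch_0")  # "image_batch_0" is a made-up input ID
print(new_id)
```

Because the original ID is kept as a prefix, the output batch can be matched back to its input batch by prefix comparison.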

Usage Examples

Basic Usage

from nemo_curator.stages.image.io.convert import ConvertImageBatchToDocumentBatchStage

# Convert with default fields (image_id only)
convert_stage = ConvertImageBatchToDocumentBatchStage()

Extracting Multiple Fields

from nemo_curator.stages.image.io.convert import ConvertImageBatchToDocumentBatchStage

# Extract image_id, aesthetic_score, and nsfw_score from each ImageObject
convert_stage = ConvertImageBatchToDocumentBatchStage(
    fields=["image_id", "aesthetic_score", "nsfw_score"]
)

Related Pages

Environment:NVIDIA_NeMo_Curator_Python_Linux_Base
NVIDIA_NeMo_Curator_ImageFilterBase - Base filter class producing ImageBatch outputs
NVIDIA_NeMo_Curator_ImageEmbeddingStage - Embedding stage that adds fields to ImageObject
NVIDIA_NeMo_Curator_ParquetWriter - Common downstream stage for writing DocumentBatch to Parquet
