Implementation:NVIDIA NeMo Curator Image Convert Stage

From Leeroopedia
Knowledge Sources
Domains: Image Processing, Data Conversion, Pipeline Integration
Last Updated: 2026-02-14 00:00 GMT

Overview

ConvertImageBatchToDocumentBatchStage is a bridge stage that converts an ImageBatch into a DocumentBatch, enabling image pipeline outputs to flow into document-oriented downstream stages such as Parquet writers.

Description

ConvertImageBatchToDocumentBatchStage is a Python dataclass that extends ProcessingStage[ImageBatch, DocumentBatch]. It extracts the specified fields from each ImageObject in an ImageBatch and builds a pandas DataFrame, which is then wrapped in a DocumentBatch.

When the fields parameter is non-empty, the stage loops over the field names and, for each one, collects that attribute from every ImageObject using getattr, building a dictionary that maps each field name to a list of values. If fields is empty, it extracts only image_id from each object. The resulting dictionary is converted into a pd.DataFrame and wrapped in a DocumentBatch that carries over dataset_name, _metadata, and _stage_perf from the input task. The task_id is not copied verbatim: it is derived from the original by appending the stage name.

Usage

Use this stage at the boundary between image processing pipelines and document processing pipelines. It is typically placed after image filtering or embedding stages when the downstream stages require tabular DocumentBatch data (e.g., writing image metadata to Parquet format).

Code Reference

Source Location

  • Repository: NeMo-Curator
  • File: nemo_curator/stages/image/io/convert.py
  • Lines: 1-53

Signature

@dataclass
class ConvertImageBatchToDocumentBatchStage(ProcessingStage[ImageBatch, DocumentBatch]):
    fields: list[str] = field(default_factory=list)
    name: str = "convert_image_batch_to_document_batch"

    def process(self, task: ImageBatch) -> DocumentBatch: ...

Import

from nemo_curator.stages.image.io.convert import ConvertImageBatchToDocumentBatchStage

I/O Contract

Inputs

Name | Type | Required | Description
fields | list[str] | No | List of ImageObject attribute names to extract. Defaults to an empty list, which causes the stage to extract only image_id.
task | ImageBatch | Yes | The image batch containing a list of ImageObject instances to convert.

Outputs

Name | Type | Description
result | DocumentBatch | A document batch containing a pandas DataFrame with one column per extracted field, carrying over the input task's metadata.

Key Implementation Details

Field Extraction Logic

The process() method handles two cases for field extraction:

def process(self, task: ImageBatch) -> DocumentBatch:
    data = {}
    if self.fields:
        for field in self.fields:
            data[field] = [getattr(image_obj, field, None) for image_obj in task.data]
    else:
        # Default to image_id if no fields specified
        data["image_id"] = [image_obj.image_id for image_obj in task.data]
    df = pd.DataFrame(data)

    return DocumentBatch(
        task_id=f"{task.task_id}_{self.name}",
        dataset_name=task.dataset_name,
        data=df,
        _metadata=task._metadata,
        _stage_perf=task._stage_perf,
    )

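In the fields branch, getattr is called with a default of None, so a field that does not exist on an ImageObject yields None and the resulting DataFrame may contain null values for missing attributes. The default branch accesses image_obj.image_id directly, so it raises AttributeError if an object lacks that attribute.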
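
The difference between the two branches can be seen in a minimal, standard-library-only sketch. FakeImageObject and extract_fields are hypothetical names introduced here for illustration; they are not part of NeMo Curator, and the sketch stops at the dictionary of lists rather than building a DataFrame.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeImageObject:
    """Hypothetical stand-in for ImageObject (not the NeMo Curator class)."""

    image_id: str
    aesthetic_score: Optional[float] = None


def extract_fields(objects, fields):
    """Mirror the stage's extraction logic with plain Python containers."""
    if fields:
        # Missing attributes fall back to None instead of raising.
        return {name: [getattr(obj, name, None) for obj in objects] for name in fields}
    # Default branch: direct attribute access, image_id only.
    return {"image_id": [obj.image_id for obj in objects]}


images = [FakeImageObject("img_0", 0.7), FakeImageObject("img_1")]

# "nsfw_score" is not defined on FakeImageObject, so its column is all None.
columns = extract_fields(images, ["image_id", "aesthetic_score", "nsfw_score"])
print(columns)

# With no fields requested, only image_id is extracted.
default_columns = extract_fields(images, [])
print(default_columns)
```

Because every list has one entry per object, the dictionary has the shape pd.DataFrame expects: one column per requested field, with None wherever an attribute is absent.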

Task ID Propagation

The output DocumentBatch receives a composite task_id formed by appending the stage name to the original task ID: f"{task.task_id}_{self.name}". Because the original ID remains a prefix of the new one, the originating task stays identifiable as it moves through a multi-stage pipeline.
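
As a small illustration of the naming scheme, the sketch below applies the same f-string to a made-up task ID. The helper derive_task_id and the value "image_batch_0" are assumptions for this example only; the stage name is the default from the signature above.

```python
STAGE_NAME = "convert_image_batch_to_document_batch"  # default `name` of the stage


def derive_task_id(task_id: str, stage_name: str = STAGE_NAME) -> str:
    """Hypothetical helper reproducing the stage's task_id format."""
    return f"{task_id}_{stage_name}"


original_id = "image_batch_0"  # made-up input task ID
new_id = derive_task_id(original_id)
print(new_id)

# The original ID is still recoverable as a prefix of the composite ID.
suffix = f"_{STAGE_NAME}"
recovered = new_id[: -len(suffix)] if new_id.endswith(suffix) else new_id
print(recovered)
```

If the stage is constructed with a custom name, that name is what appears in the suffix, since the format string reads self.name.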

Usage Examples

Basic Usage

from nemo_curator.stages.image.io.convert import ConvertImageBatchToDocumentBatchStage

# Convert with default fields (image_id only)
convert_stage = ConvertImageBatchToDocumentBatchStage()

Extracting Multiple Fields

from nemo_curator.stages.image.io.convert import ConvertImageBatchToDocumentBatchStage

# Extract image_id, aesthetic_score, and nsfw_score from each ImageObject
convert_stage = ConvertImageBatchToDocumentBatchStage(
    fields=["image_id", "aesthetic_score", "nsfw_score"]
)
