Implementation:NVIDIA NeMo Curator Image Convert Stage
| Knowledge Sources | |
|---|---|
| Domains | Image Processing, Data Conversion, Pipeline Integration |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
ConvertImageBatchToDocumentBatchStage is a bridge stage that converts an ImageBatch into a DocumentBatch, enabling image pipeline outputs to flow into document-oriented downstream stages such as Parquet writers.
Description
ConvertImageBatchToDocumentBatchStage is a Python dataclass that extends ProcessingStage[ImageBatch, DocumentBatch]. It extracts specified fields from each ImageObject in an ImageBatch and builds a pandas DataFrame, which is then wrapped in a DocumentBatch.
When the fields parameter is populated, the stage iterates over the requested field names and reads each one from every ImageObject using getattr, building a dictionary of lists (one list per field). If no fields are specified, it extracts only image_id from each object. The resulting dictionary is converted into a pd.DataFrame and wrapped in a DocumentBatch that carries over dataset_name, _metadata, and _stage_perf from the input task. The output task_id is derived from the original task_id by appending the stage name.
Usage
Use this stage at the boundary between image processing pipelines and document processing pipelines. It is typically placed after image filtering or embedding stages when the downstream stages require tabular DocumentBatch data (e.g., writing image metadata to Parquet format).
Code Reference
Source Location
- Repository: NeMo-Curator
- File: nemo_curator/stages/image/io/convert.py
- Lines: 1-53
Signature
@dataclass
class ConvertImageBatchToDocumentBatchStage(ProcessingStage[ImageBatch, DocumentBatch]):
    fields: list[str] = field(default_factory=list)
    name: str = "convert_image_batch_to_document_batch"

    def process(self, task: ImageBatch) -> DocumentBatch: ...
Import
from nemo_curator.stages.image.io.convert import ConvertImageBatchToDocumentBatchStage
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| fields | list[str] | No | List of ImageObject attribute names to extract. Defaults to an empty list, which causes the stage to extract only image_id. |
| task | ImageBatch | Yes | The image batch containing a list of ImageObject instances to convert. |
Outputs
| Name | Type | Description |
|---|---|---|
| result | DocumentBatch | A document batch containing a pandas DataFrame with columns corresponding to the extracted fields, preserving task metadata. |
Key Implementation Details
Field Extraction Logic
The process() method handles two cases for field extraction:
def process(self, task: ImageBatch) -> DocumentBatch:
    data = {}
    if self.fields:
        for field in self.fields:
            data[field] = [getattr(image_obj, field, None) for image_obj in task.data]
    else:
        # Default to image_id if no fields specified
        data["image_id"] = [image_obj.image_id for image_obj in task.data]
    df = pd.DataFrame(data)
    return DocumentBatch(
        task_id=f"{task.task_id}_{self.name}",
        dataset_name=task.dataset_name,
        data=df,
        _metadata=task._metadata,
        _stage_perf=task._stage_perf,
    )
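When a requested field does not exist on an ImageObject, getattr returns None instead of raising an error, so the resulting DataFrame may contain null values for missing attributes. The default branch reads image_obj.image_id directly, without a fallback value.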
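Because a misspelled field name yields a column of nulls rather than an error, it can help to check for all-null columns before writing results. The following is a minimal sketch of the getattr-based extraction using only the standard library. FakeImageObject, extract_columns, and all_null_fields are hypothetical names introduced for illustration and are not part of NeMo Curator.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeImageObject:
    """Hypothetical stand-in for ImageObject (not the NeMo Curator class)."""

    image_id: str
    aesthetic_score: Optional[float] = None


def extract_columns(images, fields):
    """Build one list per requested field, using None for missing attributes."""
    return {name: [getattr(img, name, None) for img in images] for name in fields}


def all_null_fields(columns):
    """Return the names of non-empty columns that contain only None values."""
    return [
        name
        for name, values in columns.items()
        if values and all(value is None for value in values)
    ]


images = [FakeImageObject("img_0", 0.71), FakeImageObject("img_1", 0.42)]

# "nsfw_score" is not defined on the stand-in class, so its column is all None.
columns = extract_columns(images, ["image_id", "aesthetic_score", "nsfw_score"])
suspect = all_null_fields(columns)

print(columns["nsfw_score"])
print(suspect)
```

An all-null column is only a hint: it may also mean that an upstream stage legitimately left the attribute unset for every image in the batch.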
Task ID Propagation
The output DocumentBatch receives a composite task_id formed by appending the stage name to the original task ID: f"{task.task_id}_{self.name}". This ensures traceability through multi-stage pipelines.
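As an illustration of the naming scheme, the sketch below builds a composite task ID in the same way and recovers the original ID by stripping the stage suffix. The helper functions and the example ID "image_batch_7" are assumptions made for this example; NeMo Curator is not known to provide such helpers.

```python
STAGE_NAME = "convert_image_batch_to_document_batch"


def make_task_id(parent_task_id: str, stage_name: str) -> str:
    """Append the stage name to the parent task ID, as the stage does."""
    return f"{parent_task_id}_{stage_name}"


def recover_parent_task_id(composite_id: str, stage_name: str) -> str:
    """Strip the trailing '_<stage_name>' suffix if it is present."""
    suffix = f"_{stage_name}"
    if composite_id.endswith(suffix):
        return composite_id[: -len(suffix)]
    return composite_id


# "image_batch_7" is a made-up ID used only for demonstration.
composite = make_task_id("image_batch_7", STAGE_NAME)
recovered = recover_parent_task_id(composite, STAGE_NAME)

print(composite)
print(recovered)
```

Stripping a known suffix is more reliable than splitting on underscores, since both the task ID and the stage name contain underscores themselves.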
Usage Examples
Basic Usage
from nemo_curator.stages.image.io.convert import ConvertImageBatchToDocumentBatchStage
# Convert with default fields (image_id only)
convert_stage = ConvertImageBatchToDocumentBatchStage()
Extracting Multiple Fields
from nemo_curator.stages.image.io.convert import ConvertImageBatchToDocumentBatchStage
# Extract image_id, aesthetic_score, and nsfw_score from each ImageObject
convert_stage = ConvertImageBatchToDocumentBatchStage(
    fields=["image_id", "aesthetic_score", "nsfw_score"]
)
Related Pages
- Environment:NVIDIA_NeMo_Curator_Python_Linux_Base
- NVIDIA_NeMo_Curator_ImageFilterBase - Base filter class producing ImageBatch outputs
- NVIDIA_NeMo_Curator_ImageEmbeddingStage - Embedding stage that adds fields to ImageObject
- NVIDIA_NeMo_Curator_ParquetWriter - Common downstream stage for writing DocumentBatch to Parquet