Implementation:NVIDIA NeMo Curator ImageWriterStage

From Leeroopedia
Metadata
Knowledge Sources: N/A
Domains: Data_Curation, Image_Processing, Data_Engineering
Last Updated: 2026-02-14

Overview

ImageWriterStage is a processing stage that persists curated images to WebDataset tar archives with corresponding Parquet metadata files, producing FileGroupTask outputs referencing the written files.

Description

ImageWriterStage is a dataclass-based processing stage that implements the ProcessingStage[ImageBatch, FileGroupTask] interface. It takes ImageBatch objects containing curated image data and writes the images into WebDataset tar archives in a configurable output directory. Each tar shard contains up to images_per_tar images, and corresponding Parquet metadata files are generated alongside the tar archives. The stage supports deterministic naming of output shards for reproducibility, and optionally removes image data from memory after writing. The output is a FileGroupTask containing paths to the generated .tar and .parquet files.
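
The sharding behavior described above can be illustrated with a minimal, standard-library-only sketch. It is not the NeMo Curator implementation: the helper names (chunk, shard_name, write_shards), the hash-based naming scheme, and the .jpg member names are all assumptions made for illustration, and Parquet metadata writing is omitted.

```python
import hashlib
import io
import os
import tarfile
import tempfile


def chunk(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def shard_name(image_ids):
    """Hypothetical deterministic name: a hash of the sorted image IDs."""
    joined = "\n".join(sorted(image_ids)).encode("utf-8")
    return "images-" + hashlib.sha256(joined).hexdigest()[:16]


def write_shards(images, output_dir, images_per_tar):
    """Write (image_id, payload_bytes) pairs into tar shards; return tar paths."""
    os.makedirs(output_dir, exist_ok=True)
    tar_paths = []
    for group in chunk(images, images_per_tar):
        base = shard_name([image_id for image_id, _ in group])
        path = os.path.join(output_dir, base + ".tar")
        with tarfile.open(path, "w") as tar:
            for image_id, payload in group:
                # One tar member per image, keyed by its ID.
                info = tarfile.TarInfo(name=f"{image_id}.jpg")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
        tar_paths.append(path)
    return tar_paths


# Demo with placeholder payloads (not real encoded images).
demo_images = [(f"img{i:04d}", b"fake-bytes-%d" % i) for i in range(5)]
demo_dir = tempfile.mkdtemp()
demo_paths = write_shards(demo_images, demo_dir, images_per_tar=2)
```

With five images and images_per_tar=2, the sketch produces three shards, the last holding a single image. Because the name depends only on the image IDs, rerunning it on the same input yields the same file names, which is the property deterministic naming is meant to provide.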

Usage

Use ImageWriterStage as the final stage in an image curation pipeline to persist curated images to disk. Set output_dir to the desired output directory. Adjust images_per_tar based on downstream data loading requirements. Enable deterministic_name for reproducible output file naming across pipeline runs.
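
When choosing images_per_tar, a rough estimate of shard count and shard size can help. The sketch below is plain arithmetic under two assumptions that are not stated in the source: each batch is sharded independently, and tar header overhead is ignored. The function name estimate_shards and the sample figures are hypothetical.

```python
import math


def estimate_shards(num_images, images_per_tar, avg_image_bytes):
    """Return (shard_count, approx_bytes_in_a_full_shard) for one batch."""
    if images_per_tar <= 0:
        raise ValueError("images_per_tar must be positive")
    shard_count = math.ceil(num_images / images_per_tar)
    # A full shard holds images_per_tar images (fewer if the batch is small).
    # Tar headers and padding are ignored in this estimate.
    full_shard_bytes = min(num_images, images_per_tar) * avg_image_bytes
    return shard_count, full_shard_bytes


# 2,500 images averaging about 200 kB each, compared at two shard sizes.
default_plan = estimate_shards(2500, images_per_tar=1000, avg_image_bytes=200_000)
smaller_plan = estimate_shards(2500, images_per_tar=500, avg_image_bytes=200_000)
```

Smaller shards mean more files to open but finer-grained shuffling and parallelism for downstream loaders; larger shards reduce file count at the cost of coarser units of work.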

Code Reference

Source Location

nemo_curator/stages/image/io/image_writer.py, lines 33-215.

Signature

@dataclass
class ImageWriterStage(ProcessingStage[ImageBatch, FileGroupTask]):
    output_dir: str
    images_per_tar: int = 1000
    verbose: bool = False
    deterministic_name: bool = True
    remove_image_data: bool = False
    name: str = "image_writer"

Import

from nemo_curator.stages.image.io.image_writer import ImageWriterStage

I/O Contract

Direction | Type | Description
Input | ImageBatch | An ImageBatch containing ImageObject instances with image_data as NumPy arrays in [H, W, C] RGB format and associated metadata.
Output | FileGroupTask | A FileGroupTask containing paths to the generated .tar and .parquet files written to output_dir.
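
A downstream consumer usually wants to match each tar archive with its metadata file. The following sketch works on a plain list of path strings and assumes, as a convention not confirmed by the source, that a shard's .tar and .parquet files share the same base name. The helper pair_outputs and the example paths are hypothetical; how the paths are read from a FileGroupTask is left out.

```python
from pathlib import PurePosixPath


def pair_outputs(paths):
    """Group paths by base name into {stem: {"tar": path, "parquet": path}}."""
    pairs = {}
    for path in paths:
        parsed = PurePosixPath(path)
        kind = parsed.suffix.lstrip(".")
        if kind in ("tar", "parquet"):
            pairs.setdefault(parsed.stem, {})[kind] = path
    return pairs


# Example paths only; real shard names depend on the writer's naming scheme.
example_paths = [
    "/data/curated_output/images-0001.tar",
    "/data/curated_output/images-0001.parquet",
    "/data/curated_output/images-0002.tar",
]
paired = pair_outputs(example_paths)

# Shards missing either the archive or its metadata file.
incomplete = sorted(stem for stem, files in paired.items() if len(files) < 2)
```

Checking for incomplete pairs before training or further processing is a cheap way to catch an interrupted write.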

Usage Examples

from nemo_curator.stages.image.io.image_writer import ImageWriterStage

# Create a writer stage; images_per_tar and deterministic_name are shown at their default values
writer = ImageWriterStage(
    output_dir="/data/curated_output/",
    images_per_tar=1000,
    deterministic_name=True,
)

# Create writer stage with custom shard size and memory optimization
writer_custom = ImageWriterStage(
    output_dir="/data/curated_output_v2/",
    images_per_tar=500,
    verbose=True,
    deterministic_name=True,
    remove_image_data=True,
    name="final_image_writer",
)
