Implementation:NVIDIA NeMo Curator ImageWriterStage
| Metadata | |
|---|---|
| Knowledge Sources | N/A |
| Domains | Data_Curation, Image_Processing, Data_Engineering |
| Last Updated | 2026-02-14 |
Overview
ImageWriterStage is a processing stage that persists curated images to WebDataset tar archives with corresponding Parquet metadata files, producing FileGroupTask outputs referencing the written files.
Description
ImageWriterStage is a dataclass-based processing stage that implements the ProcessingStage[ImageBatch, FileGroupTask] interface. It takes ImageBatch objects containing curated image data and writes the images into WebDataset tar archives in a configurable output directory. Each tar shard contains up to images_per_tar images, and corresponding Parquet metadata files are generated alongside the tar archives. The stage supports deterministic naming of output shards for reproducibility, and optionally removes image data from memory after writing. The output is a FileGroupTask containing paths to the generated .tar and .parquet files.
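The sharding behavior described above can be illustrated with a minimal standard-library sketch. This is not the stage's actual implementation: the function name write_shards, the hash-based shard naming scheme, the .jpg member extension, and the use of raw bytes in place of encoded images are all assumptions made for illustration, and the Parquet metadata step is omitted.

```python
import hashlib
import io
import tarfile
from pathlib import Path


def write_shards(items, output_dir, images_per_tar=1000):
    """Write (key, payload_bytes) pairs into tar shards of at most images_per_tar members.

    Hypothetical helper: illustrates chunked tar writing only, not ImageWriterStage itself.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    shard_paths = []
    for start in range(0, len(items), images_per_tar):
        chunk = items[start:start + images_per_tar]
        # Assumed naming scheme: derive the shard name from the member keys,
        # so the same input always produces the same file name.
        joined_keys = "\n".join(key for key, _ in chunk)
        digest = hashlib.sha256(joined_keys.encode("utf-8")).hexdigest()[:12]
        shard_path = out / f"images-{digest}.tar"
        with tarfile.open(shard_path, "w") as tar:
            for key, payload in chunk:
                info = tarfile.TarInfo(name=f"{key}.jpg")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
        shard_paths.append(str(shard_path))
    return shard_paths
```

Because the shard name in this sketch depends only on the keys it contains, rerunning it on the same input overwrites the same files instead of creating new ones, which is the kind of reproducibility that deterministic naming is meant to provide.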
Usage
Use ImageWriterStage as the final stage in an image curation pipeline to persist curated images to disk. Set output_dir to the desired output directory. Adjust images_per_tar based on downstream data loading requirements. Enable deterministic_name for reproducible output file naming across pipeline runs.
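When choosing images_per_tar, it can help to estimate how many shards a batch will produce. The helper below is a hypothetical sketch, assuming shards are filled sequentially and only the last shard of a batch may be partially full.

```python
def expected_shard_count(num_images, images_per_tar=1000):
    """Estimate the number of tar shards for one batch (hypothetical helper)."""
    if images_per_tar <= 0:
        raise ValueError("images_per_tar must be positive")
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    # Ceiling division using integers only, avoiding float rounding.
    return -(-num_images // images_per_tar)
```

Under these assumptions, a batch of 2,500 images with the default images_per_tar of 1000 would yield three shards, the last holding 500 images.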
Code Reference
Source Location
nemo_curator/stages/image/io/image_writer.py, lines 33-215.
Signature
@dataclass
class ImageWriterStage(ProcessingStage[ImageBatch, FileGroupTask]):
    output_dir: str
    images_per_tar: int = 1000
    verbose: bool = False
    deterministic_name: bool = True
    remove_image_data: bool = False
    name: str = "image_writer"
Import
from nemo_curator.stages.image.io.image_writer import ImageWriterStage
I/O Contract
| Direction | Type | Description |
|---|---|---|
| Input | ImageBatch | An ImageBatch containing ImageObject instances with image_data as NumPy arrays in [H, W, C] RGB format and associated metadata. |
| Output | FileGroupTask | A FileGroupTask containing paths to the generated .tar and .parquet files written to output_dir. |
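Since the output references both .tar and .parquet files, a downstream consumer may want to check that every archive has matching metadata. The sketch below groups a list of output paths by base name. It assumes that a tar shard and its Parquet file share the same file stem, which the description above does not confirm; the function names pair_outputs and unpaired are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def pair_outputs(paths):
    """Group output paths by file stem, keeping only .tar and .parquet files."""
    grouped = defaultdict(dict)
    for raw_path in paths:
        path = PurePosixPath(raw_path)
        suffix = path.suffix.lstrip(".")
        if suffix in ("tar", "parquet"):
            grouped[path.stem][suffix] = str(path)
    return dict(grouped)


def unpaired(grouped):
    """Return the stems that are missing either the tar or the parquet file."""
    return sorted(
        stem for stem, files in grouped.items()
        if set(files) != {"tar", "parquet"}
    )
```

An empty result from unpaired suggests that each shard was written together with its metadata file, under the naming assumption stated above.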
Usage Examples
from nemo_curator.stages.image.io.image_writer import ImageWriterStage

# Create writer stage with default settings
writer = ImageWriterStage(
    output_dir="/data/curated_output/",
    images_per_tar=1000,
    deterministic_name=True,
)

# Create writer stage with custom shard size and memory optimization
writer_custom = ImageWriterStage(
    output_dir="/data/curated_output_v2/",
    images_per_tar=500,
    verbose=True,
    deterministic_name=True,
    remove_image_data=True,
    name="final_image_writer",
)