Principle:NVIDIA NeMo Curator Image Export
| Metadata | |
|---|---|
| Knowledge Sources | N/A |
| Domains | Data_Curation, Image_Processing, Data_Engineering |
| Last Updated | 2026-02-14 |
Overview
Image Export is a technique for persisting curated images to WebDataset tar archives with corresponding Parquet metadata files, enabling efficient storage and retrieval of processed image datasets.
Description
Image Export in NeMo Curator packs curated images into WebDataset tar archives with a configurable number of images per shard. Alongside the tar archives, the stage writes Parquet metadata files that capture per-image information such as identifiers, embedding vectors, quality scores, and other attributes accumulated during the curation pipeline. This dual-format output combines the sequential I/O efficiency of tar archives for image data with the analytical query capabilities of Parquet for metadata, providing a versatile output format suitable for both model training data loaders and data analysis workflows.
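The packing step described above can be illustrated with a minimal sketch that uses only the Python standard library. It is not NeMo Curator's implementation: the function name `write_shards`, the `images-NNNNNN.tar` naming pattern, and the metadata fields are assumptions made for illustration.

```python
import io
import tarfile
from pathlib import Path


def write_shards(images, out_dir, images_per_tar=2):
    """Pack (image_id, image_bytes) pairs into WebDataset-style tar shards.

    Hypothetical helper: returns one metadata row per image, recording
    which shard holds it.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    rows = []
    for start in range(0, len(images), images_per_tar):
        # Assumed naming scheme: zero-padded shard index.
        shard_name = f"images-{start // images_per_tar:06d}.tar"
        with tarfile.open(str(out_dir / shard_name), "w") as tar:
            for image_id, payload in images[start:start + images_per_tar]:
                # Each image becomes one sequential tar entry.
                info = tarfile.TarInfo(name=f"{image_id}.jpg")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
                rows.append({"image_id": image_id, "shard": shard_name})
    return rows
```

The returned rows stand in for the per-image metadata. In a real pipeline they would carry the embeddings and quality scores as well, and could be written to a Parquet file next to each shard with a library such as pyarrow.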
Usage
Use Image Export as the final stage in an image curation pipeline to write the curated dataset to persistent storage. Apply it only after all filtering, deduplication, and quality assessment stages have completed. Choose the shard size (images per tar) to match the expected downstream consumption pattern: larger shards are more efficient for sequential training data loading, while smaller shards produce more archives and therefore finer-grained parallelism for distributed training.
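The shard-size trade-off comes down to simple arithmetic, sketched below. The helper `shard_count` and the example numbers are illustrative assumptions, not part of the NeMo Curator API.

```python
def shard_count(num_images, images_per_tar):
    """Number of tar shards needed to hold num_images (ceiling division)."""
    if images_per_tar <= 0:
        raise ValueError("images_per_tar must be positive")
    return (num_images + images_per_tar - 1) // images_per_tar


# Same dataset, two shard sizes (hypothetical numbers).
small_shards = shard_count(10_000, 100)    # 100 shards
large_shards = shard_count(10_000, 1_000)  # 10 shards
```

If each reader process consumes whole shards, the second setting can keep at most 10 readers busy at once, while the first can keep up to 100 busy at the cost of more, smaller files.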
Theoretical Basis
Image Export is built on two complementary storage formats, each optimized for a different access pattern. The WebDataset tar format stores images as sequential entries in tar archives, which suits streaming I/O and gives high throughput for sequential data loading during model training. Reading many images from one archive avoids a separate file system operation per image and makes prefetching and buffering straightforward. Parquet is used for metadata because its compressed columnar layout supports fast analytical queries over image metadata without touching the image data itself. Together, tar for image data and Parquet for metadata give a complete representation of a curated image dataset. Deterministic naming of output shards supports reproducibility and simplifies downstream data management.
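One possible way to obtain deterministic shard names is to derive them from the shard's contents, as in the sketch below. This scheme is an assumption for illustration only; the source does not say how NeMo Curator names its shards, and `shard_basename` is a hypothetical helper.

```python
import hashlib


def shard_basename(image_ids, prefix="images"):
    """Derive a stable base name from the set of image IDs in a shard.

    Sorting first makes the name independent of processing order.
    """
    joined = "\n".join(sorted(image_ids))
    digest = hashlib.sha256(joined.encode("utf-8")).hexdigest()[:16]
    return f"{prefix}-{digest}"


base = shard_basename(["img_b", "img_a"])
tar_path = f"{base}.tar"          # image archive
parquet_path = f"{base}.parquet"  # matching metadata file
```

Because the tar archive and its Parquet file share one base name, re-running the export on the same images yields the same file names, and each metadata file can be paired with its archive without a separate index.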