Principle:Mlfoundations Open flamingo MMC4 Format Conversion
Overview
Data transformation pipeline that converts the MMC4 interleaved image-text dataset from its native ZIP+JSON format into WebDataset tar shards for efficient streaming during training.
Description
The MMC4 dataset stores interleaved documents as JSON files within ZIP archives, with images stored separately. For efficient streaming during distributed training, this data must be converted to WebDataset tar format, where each sample is self-contained and carries its images as base64-encoded strings. The conversion process reads JSON metadata from the ZIP files, loads the corresponding images from disk, embeds each image in the JSON as a base64 string, and writes batches of samples into tar shards using WebDataset's ShardWriter.
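The embedding step can be illustrated with a minimal sketch that uses only the Python standard library. The field names `image_info`, `image_name`, and `image_base64`, the helper name `embed_images`, and the choice to drop images that are missing on disk are assumptions made for illustration; they are not confirmed by this page.

```python
import base64
import json


def embed_images(document: dict, image_bytes_by_name: dict) -> dict:
    """Return a copy of `document` whose image entries carry base64 payloads.

    `document` is one parsed MMC4-style JSON record (field names assumed).
    `image_bytes_by_name` maps an image file name to its raw bytes.
    """
    sample = dict(document)  # shallow copy so the input record is not mutated
    embedded = []
    for info in document.get("image_info", []):
        raw = image_bytes_by_name.get(info["image_name"])
        if raw is None:
            # Assumption: images that cannot be found are dropped from the sample.
            continue
        entry = dict(info)
        # base64 output is ASCII, so it can be stored as a plain JSON string.
        entry["image_base64"] = base64.b64encode(raw).decode("utf-8")
        embedded.append(entry)
    sample["image_info"] = embedded
    return sample


if __name__ == "__main__":
    doc = {"text_list": ["a caption"], "image_info": [{"image_name": "a.jpg"}]}
    sample = embed_images(doc, {"a.jpg": b"\xff\xd8 fake jpeg bytes"})
    # The result serializes to JSON with no reference to external image files.
    print(json.dumps(sample)[:80])
```

Because the image bytes travel inside the JSON string, a reader needs only `base64.b64decode` to recover them, with no lookup against a separate image directory.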
Usage
Run this conversion once, as a preprocessing step, before training on MMC4 data.
Theoretical Basis
WebDataset tar format enables sequential streaming reads without random access to the filesystem, which is critical for distributed training performance. Base64 encoding images within the JSON payload makes each tar sample self-contained, eliminating the need for separate image files during training. ShardWriter distributes samples across tar files automatically, starting a new shard once the current one reaches a configurable number of samples.
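The sharding and sequential-read behavior described above can be sketched with the standard library's `tarfile` module. This is a stand-in for illustration, not WebDataset's ShardWriter itself; the helper names `write_shards` and `stream_shard`, the `%09d.tar` file pattern, and the `maxcount` parameter are assumptions.

```python
import io
import json
import tarfile


def write_shards(samples, pattern, maxcount):
    """Write (key, dict) samples into tar shards of at most `maxcount` entries.

    `pattern` is a printf-style path such as "shards/%09d.tar".
    Returns the list of shard paths that were written.
    """
    paths = []
    tar = None
    for index, (key, payload) in enumerate(samples):
        if index % maxcount == 0:
            # Current shard is full (or none is open yet): roll over to a new one.
            if tar is not None:
                tar.close()
            path = pattern % len(paths)
            tar = tarfile.open(path, "w")
            paths.append(path)
        data = json.dumps(payload).encode("utf-8")
        member = tarfile.TarInfo(name=f"{key}.json")
        member.size = len(data)
        tar.addfile(member, io.BytesIO(data))
    if tar is not None:
        tar.close()
    return paths


def stream_shard(path):
    """Yield decoded JSON samples in stored order, reading the tar sequentially."""
    # Mode "r|" treats the file as a forward-only stream: no seeking is used.
    with tarfile.open(path, "r|") as tar:
        for member in tar:
            yield json.loads(tar.extractfile(member).read())
```

Reading in `"r|"` mode mirrors the access pattern that makes tar shards suitable for streaming: each sample is consumed in order from a single file handle, so the same loop works over a local file, a pipe, or a network stream.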
Related Pages
Implementation:Mlfoundations_Open_flamingo_Convert_mmc4_to_wds