Principle:Mlfoundations Open flamingo MMC4 Format Conversion

From Leeroopedia


Overview

A data transformation pipeline that converts the MMC4 interleaved image-text dataset from its native ZIP+JSON format into WebDataset tar shards for efficient streaming during training.

Description

The MMC4 dataset stores interleaved image-text documents as JSON files inside ZIP archives, with the images stored separately on disk. For efficient streaming during distributed training, this data must be converted to the WebDataset tar format, in which each sample is self-contained and carries its images as base64-encoded strings. The conversion reads the JSON metadata from the ZIP files, loads the corresponding images from disk, embeds each image in the JSON as a base64 string, and writes batches of samples into tar shards using WebDataset's ShardWriter.
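The steps above can be sketched with the Python standard library alone. This is a minimal illustration and not the project's conversion script: the field names `image_info`, `image_name`, and `image_base64`, the one-document-per-line layout inside the ZIP, the shard file naming, and the helper names are all assumptions made for the example, and plain `tarfile` stands in for WebDataset's ShardWriter.

```python
import base64
import io
import json
import os
import tarfile
import zipfile


def iter_docs(zip_path):
    """Yield one parsed JSON document per non-empty line of every ZIP member."""
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            with zf.open(name) as member:
                for line in member:
                    if line.strip():
                        yield json.loads(line)


def encode_images(doc, image_dir):
    """Return a copy of doc with each referenced image embedded as base64 text."""
    out = dict(doc)
    out["image_info"] = []
    # "image_info" / "image_name" are assumed field names for this sketch.
    for info in doc.get("image_info", []):
        path = os.path.join(image_dir, info["image_name"])
        with open(path, "rb") as f:
            encoded = base64.b64encode(f.read()).decode("utf-8")
        out["image_info"].append({**info, "image_base64": encoded})
    return out


def write_shards(docs, out_dir, samples_per_shard):
    """Write each doc as a <key>.json tar member, rolling over to a new shard
    every samples_per_shard samples. Returns the list of shard paths."""
    shard_paths = []
    tar = None
    for i, doc in enumerate(docs):
        if i % samples_per_shard == 0:
            if tar is not None:
                tar.close()
            path = os.path.join(out_dir, f"shard-{i // samples_per_shard:06d}.tar")
            shard_paths.append(path)
            tar = tarfile.open(path, "w")
        payload = json.dumps(doc).encode("utf-8")
        member = tarfile.TarInfo(name=f"{i:09d}.json")  # hypothetical key scheme
        member.size = len(payload)
        tar.addfile(member, io.BytesIO(payload))
    if tar is not None:
        tar.close()
    return shard_paths
```

A call such as `write_shards((encode_images(d, image_dir) for d in iter_docs(zip_path)), out_dir, samples_per_shard=1000)` would chain the three steps. With the WebDataset library installed, the hand-rolled `write_shards` would typically be replaced by `webdataset.ShardWriter`, whose `maxcount` argument controls the number of samples per shard and whose `write` method accepts a dict with a `__key__` entry.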

Usage

Run this conversion once, as a preprocessing step, before training on MMC4 data.

Theoretical Basis

The WebDataset tar format enables sequential streaming reads with no random access to the filesystem, which is critical for distributed training performance. Base64-encoding the images within the JSON payload makes each tar sample self-contained, eliminating the need for separate image files during training. ShardWriter automatically distributes samples across tar files, with a configurable number of samples per shard.
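To illustrate why a self-contained tar sample suits streaming, the following sketch reads a shard strictly front to back and recovers the image bytes from the JSON payload. It uses only the standard library and is not the training loader itself: the `<key>.json` member naming and the `image_info` / `image_base64` field names are assumptions made for the example.

```python
import base64
import json
import tarfile


def stream_samples(shard_path):
    """Yield (key, document) pairs while reading the tar sequentially."""
    # Mode "r|" makes tarfile treat the input as a non-seekable stream,
    # so members can only be consumed in the order they were written.
    with tarfile.open(shard_path, mode="r|") as tar:
        for member in tar:
            if not member.isfile() or not member.name.endswith(".json"):
                continue
            key = member.name[: -len(".json")]
            doc = json.loads(tar.extractfile(member).read())
            yield key, doc


def decode_images(doc):
    """Return the raw bytes of every image embedded in the document."""
    # "image_info" / "image_base64" are assumed field names for this sketch.
    return [
        base64.b64decode(info["image_base64"])
        for info in doc.get("image_info", [])
    ]
```

Because every image travels inside its sample, the reader never has to open a second file or seek backwards, which is the property that lets shards be consumed from sequential storage or a network stream.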

Related Pages

Implementation:Mlfoundations_Open_flamingo_Convert_mmc4_to_wds
