Implementation:Mlfoundations Open flamingo Convert mmc4 to wds

From Leeroopedia


Overview

A concrete tool, provided by the OpenFlamingo scripts module, for converting MMC4 ZIP archives into WebDataset tar shards with base64-encoded images.

Description

The conversion script reads MMC4 ZIP files whose JSON metadata contains text_list and image_info fields, loads the corresponding images from a download directory, encodes each image as base64, and writes the combined samples to tar files using webdataset.ShardWriter. Each output sample contains a JSON payload with text_list (a list of text paragraphs) and image_info (a list of dicts with image_base64 and matched_text_index fields). The script supports brace expansion for input paths (e.g., shard_{0..23098}.zip).
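The sample layout described above can be sketched with the standard library alone. This is a minimal illustration, not the script's actual code: the helper build_sample, its signature, and the placeholder image bytes are assumptions made for the example.

```python
import base64
import json


def build_sample(text_list, images):
    """Combine text paragraphs and raw image bytes into one MMC4-style sample.

    `images` is a list of (image_bytes, matched_text_index) pairs. This helper
    and its signature are hypothetical, not part of the conversion script.
    """
    image_info = []
    for image_bytes, matched_text_index in images:
        image_info.append(
            {
                # b64encode returns ASCII bytes; decode so the value is JSON-serializable
                "image_base64": base64.b64encode(image_bytes).decode("utf-8"),
                "matched_text_index": matched_text_index,
            }
        )
    return {"text_list": text_list, "image_info": image_info}


# Placeholder bytes stand in for a real image file read from the download directory.
sample = build_sample(
    ["A photo of a cat.", "Unrelated paragraph."],
    [(b"\xff\xd8\xff\xe0 fake jpeg bytes", 0)],
)
json_str = json.dumps(sample)
```

Storing images as base64 text keeps the whole document, text and images together, inside a single JSON payload, at the cost of roughly one third more bytes than the raw image data.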

Usage

Run the script from the command line before training to produce the tar shards.

Code Reference

Source: Repository https://github.com/mlfoundations/open_flamingo, File: open_flamingo/scripts/convert_mmc4_to_wds.py Lines L1-85

CLI usage:

# Command-line invocation
python open_flamingo/scripts/convert_mmc4_to_wds.py \
    --output_dir /path/to/output \
    --zip_files "/path/to/mmc4/shard_{0..23098}.zip" \
    --image_dir /path/to/downloaded/images \
    --num_files_per_shard 1000

Key internal API:

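# Uses webdataset ShardWriter internally (simplified excerpt; args and json_str come from the script)
import os
from uuid import uuid4
import webdataset as wds

writer = wds.ShardWriter(
    pattern=os.path.join(args.output_dir, "%09d.tar"),
    maxcount=args.num_files_per_shard,  # start a new tar after this many samples
)
# Every sample needs a unique "__key__"; "json" holds the serialized document
writer.write({"__key__": str(uuid4()), "json": json_str})
writer.close()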
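To make the shard naming concrete, the sketch below lists the file names that the "%09d.tar" pattern yields for a given number of samples. It assumes shards roll over on sample count only, as configured by maxcount in the snippet above; ShardWriter also has a size-based limit that may start a new shard earlier. The helper shard_paths is hypothetical.

```python
import math
import os


def shard_paths(output_dir, num_samples, num_files_per_shard=1000):
    """List the tar paths a "%09d.tar" pattern would produce for num_samples.

    Hypothetical helper: assumes a new shard starts only when the sample
    count reaches num_files_per_shard, and that numbering starts at 0.
    """
    pattern = os.path.join(output_dir, "%09d.tar")
    num_shards = math.ceil(num_samples / num_files_per_shard)
    return [pattern % i for i in range(num_shards)]


paths = shard_paths("/data/mmc4_wds", num_samples=2500)
for path in paths:
    print(os.path.basename(path))
# prints 000000000.tar, 000000001.tar, 000000002.tar (one per line)
```

With 2500 samples and 1000 samples per shard, the last shard holds the remaining 500 samples.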

Import: None. The script is run directly via python open_flamingo/scripts/convert_mmc4_to_wds.py

I/O Contract

Inputs

Name | Type | Required | Description
--output_dir | str | Yes | Directory for output tar shards
--zip_files | str | Yes | Brace-expansion pattern for MMC4 ZIPs (e.g., shard_{0..23098}.zip)
--image_dir | str | Yes | Directory with downloaded images
--num_files_per_shard | int | No | Samples per tar (default 1000)
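As an illustration of what a brace pattern passed to --zip_files stands for, here is a stdlib-only sketch that expands a single numeric range. The helper expand_numeric_braces is hypothetical and deliberately limited (one {start..end} range, no nesting or comma lists); the script's own expansion mechanism is not shown here.

```python
import re


def expand_numeric_braces(pattern):
    """Expand a single {start..end} numeric range in `pattern`.

    Hypothetical stdlib-only helper; handles one range only.
    """
    match = re.search(r"\{(\d+)\.\.(\d+)\}", pattern)
    if match is None:
        return [pattern]  # nothing to expand
    start, end = int(match.group(1)), int(match.group(2))
    prefix, suffix = pattern[: match.start()], pattern[match.end():]
    # The range is inclusive on both ends, as in shell brace expansion
    return [f"{prefix}{i}{suffix}" for i in range(start, end + 1)]


zip_paths = expand_numeric_braces("/path/to/mmc4/shard_{0..3}.zip")
print(zip_paths)
```

On the command line, keep the pattern in quotes so that the shell passes it through as one string rather than expanding the braces into thousands of separate arguments.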

Outputs

WebDataset tar files written to {output_dir}/%09d.tar, where %09d is the zero-padded, nine-digit shard index. Each sample contains:

{
    "__key__": "uuid",
    "json": {
        "text_list": ["..."],
        "image_info": [
            {
                "image_base64": "...",
                "matched_text_index": 0
            }
        ]
    }
}
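WebDataset stores each sample as tar members that share the sample key as their base name, with the field name as the extension, so the json field above ends up in a member named like <key>.json. The sketch below reads such members back with the standard tarfile module and decodes the base64 image. It builds a tiny in-memory tar with made-up data so that it is self-contained; the helper iter_samples and the key sample-0001 are hypothetical.

```python
import base64
import io
import json
import tarfile


def iter_samples(tar_bytes):
    """Yield (key, payload) pairs from the ".json" members of a tar archive."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r") as tar:
        for member in tar:
            if member.isfile() and member.name.endswith(".json"):
                key = member.name[: -len(".json")]
                payload = json.loads(tar.extractfile(member).read())
                yield key, payload


# Build a one-sample tar in memory; the contents are placeholders.
payload = {
    "text_list": ["A photo of a cat."],
    "image_info": [
        {
            "image_base64": base64.b64encode(b"fake image bytes").decode("utf-8"),
            "matched_text_index": 0,
        }
    ],
}
data = json.dumps(payload).encode("utf-8")
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as tar:
    info = tarfile.TarInfo(name="sample-0001.json")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

# Read the sample back and recover the raw image bytes.
samples = list(iter_samples(buffer.getvalue()))
key, decoded = samples[0]
image_bytes = base64.b64decode(decoded["image_info"][0]["image_base64"])
```

In a training pipeline the same decoding step would typically run inside the data loader, with matched_text_index used to place each image next to its paragraph in text_list.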

Usage Examples

Full conversion command for processing the entire MMC4 dataset:

python open_flamingo/scripts/convert_mmc4_to_wds.py \
    --output_dir /data/mmc4_wds/ \
    --zip_files "/data/mmc4_raw/shard_{0..23098}.zip" \
    --image_dir /data/mmc4_images/ \
    --num_files_per_shard 1000

Related Pages

Principle:Mlfoundations_Open_flamingo_MMC4_Format_Conversion
