Implementation:Mlfoundations Open flamingo Convert mmc4 to wds
Overview
Concrete tool, provided by the OpenFlamingo scripts module, for converting MMC4 ZIP archives into WebDataset tar shards that embed images as base64 strings.
Description
The conversion script reads MMC4 ZIP files whose JSON metadata contains text_list and image_info fields, loads the corresponding images from a download directory, encodes each image as base64, and writes the combined samples to tar files using webdataset.ShardWriter. Each output sample contains a JSON payload with text_list (a list of text paragraphs) and image_info (a list of dicts with image_base64 and matched_text_index fields). Input paths support brace expansion (e.g., shard_{0..23098}.zip).
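The payload described above can be sketched with the standard library alone. This is a minimal illustration, not the script's own code: the helper name `build_sample` and its argument layout (a list of `(image_bytes, matched_text_index)` pairs) are assumptions made for this example, and the image bytes are a placeholder rather than a real image.

```python
import base64
import json


def build_sample(text_list, images):
    """Combine paragraphs and raw image bytes into one JSON payload string.

    `images` is a list of (image_bytes, matched_text_index) pairs. This
    helper and its argument layout are hypothetical, not the script's own.
    """
    image_info = [
        {
            # base64 output is ASCII, so it can be stored as a JSON string
            "image_base64": base64.b64encode(data).decode("utf-8"),
            "matched_text_index": idx,
        }
        for data, idx in images
    ]
    return json.dumps({"text_list": text_list, "image_info": image_info})


# Placeholder bytes stand in for the contents of a downloaded image file.
payload = build_sample(
    ["A caption.", "Another paragraph."],
    [(b"\x89PNG fake bytes", 0)],
)
```

Embedding the images as base64 text keeps each sample a single JSON document, at the cost of roughly a one-third size increase over the raw bytes.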
Usage
Run as a command-line script before training.
Code Reference
Source: Repository https://github.com/mlfoundations/open_flamingo, File: open_flamingo/scripts/convert_mmc4_to_wds.py Lines L1-85
CLI usage:
# Command-line invocation
python open_flamingo/scripts/convert_mmc4_to_wds.py \
    --output_dir /path/to/output \
    --zip_files "/path/to/mmc4/shard_{0..23098}.zip" \
    --image_dir /path/to/downloaded/images \
    --num_files_per_shard 1000
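Because the `--zip_files` pattern is quoted, the shell passes it through unchanged and the brace range is presumably expanded by the script itself. The sketch below illustrates what such an expansion yields for a numeric range. `expand_numeric_range` is a hypothetical stand-in written for this page, not the routine the script uses, and it handles only a single ascending `{start..end}` range.

```python
import re


def expand_numeric_range(pattern):
    """Expand one `{start..end}` numeric range in `pattern` into a list of paths.

    Hypothetical helper for illustration; patterns without a range are
    returned unchanged as a single-element list.
    """
    match = re.search(r"\{(\d+)\.\.(\d+)\}", pattern)
    if match is None:
        return [pattern]
    start, end = int(match.group(1)), int(match.group(2))
    prefix, suffix = pattern[: match.start()], pattern[match.end():]
    # Both endpoints are inclusive, as in shell brace expansion.
    return [f"{prefix}{i}{suffix}" for i in range(start, end + 1)]


paths = expand_numeric_range("/path/to/mmc4/shard_{0..3}.zip")
```

Since both endpoints are inclusive, the pattern `shard_{0..23098}.zip` from the command above names 23,099 ZIP files.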
Key internal API:
# Uses webdataset ShardWriter internally
writer = wds.ShardWriter(
    pattern=os.path.join(args.output_dir, "%09d.tar"),
    maxcount=args.num_files_per_shard,
)
writer.write({"__key__": str(uuid4()), "json": json_str})
Import: Not applicable; the script is run directly via python open_flamingo/scripts/convert_mmc4_to_wds.py
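To make the resulting shard layout concrete, the sketch below mimics the count-based rollover with the standard `tarfile` module. It is a stand-in for `wds.ShardWriter`, not a reimplementation: `write_shards` is a hypothetical helper, it builds the tars in memory instead of on disk, and it ignores any size-based rollover. It assumes the WebDataset convention that a sample with `__key__` and a `json` field is stored as a tar member named `<key>.json`.

```python
import io
import json
import tarfile
from uuid import uuid4


def write_shards(samples, maxcount):
    """Pack JSON strings into in-memory tars, one shard per `maxcount` samples.

    Hypothetical stand-in for ShardWriter; returns {shard_name: tar_bytes}.
    """
    shards = {}
    for start in range(0, len(samples), maxcount):
        # Same naming pattern as the excerpt above: zero-padded shard index.
        name = "%09d.tar" % (start // maxcount)
        buf = io.BytesIO()
        with tarfile.open(fileobj=buf, mode="w") as tar:
            for json_str in samples[start:start + maxcount]:
                data = json_str.encode("utf-8")
                # One member per sample: "<uuid key>.json"
                info = tarfile.TarInfo(name=f"{uuid4()}.json")
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))
        shards[name] = buf.getvalue()
    return shards


samples = [json.dumps({"text_list": [str(i)], "image_info": []}) for i in range(5)]
shards = write_shards(samples, maxcount=2)
```

With five samples and `maxcount=2`, this produces three shards holding two, two, and one sample respectively.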
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| --output_dir | str | Yes | Directory for output tar shards |
| --zip_files | str | Yes | Glob/brace pattern for MMC4 ZIPs |
| --image_dir | str | Yes | Directory with downloaded images |
| --num_files_per_shard | int | No | Samples per tar (default 1000) |
Outputs
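WebDataset tar files at {output_dir}/%09d.tar. Each sample has the structure below; the json field is written as a serialized JSON string and is shown expanded here for readability: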
{
  "__key__": "uuid",
  "json": {
    "text_list": ["..."],
    "image_info": [
      {
        "image_base64": "...",
        "matched_text_index": 0
      }
    ]
  }
}
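A consumer of these shards has to reverse the encoding. The following sketch decodes one sample's JSON string back into raw image bytes paired with the matched paragraph. `decode_images` is a hypothetical helper, not part of OpenFlamingo, and it assumes `matched_text_index` is a position in `text_list`.

```python
import base64
import json


def decode_images(json_str):
    """Return (image_bytes, matched_text) pairs from one sample's JSON payload.

    Hypothetical helper; assumes matched_text_index indexes into text_list.
    """
    sample = json.loads(json_str)
    pairs = []
    for info in sample["image_info"]:
        image_bytes = base64.b64decode(info["image_base64"])
        matched_text = sample["text_list"][info["matched_text_index"]]
        pairs.append((image_bytes, matched_text))
    return pairs


# Build a small payload in the documented layout, then decode it again.
example = json.dumps({
    "text_list": ["first paragraph", "second paragraph"],
    "image_info": [
        {
            "image_base64": base64.b64encode(b"raw image bytes").decode("utf-8"),
            "matched_text_index": 1,
        }
    ],
})
pairs = decode_images(example)
```

The decoded bytes can then be handed to any image library that accepts an in-memory buffer.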
Usage Examples
Full conversion command for processing the entire MMC4 dataset:
python open_flamingo/scripts/convert_mmc4_to_wds.py \
    --output_dir /data/mmc4_wds/ \
    --zip_files "/data/mmc4_raw/shard_{0..23098}.zip" \
    --image_dir /data/mmc4_images/ \
    --num_files_per_shard 1000
Related Pages
Principle:Mlfoundations_Open_flamingo_MMC4_Format_Conversion