Workflow:Mlfoundations Open flamingo Data Preparation
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Vision_Language_Models, ETL |
| Last Updated | 2026-02-08 03:30 GMT |
Overview
End-to-end process for preparing and converting training datasets (LAION image-text pairs and MMC4 interleaved documents) into WebDataset tar shard format for efficient distributed training of OpenFlamingo models.
Description
This workflow covers the data preparation pipeline required before training OpenFlamingo models. The training pipeline consumes two types of data in WebDataset format: LAION-style single image-text pairs and MMC4-style interleaved multi-image documents. LAION data is typically already available in WebDataset format, while MMC4 data must be converted from its original ZIP+JSON+images format into WebDataset tars with base64-encoded images. The workflow also covers the data loading pipeline that applies image preprocessing, text tokenization with Flamingo special tokens, and image-text alignment filtering during training.
Usage
Execute this workflow when you are setting up training data for the first time or need to convert new MMC4 shards for OpenFlamingo training. You need the raw MMC4 dataset (ZIP archives of JSON documents plus downloaded images) and optionally LAION shards. The conversion produces WebDataset tar files that can be read efficiently by the training pipeline.
Execution Steps
Step 1: Obtain Raw Datasets
Download the source datasets needed for OpenFlamingo training. LAION image-text data is typically available as pre-built WebDataset shards from the LAION project. MMC4 (Multimodal C4) data consists of ZIP archives containing JSON documents that describe interleaved image-text sequences, plus separately downloaded images organized by shard index.
Key considerations:
- LAION shards should contain image files (jpg/png) and text files per sample
- MMC4 ZIP archives contain JSON with text and image metadata (image_info with image_name fields)
- MMC4 images must be downloaded separately and organized by shard index
- Both datasets can be stored on S3; the training script accepts shard paths given as pipe commands of the form pipe:aws s3 cp
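As a rough illustration of the S3 option, the sketch below builds a shard specification string that streams shards through `aws s3 cp`. The helper name `s3_shard_url`, the bucket, the prefix, and the `shard-` file naming are all hypothetical; the trailing `-` tells the AWS CLI to write the object to stdout, which is the usual way a WebDataset `pipe:` URL is fed.

```python
def s3_shard_url(bucket: str, prefix: str, first: int, last: int, width: int = 4) -> str:
    """Build a brace-expandable pipe: URL for shards stored on S3 (hypothetical layout)."""
    lo = str(first).zfill(width)
    hi = str(last).zfill(width)
    # Brace range such as shard-{0000..0999}.tar, expanded later by the data loader
    pattern = "shard-{" + lo + ".." + hi + "}.tar"
    # "-" makes `aws s3 cp` stream the object to stdout
    return f"pipe:aws s3 cp s3://{bucket}/{prefix}/{pattern} -"


if __name__ == "__main__":
    print(s3_shard_url("my-bucket", "laion", 0, 999))
```

The resulting string would be passed wherever the training configuration expects a shard path; the exact argument name depends on the training script.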
Step 2: Convert MMC4 To WebDataset
Run the convert_mmc4_to_wds.py script to transform MMC4 ZIP+JSON+images into WebDataset tar shards. For each ZIP shard, the script extracts the JSON documents, loads the corresponding images from disk, and encodes them as base64 strings. It then writes each JSON document, with its base64 images embedded, into WebDataset tar files using wds.ShardWriter.
Key considerations:
- The script takes --zip_files (brace-expandable glob), --image_dir, and --output_dir arguments
- Each output tar shard contains at most num_files_per_shard samples (default 1000)
- Images that were not successfully downloaded (for example, because the URL returned 404) are reported with a warning and do not abort the conversion
- The base64 encoding embeds images directly in the JSON, avoiding external file references
- Each sample gets a unique UUID key for WebDataset compatibility
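The conversion logic described above can be sketched with the standard library alone. This is not the convert_mmc4_to_wds.py script itself: it uses `tarfile` as a stand-in for wds.ShardWriter, embeds the raw file bytes without re-encoding the image, and writes plain `.json` members. The helper names and the `image_base64` field name are assumptions made for illustration.

```python
import base64
import io
import json
import tarfile
import uuid
from pathlib import Path


def embed_images(doc: dict, image_dir: Path) -> dict:
    """Attach base64-encoded bytes to every image_info entry found on disk."""
    for entry in doc.get("image_info", []):
        path = image_dir / entry["image_name"]
        if not path.exists():
            # The image was never downloaded (e.g. a 404): warn and move on.
            print(f"warning: missing image {path}")
            continue
        # Assumed field name for the embedded image
        entry["image_base64"] = base64.b64encode(path.read_bytes()).decode("utf-8")
    return doc


def write_shard(docs: list, image_dir: Path, tar_path: Path) -> list:
    """Write one JSON member per document into a tar file and return the sample keys."""
    keys = []
    with tarfile.open(tar_path, "w") as tar:
        for doc in docs:
            key = uuid.uuid4().hex  # unique key per sample
            payload = json.dumps(embed_images(doc, image_dir)).encode("utf-8")
            member = tarfile.TarInfo(name=f"{key}.json")
            member.size = len(payload)
            tar.addfile(member, io.BytesIO(payload))
            keys.append(key)
    return keys
```

Because each document carries its own images, a reader of the shard needs only the tar file, at the cost of roughly a one-third size increase from base64 encoding.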
Step 3: Validate LAION Shards
Verify that LAION WebDataset shards are properly formatted for the training pipeline. Each shard should be a tar archive in which every sample contains an image file (jpg, png, or jpeg) and a text file (txt) holding the caption. Samples missing either an image or a caption are filtered out during training by the filter_no_caption_or_no_image function.
Key considerations:
- The training pipeline expects standard WebDataset tar format
- Samples must have both an image and a txt file
- The brace expansion pattern (e.g., shard-{0000..0999}.tar) must match the actual shard files
- The number of shards should be consistent with the samples-per-epoch setting in the training configuration
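A quick offline check along these lines can catch malformed shards before training starts. The sketch below is an illustrative helper, not part of the training code: it groups tar members by sample key (the part of the base name before the first dot, following the WebDataset naming convention) and reports samples lacking an image or a caption. The function name `incomplete_samples` is hypothetical.

```python
import io
import os
import tarfile
from collections import defaultdict

IMAGE_EXTS = {"jpg", "jpeg", "png"}


def incomplete_samples(tar_path) -> list:
    """Return the keys of samples that lack an image or a .txt caption."""
    exts_by_key = defaultdict(set)
    with tarfile.open(tar_path) as tar:
        for member in tar:
            if not member.isfile():
                continue
            # Split the base name at its first dot: "000.jpg" -> key "000", ext "jpg"
            dirname, base = os.path.split(member.name)
            stem, _, ext = base.partition(".")
            exts_by_key[os.path.join(dirname, stem)].add(ext.lower())
    return sorted(
        key
        for key, exts in exts_by_key.items()
        if "txt" not in exts or not (exts & IMAGE_EXTS)
    )
```

An empty result means every sample in the shard has both parts; any returned key identifies a sample that the training-time filter would drop.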
Step 4: Configure Data Loading Pipeline
Set up the data loading parameters that control how the WebDataset pipelines process data during training. The LAION pipeline applies image preprocessing (CLIP processor + random horizontal flip) and text tokenization (captions wrapped in Flamingo special tokens, truncated to 32 tokens). The MMC4 pipeline decodes base64 images, applies similarity-based filtering to select relevant images per document, and constructs interleaved sequences respecting configurable minimum and maximum image counts.
Key considerations:
- LAION text format: "<image>{caption}<|endofchunk|>{eos_token}", max 32 tokens
- MMC4 uses an image-text similarity threshold (mmc4_textsim_threshold) for filtering
- MMC4 image-text alignment uses the Hungarian algorithm (scipy linear_sum_assignment)
- Max images per MMC4 sequence is configurable (mmc4_max_num_images, default 6)
- Shard shuffling uses deterministic seeds for reproducibility across workers and epochs
- Data loading supports resampled mode for infinite iteration
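To make the alignment and filtering steps concrete, here is a minimal sketch of one-to-one image-to-sentence matching followed by threshold filtering. It is not the pipeline's code: it solves the assignment by brute force over permutations as a stand-in for scipy's linear_sum_assignment, which finds the same optimum efficiently, and it assumes there are no more images than sentences. The function name, the argument names, and the example similarity values are hypothetical.

```python
from itertools import permutations


def match_images_to_sentences(sim_matrix, threshold, max_images):
    """Return (image_idx, sentence_idx) pairs from the best one-to-one assignment.

    sim_matrix[i][j] is the similarity between image i and sentence j.
    """
    n_images, n_sentences = len(sim_matrix), len(sim_matrix[0])
    assert n_images <= n_sentences, "sketch assumes no more images than sentences"
    # Try every injective image -> sentence mapping; keep the highest total similarity.
    best = max(
        permutations(range(n_sentences), n_images),
        key=lambda perm: sum(sim_matrix[i][j] for i, j in enumerate(perm)),
    )
    # Drop matches below the similarity threshold, then cap the image count.
    pairs = [(i, j) for i, j in enumerate(best) if sim_matrix[i][j] >= threshold]
    return pairs[:max_images]


if __name__ == "__main__":
    sims = [
        [0.9, 0.1, 0.3],   # image 0 vs. sentences 0..2
        [0.8, 0.2, 0.25],  # image 1 vs. sentences 0..2
    ]
    print(match_images_to_sentences(sims, threshold=0.2, max_images=6))
```

In this example both images score highest against sentence 0, so a per-image argmax would place them at the same position. The one-to-one assignment instead pairs image 0 with sentence 0 and image 1 with sentence 2, and raising the threshold to 0.3 would then discard the weaker second match.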