Environment:Mlfoundations Open flamingo WebDataset Training Dependencies

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, Data_Loading
Last Updated: 2026-02-08 03:30 GMT

Overview

Training-specific dependencies including WebDataset for tar-based streaming data loading, braceexpand for shard path expansion, scipy for Hungarian matching, and wandb for experiment logging.

Description

This environment extends the base OpenFlamingo dependencies with training-specific packages. WebDataset provides high-performance streaming data loading from tar archives, essential for LAION and MMC4 datasets that contain millions of image-text pairs. Braceexpand handles shell-style brace expansion for shard paths (e.g., `shard-{0000..0999}.tar`). SciPy provides the Hungarian algorithm (`linear_sum_assignment`) used for optimal image-sentence matching in MMC4 data. Weights & Biases (wandb) handles experiment tracking and optional checkpoint uploading.
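To illustrate what a brace pattern such as `shard-{0000..0999}.tar` expands to, here is a minimal standard-library sketch that handles only a single numeric range. It is a simplified stand-in, not the `braceexpand` package itself (which also handles comma lists and nesting), and the helper name `expand_numeric_range` is hypothetical.

```python
import re


def expand_numeric_range(pattern: str) -> list[str]:
    """Expand one `{start..end}` numeric range in a shard pattern.

    Hypothetical helper for illustration only; the real braceexpand
    package supports far more syntax than this.
    """
    match = re.search(r"\{(\d+)\.\.(\d+)\}", pattern)
    if match is None:
        return [pattern]  # nothing to expand
    start, end = match.group(1), match.group(2)
    # Keep zero padding only when one of the bounds is written with a leading zero.
    padded = any(len(s) > 1 and s[0] == "0" for s in (start, end))
    width = max(len(start), len(end)) if padded else 0
    prefix, suffix = pattern[: match.start()], pattern[match.end():]
    return [
        f"{prefix}{str(i).zfill(width)}{suffix}"
        for i in range(int(start), int(end) + 1)
    ]


shards = expand_numeric_range("shard-{0000..0999}.tar")
print(len(shards), shards[0], shards[-1])
```

With the package installed, `list(braceexpand.braceexpand("shard-{0000..0999}.tar"))` is the call to use instead. Counting the expanded list is a quick way to see how many shards a pattern actually refers to.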

Usage

Use this environment for all Distributed Training and Data Preparation workflows. It is required for loading LAION and MMC4 datasets via WebDataset pipelines, converting MMC4 archives to WebDataset format, and logging training metrics.

System Requirements

Category | Requirement | Notes
Disk | High IOPS storage | WebDataset reads from tar files; SSD recommended for training
Network | S3 access (optional) | LAION and MMC4 shards can be streamed from S3 via `pipe:aws s3 cp`; requires the AWS CLI

Dependencies

Python Packages

  • `webdataset`
  • `braceexpand`
  • `torchvision`
  • `scipy` (for `linear_sum_assignment` in MMC4 preprocessing)
  • `wandb`
  • `tqdm`

Credentials

The following environment variables may be needed:

  • `WANDB_API_KEY`: Weights & Biases API key for experiment logging (when `--report_to_wandb` is set)
  • `AWS_ACCESS_KEY_ID`: For S3-hosted LAION or MMC4 shards (when shard paths start with `s3`)
  • `AWS_SECRET_ACCESS_KEY`: For S3-hosted LAION or MMC4 shards
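A small pre-launch check along these lines can report which of the variables above are unset for a given configuration. This is a sketch under assumptions: the function name `missing_credentials` and its parameters are hypothetical and not part of OpenFlamingo. A missing variable is not necessarily an error, since both wandb and the AWS CLI can also read credentials from other places such as local config files.

```python
import os


def missing_credentials(report_to_wandb, shard_paths, env=None):
    """Return the names of credential variables that look required but are unset.

    Hypothetical helper: mirrors the conditions listed above, nothing more.
    """
    env = os.environ if env is None else env
    required = []
    if report_to_wandb:
        required.append("WANDB_API_KEY")
    # Shard paths starting with "s3" are streamed through the AWS CLI.
    if any(path.startswith("s3") for path in shard_paths):
        required += ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"]
    return [name for name in required if not env.get(name)]


# Example with an explicit, empty environment mapping.
print(missing_credentials(True, ["s3://my-bucket/shard-{0000..0999}.tar"], env={}))
```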

Quick Install

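# Install training extras via setup.py
pip install -e ".[training]"
# The TRAINING list shown under Code Evidence does not include scipy; add it separately
pip install scipy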

# Or install manually
pip install webdataset braceexpand torchvision scipy wandb tqdm

# Or from requirements file
pip install -r requirements-training.txt

Code Evidence

Training extras from `setup.py:29-35`:

TRAINING = [
    "wandb",
    "torchvision",
    "braceexpand",
    "webdataset",
    "tqdm",
]

S3 shard streaming from `open_flamingo/train/train.py:222-226`:

if args.laion_shards.startswith("s3"):
    args.laion_shards = f"pipe:aws s3 cp {args.laion_shards} -"

if args.mmc4_shards.startswith("s3"):
    args.mmc4_shards = f"pipe:aws s3 cp {args.mmc4_shards} -"

Hungarian algorithm import from `open_flamingo/train/data.py:17`:

from scipy.optimize import linear_sum_assignment
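For readers unfamiliar with the assignment problem, the following standard-library sketch brute-forces the same objective on a tiny, made-up similarity matrix: pick one distinct sentence per image so that the total similarity is as large as possible. This is not how the repository computes it; `linear_sum_assignment` solves the problem efficiently, whereas this version enumerates every option and is only usable for toy sizes. The function name `best_assignment` and the matrix values are hypothetical.

```python
from itertools import permutations


def best_assignment(similarity):
    """Return, for each row (image), the column (sentence) index that
    maximizes total similarity, with no column used twice.

    Brute force over all orderings; assumes rows <= columns.
    """
    n_rows, n_cols = len(similarity), len(similarity[0])
    return max(
        permutations(range(n_cols), n_rows),
        key=lambda cols: sum(similarity[r][c] for r, c in enumerate(cols)),
    )


# Rows are images, columns are sentences (made-up scores).
sim = [
    [0.90, 0.80, 0.10],
    [0.85, 0.20, 0.30],
]
print(best_assignment(sim))  # (1, 0)
```

Picking the best sentence for each image in turn would give image 0 its top score (sentence 0) and leave image 1 with a poor match. The optimal assignment instead pairs image 0 with sentence 1 and image 1 with sentence 0, for a higher total. With SciPy, `linear_sum_assignment(sim, maximize=True)` returns the matching row and column index arrays for the same objective.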

WebDataset pipeline construction from `open_flamingo/train/data.py:321-338`:

pipeline.extend([
    tarfile_to_samples_nothrow,
    wds.shuffle(
        bufsize=_SAMPLE_SHUFFLE_SIZE,
        initial=_SAMPLE_SHUFFLE_INITIAL,
    ),
])
pipeline.extend([
    wds.to_tuple("json", handler=log_and_continue),
    wds.map(preprocess_fn, handler=log_and_continue),
    wds.batched(args.batch_size_mmc4, partial=False),
])
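The last stage of the excerpt above passes `partial=False` to `wds.batched`, which means a trailing batch with fewer samples than the batch size is not emitted. The behavior can be sketched with a plain generator; this is a standard-library illustration of the idea, not WebDataset's implementation, and `batched_drop_partial` is a hypothetical name.

```python
def batched_drop_partial(samples, batch_size):
    """Yield lists of exactly `batch_size` samples; drop any incomplete tail."""
    batch = []
    for sample in samples:
        batch.append(sample)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # Anything left in `batch` here is smaller than batch_size and is discarded.


batches = list(batched_drop_partial(range(10), 4))
print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Because incomplete batches are dropped, the number of batches is the sample count floor-divided by the batch size, which matches the `//` used in the epoch-length check listed under Common Errors.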

Common Errors

Error Message | Cause | Solution
`number of shards must be >= total workers` | Too few WebDataset shards for the number of workers | Increase the shard count, or reduce `--workers` or the world size so that their product does not exceed the shard count
`number of samples per epoch must be equal for mmc4 and laion` | Mismatched `train_num_samples` / `batch_size` ratios | Ensure `train_num_samples_laion // batch_size_laion == train_num_samples_mmc4 // batch_size_mmc4`
`save_checkpoints_to_wandb requires report_to_wandb` | Trying to save checkpoints to wandb without enabling wandb | Add the `--report_to_wandb` flag
`RuntimeError: Currently, number of dataset samples must be specified` | Missing `--train_num_samples_*` args | Pass `--train_num_samples_mmc4` and `--train_num_samples_laion`
Compatibility Notes

  • S3 streaming: LAION and MMC4 shard paths that start with `s3` are rewritten to `pipe:aws s3 cp <path> -`, so shards can be read directly from S3. Requires the AWS CLI and credentials.
  • Braceexpand: Shard paths use shell-style brace expansion (e.g., `shard-{0000..0999}.tar`).
  • MMC4 conversion: The `convert_mmc4_to_wds.py` script requires raw MMC4 ZIP archives and pre-downloaded images.
