Environment:Mlfoundations Open flamingo WebDataset Training Dependencies
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Data_Loading |
| Last Updated | 2026-02-08 03:30 GMT |
Overview
Training-specific dependencies including WebDataset for tar-based streaming data loading, braceexpand for shard path expansion, scipy for Hungarian matching, and wandb for experiment logging.
Description
This environment extends the base OpenFlamingo dependencies with training-specific packages. WebDataset provides high-performance streaming data loading from tar archives, essential for LAION and MMC4 datasets that contain millions of image-text pairs. Braceexpand handles shell-style brace expansion for shard paths (e.g., `shard-{0000..0999}.tar`). SciPy provides the Hungarian algorithm (`linear_sum_assignment`) used for optimal image-sentence matching in MMC4 data. Weights & Biases (wandb) handles experiment tracking and optional checkpoint uploading.
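To make the shard-path expansion concrete, here is a minimal standard-library sketch of what a numeric brace range expands to. It is not the `braceexpand` package's implementation: `expand_numeric_range` is a hypothetical helper that handles only a single `{start..end}` numeric range, the form used in the example above.

```python
import re
from typing import List

# Hypothetical helper: expands ONE numeric range such as
# "shard-{0000..0999}.tar". The real braceexpand package supports more
# (comma lists, nesting, several ranges); this sketch does not.
_RANGE = re.compile(r"\{(\d+)\.\.(\d+)\}")


def expand_numeric_range(pattern: str) -> List[str]:
    match = _RANGE.search(pattern)
    if match is None:
        return [pattern]  # nothing to expand
    start_text, end_text = match.group(1), match.group(2)
    # Zero-pad only when an endpoint is written with a leading zero.
    padded = any(len(t) > 1 and t.startswith("0") for t in (start_text, end_text))
    width = max(len(start_text), len(end_text)) if padded else 0
    start, end = int(start_text), int(end_text)
    step = 1 if end >= start else -1
    prefix, suffix = pattern[: match.start()], pattern[match.end():]
    return [
        f"{prefix}{str(value).zfill(width)}{suffix}"
        for value in range(start, end + step, step)
    ]


shards = expand_numeric_range("shard-{0000..0999}.tar")
print(len(shards), shards[0], shards[-1])  # 1000 shard-0000.tar shard-0999.tar
```

Both endpoints are inclusive, so the pattern above names 1000 shards.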
Usage
Use this environment for all Distributed Training and Data Preparation workflows. It is required for loading LAION and MMC4 datasets via WebDataset pipelines, converting MMC4 archives to WebDataset format, and logging training metrics.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Disk | High IOPS storage | WebDataset reads from tar files; SSD recommended for training |
| Network | S3 access (optional) | LAION and MMC4 shards whose paths start with `s3` are streamed via `pipe:aws s3 cp` |
Dependencies
Python Packages
- `webdataset`
- `braceexpand`
- `torchvision`
- `scipy` (for `linear_sum_assignment` in MMC4 preprocessing)
- `wandb`
- `tqdm`
Credentials
The following environment variables may be needed:
- `WANDB_API_KEY`: Weights & Biases API key for experiment logging (when `--report_to_wandb` is set)
- `AWS_ACCESS_KEY_ID`: For S3-hosted LAION or MMC4 shards (when shard paths start with `s3`)
- `AWS_SECRET_ACCESS_KEY`: For S3-hosted LAION or MMC4 shards
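A pre-flight check along these lines can catch missing variables before a long job starts. This is a sketch, not part of the OpenFlamingo code base: `missing_credentials` and its parameters are hypothetical names. An unset variable is not necessarily an error, since the AWS CLI and wandb can also pick up credentials from their own configuration files.

```python
import os
from typing import List, Mapping


def missing_credentials(
    report_to_wandb: bool,
    shard_paths: List[str],
    env: Mapping[str, str] = os.environ,
) -> List[str]:
    """Return names of environment variables that look required but are unset."""
    required: List[str] = []
    if report_to_wandb:
        required.append("WANDB_API_KEY")
    # Same prefix test as the training script uses for S3 shard paths.
    if any(path.startswith("s3") for path in shard_paths):
        required += ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"]
    return [name for name in required if not env.get(name)]


# Hypothetical bucket name; an empty env simulates a fresh machine.
print(missing_credentials(True, ["s3://example-bucket/shard-{0000..0999}.tar"], env={}))
```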
Quick Install
# Install training extras via setup.py (the extras list quoted under Code Evidence omits scipy; install it separately)
pip install -e ".[training]"
# Or install manually
pip install webdataset braceexpand torchvision scipy wandb tqdm
# Or from requirements file
pip install -r requirements-training.txt
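After installing, a quick way to confirm that the packages listed under Dependencies are importable is to look each one up on the import path. The snippet below is a sketch; `missing_modules` is a hypothetical helper, and it only checks that each module can be found, not that its version is compatible.

```python
import importlib.util
from typing import Iterable, List

# Import names for the packages listed under Dependencies.
TRAINING_MODULES = ("webdataset", "braceexpand", "torchvision", "scipy", "wandb", "tqdm")


def missing_modules(names: Iterable[str] = TRAINING_MODULES) -> List[str]:
    """Return the module names that cannot be located on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("missing:", ", ".join(missing) if missing else "none")
```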
Code Evidence
Training extras from `setup.py:29-35`:
TRAINING = [
    "wandb",
    "torchvision",
    "braceexpand",
    "webdataset",
    "tqdm",
]
S3 shard streaming from `open_flamingo/train/train.py:222-226`:
if args.laion_shards.startswith("s3"):
    args.laion_shards = f"pipe:aws s3 cp {args.laion_shards} -"
if args.mmc4_shards.startswith("s3"):
    args.mmc4_shards = f"pipe:aws s3 cp {args.mmc4_shards} -"
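The rewrite in the excerpt above can be isolated as a small pure function, which makes its effect easy to inspect. `to_pipe_url` is a hypothetical name and the bucket path is made up; the string format is the one shown in the excerpt. The trailing `-` tells `aws s3 cp` to write the object to standard output, which the `pipe:` reader then consumes.

```python
def to_pipe_url(shards: str) -> str:
    """Wrap an S3 shard spec in a `pipe:` command, mirroring the excerpt above."""
    if shards.startswith("s3"):
        return f"pipe:aws s3 cp {shards} -"
    return shards  # local paths pass through unchanged


# Hypothetical bucket and paths, for illustration only.
print(to_pipe_url("s3://example-bucket/laion/{00000..00009}.tar"))
print(to_pipe_url("/data/laion/{00000..00009}.tar"))
```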
Hungarian algorithm import from `open_flamingo/train/data.py:17`:
from scipy.optimize import linear_sum_assignment
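To illustrate the problem that `linear_sum_assignment` solves, the sketch below finds a minimum-cost one-to-one assignment by brute force using only the standard library. It is not the Hungarian algorithm and not the MMC4 preprocessing code: `best_assignment` is a hypothetical helper, the cost values are made up, and enumerating permutations is only feasible for very small matrices.

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def best_assignment(cost: Sequence[Sequence[float]]) -> Tuple[List[Tuple[int, int]], float]:
    """Brute-force minimum-cost one-to-one assignment for a small square matrix."""
    n = len(cost)
    # Each permutation assigns column cols[row] to every row exactly once.
    best_cols = min(
        permutations(range(n)),
        key=lambda cols: sum(cost[row][col] for row, col in enumerate(cols)),
    )
    total = sum(cost[row][col] for row, col in enumerate(best_cols))
    return list(enumerate(best_cols)), total


# Made-up costs: rows could be images, columns sentences; lower is better.
cost = [
    [4, 1, 3],
    [2, 0, 5],
    [3, 2, 2],
]
pairs, total = best_assignment(cost)
print(pairs, total)  # [(0, 1), (1, 0), (2, 2)] 5
```

`scipy.optimize.linear_sum_assignment(cost)` returns the row and column indices of an optimal assignment for the same kind of matrix without enumerating every permutation, and accepts `maximize=True` when the matrix holds scores rather than costs.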
WebDataset pipeline construction from `open_flamingo/train/data.py:321-338`:
pipeline.extend([
    tarfile_to_samples_nothrow,
    wds.shuffle(
        bufsize=_SAMPLE_SHUFFLE_SIZE,
        initial=_SAMPLE_SHUFFLE_INITIAL,
    ),
])
pipeline.extend([
    wds.to_tuple("json", handler=log_and_continue),
    wds.map(preprocess_fn, handler=log_and_continue),
    wds.batched(args.batch_size_mmc4, partial=False),
])
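The `partial=False` argument in the last stage means an incomplete final batch is dropped rather than emitted. The following standard-library sketch shows that behavior in isolation. It is a simplified stand-in, not WebDataset's implementation: the local `batched` function here only groups samples into lists and does no collation.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(samples: Iterable[T], batch_size: int, partial: bool = True) -> Iterator[List[T]]:
    """Group samples into lists of `batch_size`; drop a short tail if partial=False."""
    batch: List[T] = []
    for sample in samples:
        batch.append(sample)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch and partial:
        yield batch  # leftover samples that did not fill a whole batch


print(list(batched(range(7), 3, partial=False)))  # [[0, 1, 2], [3, 4, 5]]
print(list(batched(range(7), 3, partial=True)))   # [[0, 1, 2], [3, 4, 5], [6]]
```

Dropping the short tail keeps every batch the same size, at the cost of skipping a few samples at the end of the stream.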
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `number of shards must be >= total workers` | Too few WebDataset shards for the total number of data-loading workers | Increase the shard count, or reduce `--workers` or the world size so that their product does not exceed the shard count |
| `number of samples per epoch must be equal for mmc4 and laion` | Mismatched `train_num_samples` / `batch_size` ratios | Ensure `train_num_samples_laion // batch_size_laion == train_num_samples_mmc4 // batch_size_mmc4` |
| `save_checkpoints_to_wandb requires report_to_wandb` | Trying to save checkpoints to wandb without enabling wandb | Add `--report_to_wandb` flag |
| `RuntimeError: Currently, number of dataset samples must be specified` | Missing `--train_num_samples_*` args | Pass `--train_num_samples_mmc4` and `--train_num_samples_laion` |
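The first two rows of the table reduce to simple arithmetic that can be checked before launching a job. The sketch below assumes, following the table, that "total workers" means workers per process times world size. `check_training_args` is a hypothetical helper, and its messages are this sketch's own wording, not the training script's.

```python
from typing import List


def check_training_args(
    num_shards: int,
    workers: int,
    world_size: int,
    train_num_samples_laion: int,
    batch_size_laion: int,
    train_num_samples_mmc4: int,
    batch_size_mmc4: int,
) -> List[str]:
    """Return human-readable problems; an empty list means both checks pass."""
    problems: List[str] = []
    total_workers = workers * world_size  # assumption taken from the table
    if num_shards < total_workers:
        problems.append(f"{num_shards} shards < {total_workers} total workers")
    laion_batches = train_num_samples_laion // batch_size_laion
    mmc4_batches = train_num_samples_mmc4 // batch_size_mmc4
    if laion_batches != mmc4_batches:
        problems.append(f"laion batches {laion_batches} != mmc4 batches {mmc4_batches}")
    return problems


# Illustrative numbers only: 1000 shards, 4 workers on each of 8 processes.
print(check_training_args(1000, 4, 8, 10_000, 100, 5_000, 50))  # []
print(check_training_args(16, 4, 8, 10_000, 100, 5_000, 25))
```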
Compatibility Notes
- S3 streaming: LAION and MMC4 shards whose paths start with `s3` are read directly from S3 using `pipe:aws s3 cp`. Requires the AWS CLI and credentials.
- Braceexpand: Shard paths use shell-style brace expansion (e.g., `shard-{0000..0999}.tar`).
- MMC4 conversion: The `convert_mmc4_to_wds.py` script requires raw MMC4 ZIP archives and pre-downloaded images.