Principle:Mlfoundations Open flamingo WebDataset Data Pipeline
Overview
Streaming data pipeline pattern using tar-archive-based sharding for efficient distributed loading of large-scale vision-language training datasets.
Description
WebDataset stores training data in tar files (shards). Within a shard, files that share a basename form one sample, for example an image and its text. This layout allows streaming reads without random access and efficient distributed training, since each worker reads different shards. It also gives fault tolerance: a corrupt sample can be skipped instead of stopping the run. OpenFlamingo uses two parallel pipelines: LAION (single image-text pairs) and MMC4 (interleaved multi-image documents). Both pipelines apply image preprocessing, text tokenization, and dataset-specific filtering. Deterministic shard shuffling with epoch-based seeds makes the shard order reproducible.
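The shard layout described above can be illustrated with a minimal sketch that uses only the Python standard library. It is not the WebDataset library's reader and not OpenFlamingo's code; the helper names `write_shard` and `iter_samples`, the `__key__` field, and the sample keys are assumptions chosen for illustration.

```python
import io
import tarfile


def write_shard(samples):
    """Pack samples into an in-memory tar shard (hypothetical helper).

    `samples` maps a key such as "000001" to a dict of extension -> bytes.
    Files belonging to one sample are written next to each other.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for key, files in samples.items():
            for ext, payload in files.items():
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


def iter_samples(fileobj):
    """Stream a shard front to back, grouping adjacent files by basename."""
    current_key, current = None, {}
    # Mode "r|" treats the archive as a forward-only stream: no seeking.
    with tarfile.open(fileobj=fileobj, mode="r|") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # Split "000001.jpg" into key "000001" and extension "jpg".
            key, _, ext = member.name.partition(".")
            if key != current_key and current:
                yield current  # the previous sample is complete
                current = {}
            current_key = key
            current["__key__"] = key
            current[ext] = tar.extractfile(member).read()
    if current:
        yield current


shard = write_shard({
    "000001": {"jpg": b"<image bytes>", "txt": b"a cat on a sofa"},
    "000002": {"jpg": b"<image bytes>", "txt": b"a red bicycle"},
})
decoded = list(iter_samples(shard))
```

Because the reader only moves forward through the archive, the same loop works over a pipe or an HTTP response body, which is what makes the format suitable for streaming.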
Usage
Use this pattern when the training dataset is too large to fit in memory, or when distributed training requires efficient data loading without redundant reads.
Theoretical Basis
Reading a tar shard from start to end produces sequential reads, which are optimal for high-throughput I/O. Each distributed rank reads a disjoint set of shards, so no central index is needed. Deterministic shuffling with an epoch-seeded RNG produces the same shard ordering across restarts, which makes runs reproducible. Error-tolerant tar parsing (the log_and_continue handler) prevents individual corrupt samples from stopping training.
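The three ideas in this section (epoch-seeded shuffling, disjoint shards per rank, and a handler that logs and skips bad samples) can be sketched with the standard library alone. This is an illustration under stated assumptions, not OpenFlamingo's implementation: `shards_for_rank` and `tolerant_map` are hypothetical helpers, the `seed + epoch` seeding and the strided split are one possible scheme, and `log_and_continue` here is a local stand-in that only mirrors the role of the handler named above.

```python
import logging
import random


def shards_for_rank(shard_urls, seed, epoch, rank, world_size):
    """Return the shards a single rank reads in one epoch (hypothetical)."""
    order = sorted(shard_urls)  # start from a canonical order
    # Every rank seeds its RNG identically, so all ranks compute the same
    # permutation without communicating; a restart at the same epoch
    # reproduces it.
    random.Random(seed + epoch).shuffle(order)
    # A strided slice hands each rank a disjoint subset of the permutation.
    return order[rank::world_size]


def log_and_continue(exn):
    """Local stand-in handler: log the error and ask the caller to go on."""
    logging.warning("Skipping sample after error: %r", exn)
    return True


def tolerant_map(fn, samples, handler=log_and_continue):
    """Apply fn to each sample, skipping samples whose processing fails."""
    for sample in samples:
        try:
            yield fn(sample)
        except Exception as exn:
            if handler(exn):
                continue  # drop this sample, keep the stream alive
            raise


shards = [f"shard-{i:04d}.tar" for i in range(8)]
rank0 = shards_for_rank(shards, seed=42, epoch=3, rank=0, world_size=2)
rank1 = shards_for_rank(shards, seed=42, epoch=3, rank=1, world_size=2)
parsed = list(tolerant_map(int, ["1", "not a number", "3"]))
```

In this sketch the two ranks together cover every shard exactly once per epoch, and the malformed middle sample is logged and dropped while the rest of the stream continues.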