Principle:Mlfoundations Open flamingo WebDataset Data Pipeline

Overview

A streaming data pipeline pattern that uses tar-archive sharding to load large-scale vision-language training datasets efficiently across distributed workers.

Description

WebDataset stores training data in tar files (shards). Each sample is a group of related files, such as an image and its text. This layout gives three benefits: shards can be streamed sequentially without random access, distributed training is efficient because each worker reads different shards, and the pipeline is fault tolerant. OpenFlamingo runs two parallel pipelines: LAION (single image-text pairs) and MMC4 (interleaved multi-image documents). Each pipeline applies image preprocessing, text tokenization, and dataset-specific filtering. Deterministic shard shuffling with epoch-based seeds makes the shard order reproducible.
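
The shard layout described above can be illustrated with a minimal sketch that uses only the Python standard library. It is not the OpenFlamingo or WebDataset implementation. The helper names write_shard and iter_samples, the sample keys, and the payloads are hypothetical. The sketch assumes that the files of one sample share a basename and sit next to each other in the archive.

```python
import io
import tarfile


def write_shard(samples):
    """Pack (key, image_bytes, caption) triples into an in-memory tar shard."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for key, image_bytes, caption in samples:
            files = (("jpg", image_bytes), ("txt", caption.encode("utf-8")))
            for ext, payload in files:
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def iter_samples(shard_bytes):
    """Stream a shard front to back, grouping adjacent files by basename."""
    current_key, sample = None, {}
    # Mode "r|" treats the archive as a forward-only stream (no seeking).
    with tarfile.open(fileobj=io.BytesIO(shard_bytes), mode="r|") as tar:
        for member in tar:
            if not member.isfile():
                continue
            key, _, ext = member.name.partition(".")
            if key != current_key and sample:
                yield sample  # basename changed: previous sample is complete
                sample = {}
            current_key = key
            sample["__key__"] = key
            sample[ext] = tar.extractfile(member).read()
    if sample:
        yield sample


# Hypothetical payloads; real shards would hold encoded images and captions.
shard = write_shard([
    ("000000", b"fake-image-0", "a cat"),
    ("000001", b"fake-image-1", "a dog"),
])
samples = list(iter_samples(shard))
```

Because the reader only moves forward through the archive, the same loop works over a file, a pipe, or a network stream.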

Usage

Use this pattern when the training dataset is too large to fit in memory, or when distributed training needs efficient data loading without redundant reads.

Theoretical Basis

Tar-based sharding produces sequential read patterns, which are well suited to high-throughput I/O. Each distributed rank reads a disjoint set of shards, so no central index is needed. Deterministic shuffling with an epoch-seeded RNG produces the same shard order for a given epoch across restarts, which makes runs reproducible. Error-tolerant tar parsing (the log_and_continue handler) prevents individual corrupt samples from stopping training.
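
The three mechanisms above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the code OpenFlamingo uses. The helpers shards_for_epoch, shards_for_rank, and decode_all are hypothetical, as are the shard names and the base_seed + epoch seeding rule. The log_and_continue function here is a stand-in that follows the common handler convention of returning True to skip the failing sample.

```python
import logging
import random


def shards_for_epoch(shards, epoch, base_seed=0):
    """Return the shard order for one epoch; same epoch gives same order."""
    order = sorted(shards)  # start from a canonical order
    # Assumption: the seed is derived as base_seed + epoch.
    random.Random(base_seed + epoch).shuffle(order)
    return order


def shards_for_rank(shards, rank, world_size):
    """Give each rank a disjoint slice of the shard list via striding."""
    return shards[rank::world_size]


def log_and_continue(exn):
    """Stand-in handler: log the error and signal that iteration continues."""
    logging.warning("Skipping sample after error: %r", exn)
    return True


def decode_all(raw_samples, decode, handler=log_and_continue):
    """Decode samples one by one, skipping those the handler accepts."""
    for raw in raw_samples:
        try:
            yield decode(raw)
        except Exception as exn:
            if not handler(exn):
                raise


# Hypothetical shard names for illustration.
all_shards = [f"shard-{i:05d}.tar" for i in range(8)]
epoch_order = shards_for_epoch(all_shards, epoch=3)
per_rank = [shards_for_rank(epoch_order, r, world_size=4) for r in range(4)]

# "x" cannot be parsed as an int, so it is logged and skipped.
decoded = list(decode_all(["1", "x", "3"], int))
```

Striding over the shuffled list is one simple way to get disjoint per-rank subsets. It needs only the rank and world size, not a shared index of individual samples.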
