Principle:Mlfoundations Open flamingo Dual Dataset Training Loop
Overview
Training strategy that processes both single image-text pairs (LAION) and interleaved multi-image sequences (MMC4) within each training step, so that the model learns both basic captioning and more complex in-context learning.
Description
The Flamingo training loop processes both LAION and MMC4 data in each step. LAION provides single image-caption pairs for learning basic visual grounding. MMC4 provides interleaved multi-image documents for developing in-context learning capabilities. The loss is cross-entropy computed only on text tokens that follow image tokens (loss masking), so the model is not trained to reproduce text patterns unrelated to visual content. The loop also applies gradient accumulation, mixed-precision autocast, and gradient clipping with a maximum norm of 1.0.
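The control flow described above can be sketched in plain Python. This is a minimal illustration, not the OpenFlamingo implementation: the names `train`, `clip_by_global_norm`, and `grad_fn`, the learning rate, and the scalar toy model are all hypothetical, and mixed-precision autocast is left out because it has no plain-Python equivalent. The clipping rule follows the usual global-norm formulation, similar in spirit to `torch.nn.utils.clip_grad_norm_`.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale gradients so their joint L2 norm does not exceed max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # The small epsilon avoids division by zero; the scale never exceeds 1.
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm


def train(params, laion_batches, mmc4_batches, grad_fn,
          lr=0.1, accum_steps=2, max_norm=1.0):
    """Toy dual-dataset loop: every step consumes one batch from each dataset.

    grad_fn(params, batch) is a hypothetical callable returning one gradient
    per parameter for the masked loss on that batch.
    """
    accum = [0.0] * len(params)
    paired = zip(laion_batches, mmc4_batches)
    for step, (laion_batch, mmc4_batch) in enumerate(paired, start=1):
        # Both datasets contribute gradients within the same step.
        for batch in (laion_batch, mmc4_batch):
            grads = grad_fn(params, batch)
            # Dividing by accum_steps averages over the accumulated steps.
            accum = [a + g / accum_steps for a, g in zip(accum, grads)]
        # Update the parameters only once every accum_steps steps.
        if step % accum_steps == 0:
            clipped, _ = clip_by_global_norm(accum, max_norm)
            params = [p - lr * g for p, g in zip(params, clipped)]
            accum = [0.0] * len(params)
    return params


if __name__ == "__main__":
    # Toy problem: one parameter, each "batch" is a target value,
    # and the loss is (param - target) ** 2.
    def toy_grad(params, target):
        return [2.0 * (params[0] - target)]

    print(train([0.0], [1.0, 1.0], [3.0, 3.0], toy_grad))
```

Summing the gradients from both datasets before clipping means the maximum-norm limit applies to the combined update rather than to each dataset separately; whether a real training script weights the two losses differently is outside what this sketch assumes.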
Usage
Use this principle when training an OpenFlamingo model that must learn both visual grounding and few-shot in-context learning.
Theoretical Basis
Loss masking is critical: the model should be supervised only on text that depends on visual input. For LAION, the loss is computed on the caption tokens that follow the <image> token. For MMC4, the loss is computed on the text tokens within each image-text chunk, that is, between <image> and <|endofchunk|>. The two datasets provide complementary training signals: LAION teaches object-level recognition, while MMC4 teaches document-level reasoning across multiple images.
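The masking rule above can be illustrated with a short, self-contained sketch. It is written from the description in this section rather than taken from the OpenFlamingo code. The function name `mask_labels` and the integer token ids are hypothetical, and treating the <|endofchunk|> token itself as a supervised target is an assumption. The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

# Hypothetical token ids, chosen only for this illustration.
PAD_ID = 0
IMAGE_ID = 1        # stands in for <image>
ENDOFCHUNK_ID = 2   # stands in for <|endofchunk|>


def mask_labels(token_ids, image_id=IMAGE_ID,
                endofchunk_id=ENDOFCHUNK_ID, pad_id=PAD_ID):
    """Return labels that keep only text between <image> and <|endofchunk|>.

    Every other position is set to IGNORE_INDEX so that it contributes
    nothing to the cross-entropy loss.
    """
    labels = []
    in_chunk = False
    for tok in token_ids:
        if tok == image_id:
            # The image placeholder opens a chunk but is not a target itself.
            in_chunk = True
            labels.append(IGNORE_INDEX)
        elif tok == pad_id:
            labels.append(IGNORE_INDEX)
        elif in_chunk:
            labels.append(tok)
            if tok == endofchunk_id:
                # Assumption: the closing token is supervised, then masking
                # resumes until the next image token.
                in_chunk = False
        else:
            # Text that does not follow an image is not supervised.
            labels.append(IGNORE_INDEX)
    return labels


if __name__ == "__main__":
    # LAION-style sample: one image followed by its caption.
    print(mask_labels([1, 20, 21, 2, 0]))
    # MMC4-style sample: text, then two image-text chunks, then padding.
    print(mask_labels([5, 1, 10, 11, 2, 7, 8, 1, 12, 2, 0, 0]))
```

The same function covers both datasets because a LAION sample is simply the one-chunk case of the interleaved format.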