Principle:OpenGVLab InternVL Multimodal Training Sampling
| Knowledge Sources | |
|---|---|
| Domains | Training, Multimodal Models, Data Loading |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
The multimodal training sampling principle defines a modality-aware data sampling strategy that groups training samples by modality (image-text vs. text-only) and by length, so that batches can be constructed efficiently in vision-language model training.
Description
When training multimodal models on datasets containing both image-text and text-only samples, naive random sampling can lead to inefficient batching. Image-text samples tend to be longer (image tokens expand the sequence), so mixing them randomly with short text-only samples wastes compute on padding.
The principle addresses this problem through modality-length-grouped sampling:
- Separate by modality: Samples are split into multimodal and language-only groups based on the sign of each entry in the modality_lengths attribute. Positive lengths mark multimodal samples and negative lengths mark language-only samples.
- Group by length within modality: Each group is independently ordered using length-grouped indices and then cut into megabatches of world_size × batch_size samples with similar lengths.
- Interleave and shuffle: The full megabatches from both modalities are pooled and randomly shuffled. The final megabatch of each modality, which may be partial, is excluded from the shuffle, and the two are merged into a single batch. The result is flattened into a single index sequence.
This ensures that each batch contains samples of similar modality and length, minimizing padding waste while keeping the batch order random from epoch to epoch. The approach also supports DeepSpeed ZeRO-3 compatible checkpointing: parameters are gathered from their distributed partitions before saving.
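The three sampling steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the InternVL implementation: the function names are hypothetical, the sign convention follows the description (positive for image-text, negative for text-only), and appending the merged leftover batch after the shuffled ones is an assumption of this sketch.

```python
import random


def length_grouped_indices(lengths, batch_size, world_size, rng):
    """Shuffle, cut into megabatches, and sort each megabatch longest-first."""
    order = list(range(len(lengths)))
    rng.shuffle(order)
    mega = world_size * batch_size
    megabatches = [order[i:i + mega] for i in range(0, len(order), mega)]
    megabatches = [sorted(mb, key=lambda i: lengths[i], reverse=True)
                   for mb in megabatches]
    return [i for mb in megabatches for i in mb]


def modality_length_grouped_indices(modality_lengths, batch_size, world_size, seed=0):
    """Return a flat index order grouped by modality and by length.

    Assumed convention: length > 0 is image-text, length < 0 is text-only.
    """
    assert all(l != 0 for l in modality_lengths), "zero length has no modality"
    rng = random.Random(seed)
    mega = world_size * batch_size

    # Step 1: separate by modality using the sign of the length.
    mm = [(i, l) for i, l in enumerate(modality_lengths) if l > 0]
    lang = [(i, -l) for i, l in enumerate(modality_lengths) if l < 0]

    # Step 2: length-group each modality, then cut into megabatches.
    def megabatches_for(group):
        if not group:
            return []
        ids, lens = zip(*group)
        order = length_grouped_indices(lens, batch_size, world_size, rng)
        grouped = [ids[j] for j in order]
        return [grouped[i:i + mega] for i in range(0, len(grouped), mega)]

    mm_mb = megabatches_for(mm)
    lang_mb = megabatches_for(lang)

    # Step 3: hold out the last (possibly partial) megabatch of each modality,
    # shuffle the remaining ones together, then add the merged leftovers.
    leftover = (mm_mb[-1] if mm_mb else []) + (lang_mb[-1] if lang_mb else [])
    full = mm_mb[:-1] + lang_mb[:-1]
    rng.shuffle(full)
    if leftover:
        full.append(sorted(leftover))  # placed last: an assumption of this sketch
    return [i for mb in full for i in mb]


# Hypothetical dataset: five image-text samples and five text-only samples.
lengths = [120, -8, 300, -15, 256, -4, 90, -30, 410, -12]
order = modality_length_grouped_indices(lengths, batch_size=2, world_size=1, seed=0)
print(order)
```

With these settings every full megabatch of two indices comes from a single modality, and only the merged leftover batch mixes the two. In a real training loop, an order like this would typically be produced once per epoch with a different seed.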
Usage
Apply this principle when training multimodal models on mixed datasets containing both image-text and text-only samples, to improve training efficiency through modality-aware batching.
Theoretical Basis
Length-grouped sampling is a standard technique for efficient sequence model training: batching sequences of similar length reduces the number of padding tokens. The multimodal extension ensures that the different compute profiles of image-text samples (longer, with vision encoder overhead) and text-only samples (shorter, with no vision computation) are handled separately, which improves GPU utilization.
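To make the padding argument concrete, the sketch below counts pad tokens for a small, made-up set of sequence lengths, once in a mixed order and once grouped by length. The token counts and the helper name `padding_tokens` are illustrative assumptions. A full sort is used here only for simplicity; the sampler described above groups by length within megabatches rather than sorting the whole dataset.

```python
def padding_tokens(lengths, batch_size):
    """Count pad tokens when consecutive samples form a batch padded to its max length."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total


# Hypothetical token counts: long image-text samples alternating with short text-only ones.
mixed = [800, 40, 760, 55, 820, 30, 790, 60]
grouped = sorted(mixed, reverse=True)

mixed_pad = padding_tokens(mixed, batch_size=2)
grouped_pad = padding_tokens(grouped, batch_size=2)
print(mixed_pad, grouped_pad)  # prints: 2985 65
```

In this toy case, grouping by length cuts padding from 2985 tokens to 65, because long image-text samples no longer share a batch with short text-only ones.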