

Principle:OpenGVLab InternVL Multimodal Training Sampling

From Leeroopedia


Knowledge Sources
Domains: Training, Multimodal Models, Data Loading
Last Updated: 2026-02-07 14:00 GMT

Overview

The multimodal training sampling principle defines a modality-aware data sampling strategy that groups training samples by modality (image vs. text-only) and length for efficient batch construction in vision-language model training.

Description

When training multimodal models on datasets containing both image-text and text-only samples, naive random sampling can lead to inefficient batching: image samples tend to be longer (due to image token expansion) and mixing them randomly with short text-only samples wastes compute on padding.

This principle addresses this through modality-length-grouped sampling:

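  1. Separate by modality: Samples are split into a multimodal group and a language-only group based on their modality_lengths attribute. The sign of each entry marks the modality (positive for multimodal, negative for language-only) and its magnitude gives the sample length.
  2. Sort by length within modality: Each group is independently arranged with length-grouped indices into megabatches of world_size x batch_size samples, so that the samples inside a megabatch have similar lengths.
  3. Interleave and shuffle: The final megabatch of each modality, which may be partial, is set aside, and the two leftovers are merged into one additional megabatch. The remaining megabatches from both modalities are combined and randomly shuffled, and everything is then flattened into a single index sequence.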
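The three steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the InternVL implementation: the function names are hypothetical, Python's `random` module stands in for whatever random generator the real sampler uses, each megabatch is simply sorted longest-first, and the merged leftover megabatch is placed at the end of the sequence (the description above does not say where it goes).

```python
import random


def length_grouped_megabatches(lengths, megabatch_size, rng):
    """Shuffle positions, cut them into megabatches, sort each one longest-first."""
    positions = list(range(len(lengths)))
    rng.shuffle(positions)
    megabatches = [positions[i:i + megabatch_size]
                   for i in range(0, len(positions), megabatch_size)]
    return [sorted(mb, key=lambda p: lengths[p], reverse=True) for mb in megabatches]


def modality_length_grouped_indices(modality_lengths, batch_size, world_size, seed=0):
    """Return a flat dataset index order grouped by modality (sign) and length (magnitude)."""
    if any(length == 0 for length in modality_lengths):
        raise ValueError("zero lengths carry no modality sign")
    rng = random.Random(seed)  # assumption: stdlib RNG instead of a framework generator
    megabatch_size = world_size * batch_size

    # Step 1: separate by modality using the sign convention.
    multimodal = [(i, n) for i, n in enumerate(modality_lengths) if n > 0]
    language = [(i, -n) for i, n in enumerate(modality_lengths) if n < 0]

    def grouped(pairs):
        if not pairs:
            return []
        ids, lens = zip(*pairs)
        # Step 2: length-group within one modality, then map back to dataset indices.
        return [[ids[p] for p in mb]
                for mb in length_grouped_megabatches(lens, megabatch_size, rng)]

    mm_megabatches = grouped(multimodal)
    lang_megabatches = grouped(language)

    # Step 3: hold out each modality's final (possibly partial) megabatch and merge them.
    leftover = (mm_megabatches[-1] if mm_megabatches else []) + \
               (lang_megabatches[-1] if lang_megabatches else [])
    megabatches = mm_megabatches[:-1] + lang_megabatches[:-1]
    rng.shuffle(megabatches)
    if leftover:
        megabatches.append(sorted(leftover))  # assumption: leftover goes last
    return [i for mb in megabatches for i in mb]


# Hypothetical dataset: 10 image-text samples (positive) and 10 text-only samples (negative).
example_lengths = [50 + i for i in range(10)] + [-(5 + i) for i in range(10)]
example_order = modality_length_grouped_indices(example_lengths, batch_size=2, world_size=2)
```

With these inputs the megabatch size is 4, so each modality yields two full megabatches plus a leftover of two samples. The first 16 indices of `example_order` therefore form four single-modality megabatches, and the last four indices are the merged leftover.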

As a result, most batches contain samples of a single modality and similar length, which minimizes padding waste, while the random order of megabatches maintains randomness across epochs. The training code also supports checkpointing under DeepSpeed ZeRO-3 by gathering parameters from their distributed partitions before saving.

Usage

Apply this principle when training multimodal models on mixed datasets containing both image-text and text-only samples, to improve training efficiency through modality-aware batching.

Theoretical Basis

Length-grouped sampling is a standard technique for efficient sequence model training. The multimodal extension ensures that the different compute profiles of image-text (longer, vision encoder overhead) and text-only (shorter, no vision) samples are handled separately for optimal GPU utilization.
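The padding argument can be made concrete with a small calculation. The sketch below assumes every batch is padded to its longest sample and uses made-up token counts; the function name and all numbers are hypothetical and chosen only for illustration.

```python
def padding_fraction(batches):
    """Fraction of token slots that are padding when each batch is padded to its longest sample."""
    padded_slots = sum(len(batch) * max(batch) for batch in batches)
    real_tokens = sum(sum(batch) for batch in batches)
    return 1 - real_tokens / padded_slots


# Hypothetical token counts: four image-text samples and four text-only samples.
image_text = [1100, 1000, 1050, 950]
text_only = [60, 80, 50, 70]

# Random mixing puts long and short samples in the same batch.
mixed_batches = [[1100, 60, 1000, 80], [1050, 50, 950, 70]]
# Modality-aware grouping keeps each batch within one modality.
grouped_batches = [image_text, text_only]

mixed_waste = padding_fraction(mixed_batches)
grouped_waste = padding_fraction(grouped_batches)
```

For these numbers about 49% of the token slots are padding in the mixed batches, against about 8% in the grouped batches. The real gap depends on the dataset's length distribution.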
