Principle:OpenGVLab InternVL Weighted Multi Dataset Training
| Knowledge Sources | |
|---|---|
| Domains | Model Training, Data Sampling, Multi-Dataset Learning |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Weighted Multi-Dataset Training is the principle of combining multiple heterogeneous training datasets with configurable sampling weights to balance representation across data sources during multimodal model training.
Description
When training large multimodal models, it is common to draw from many different data sources (captioning datasets, VQA datasets, instruction-following data, OCR data, etc.) that vary significantly in size, quality, and domain. Naive concatenation would cause the model to be dominated by the largest datasets while under-training on smaller but potentially higher-quality ones.
This principle addresses the problem through:
- Square-root proportional weighting: Dataset sampling weights are computed as the square root of each dataset's size, normalized to sum to 1. This sub-linear scaling ensures that large datasets are down-weighted relative to their size while small datasets receive proportionally more training signal.
- WeightedRandomSampler: PyTorch's WeightedRandomSampler draws indices with replacement according to the computed weights, creating an effective training distribution that differs from the raw data distribution.
- Repeat-time oversampling: Each dataset can specify a repeat_time multiplier, allowing explicit control over how many times a dataset appears in the combined pool before weighting is applied.
- Meta-file configuration: Dataset composition is specified declaratively in a JSON meta file, making it easy to add, remove, or re-weight datasets without changing code.
The combined approach of repeat_time (explicit oversampling) and sqrt-proportional weighting (implicit balancing) gives practitioners fine-grained control over the training data distribution.
Usage
Apply this principle when training multimodal models on multiple datasets of varying sizes. It is especially important for InternVL training where dozens of datasets spanning different tasks are combined into a single training run.
Theoretical Basis
The square-root weighting strategy is inspired by temperature-based sampling used in multilingual NLP, where sampling proportional to p^(1/T) with T>1 up-samples low-resource languages. Using sqrt (equivalent to T=2) provides a middle ground between uniform sampling (all datasets equal) and proportional sampling (dominated by large datasets). This helps prevent catastrophic forgetting of capabilities learned from smaller specialized datasets while still leveraging the full scale of larger datasets.