# Principle: OpenGVLab InternVL Multimodal Data Preparation
| Knowledge Sources | |
|---|---|
| Domains | Vision_Language, Data_Engineering, Preprocessing |
| Last Updated | 2026-02-07 00:00 GMT |
## Overview
A data preparation strategy for multimodal vision-language models that lazily loads and preprocesses heterogeneous datasets containing images, videos, and text conversations into a unified token format.
## Description
Multimodal data preparation addresses the challenge of unifying diverse data types (images, videos, multi-turn conversations) into a single training pipeline. The core idea is lazy loading: rather than preprocessing all data upfront, each sample is loaded and transformed on-demand during training. This enables training on datasets that are too large to fit in memory.
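The lazy-loading idea can be sketched as a map-style dataset: only the lightweight annotation records live in memory, and all heavy decoding/transformation happens in `__getitem__` when a sample is requested. This is a minimal illustration; the class and field names are assumptions, not InternVL's actual API.

```python
import json

class LazyMultimodalDataset:
    """Map-style dataset: annotations are read once up front, but each
    sample's heavy preprocessing runs only when the sample is requested."""

    def __init__(self, annotation_path, preprocess):
        # Only the small JSONL annotation records are held in memory.
        with open(annotation_path) as f:
            self.records = [json.loads(line) for line in f]
        self.preprocess = preprocess  # heavy transform, applied lazily

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        # Expensive work (decode image/video, tile, tokenize) happens
        # here, on demand, not at construction time.
        return self.preprocess(self.records[idx])
```

In a real pipeline `preprocess` would decode media and tokenize text; a data loader then pulls samples as training proceeds, so the full dataset is never materialized.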
The approach handles:
- Image data: Static images processed through dynamic resolution tiling
- Video data: Video frames sampled at configurable rates and processed as multi-image sequences
- Multi-turn conversations: Formatted using conversation templates specific to the LLM backend
- Mixed datasets: Multiple datasets with different characteristics are combined via a meta-file that specifies dataset paths, weights, and augmentation settings
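The video frame sampling mentioned above can be sketched as evenly spaced index selection, clamped to a frame budget. The function name and defaults here are illustrative assumptions, not InternVL's actual API.

```python
def sample_frame_indices(total_frames, min_frames=8, max_frames=32):
    """Pick evenly spaced frame indices over a clip, with the count
    clamped to [min_frames, max_frames]. Illustrative sketch only."""
    if total_frames <= 0:
        return []
    num = max(min_frames, min(max_frames, total_frames))
    # Evenly spaced positions across the clip; indices may repeat when
    # the clip has fewer frames than min_frames.
    return [round(i * (total_frames - 1) / max(num - 1, 1)) for i in range(num)]
```

The sampled frames are then processed exactly like a multi-image input, which is what lets video reuse the image pipeline.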
The key innovation is dynamic image resolution: rather than resizing every image to a fixed resolution, each image is split into 1 to N aspect-ratio-aware tiles, preserving fine-grained visual detail in high-resolution or unusually shaped images.
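The tile-grid selection behind dynamic resolution can be sketched as picking the (cols, rows) grid whose aspect ratio best matches the input image, within a tile budget. This is a sketch of the idea, not InternVL's exact selection rule.

```python
def find_best_grid(width, height, min_tiles=1, max_tiles=12):
    """Choose the (cols, rows) tile grid whose aspect ratio is closest
    to the image's, with cols * rows in [min_tiles, max_tiles].
    Illustrative sketch; tie-breaking rules may differ in practice."""
    target = width / height
    candidates = [
        (c, r)
        for c in range(1, max_tiles + 1)
        for r in range(1, max_tiles + 1)
        if min_tiles <= c * r <= max_tiles
    ]
    # Closest aspect ratio wins; ties break toward more tiles,
    # keeping more visual detail.
    return min(candidates, key=lambda g: (abs(g[0] / g[1] - target), -g[0] * g[1]))
```

Each cell of the chosen grid is then cropped and resized to the vision encoder's native input size, so a wide image yields a wide grid of tiles instead of being squashed.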
## Usage
Use this principle when building training pipelines for vision-language models that must handle heterogeneous multimodal data. It is the appropriate strategy when:
- Training data includes both images and videos
- Datasets vary in format and require different preprocessing
- Memory constraints require lazy loading rather than full dataset materialization
- Dynamic resolution (multi-tile) processing is needed for high-fidelity image understanding
## Theoretical Basis
The multimodal data preparation pipeline implements a lazy evaluation pattern common in large-scale data processing:
```python
# Pseudo-code: lazy multimodal data loading
for sample in dataset:
    if sample.has_image:
        tiles = dynamic_tile(sample.image, min_tiles=1, max_tiles=12)
        pixel_values = transform(tiles)
        image_tokens = [IMG_START] + [IMG_CONTEXT] * num_tokens_per_tile * len(tiles) + [IMG_END]
    elif sample.has_video:
        frames = sample_frames(sample.video, min_frames=8, max_frames=32)
        pixel_values = transform(frames)
        image_tokens = [IMG_START] + [IMG_CONTEXT] * num_tokens_per_tile * len(frames) + [IMG_END]
    else:
        image_tokens = []  # text-only sample: nothing to splice in

    text_tokens = format_conversation(sample.conversations, template=llm_template)
    input_ids = tokenize(text_tokens.replace(IMG_PLACEHOLDER, image_tokens))
    labels = mask_human_turns(input_ids)  # compute loss only on assistant turns
```
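The placeholder-expansion step in the pseudo-code can be made concrete: a single `<image>` placeholder in the conversation text is replaced by the start/context/end tokens that the vision encoder's output will later occupy. The token strings and per-tile count below are illustrative defaults, not necessarily InternVL's.

```python
IMG_START, IMG_CONTEXT, IMG_END = "<img>", "<IMG_CONTEXT>", "</img>"

def expand_image_placeholder(text, num_tiles, tokens_per_tile=256,
                             placeholder="<image>"):
    """Replace the placeholder with IMG_START + N context tokens + IMG_END,
    where N = tokens_per_tile * num_tiles. Token strings and counts are
    illustrative assumptions."""
    image_tokens = IMG_START + IMG_CONTEXT * (tokens_per_tile * num_tiles) + IMG_END
    return text.replace(placeholder, image_tokens, 1)
```

After tokenization, the positions holding `IMG_CONTEXT` are where projected vision features are injected, which is why the count must match the number of tiles exactly.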
The dataset meta-file defines a mixture of datasets with weighted sampling:
```json
{
  "dataset_name": {
    "root": "/path/to/images",
    "annotation": "/path/to/annotations.jsonl",
    "data_augment": false,
    "repeat_time": 1,
    "length": 100000
  }
}
```
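Given such a meta-file, per-dataset sampling weights can be derived from each dataset's `length` scaled by `repeat_time` (values greater than 1 oversample a dataset). This is a sketch of one reasonable weighting scheme, not a documented InternVL formula.

```python
def mixture_weights(meta):
    """Sampling probability per dataset, proportional to
    length * repeat_time. Sketch of a weighting scheme."""
    effective = {name: cfg["length"] * cfg["repeat_time"]
                 for name, cfg in meta.items()}
    total = sum(effective.values())
    return {name: n / total for name, n in effective.items()}
```

With these probabilities, a sampler can draw each training batch's datasets in proportion to their effective size, so small datasets can be boosted simply by raising `repeat_time` in the meta-file.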