# Principle: OpenGVLab InternVL Multimodal Data Preparation
| Knowledge Sources | |
|---|---|
| Domains | Vision_Language, Data_Engineering, Preprocessing |
| Last Updated | 2026-02-07 00:00 GMT |
## Overview
A data preparation strategy for multimodal vision-language models that lazily loads and preprocesses heterogeneous datasets containing images, videos, and text conversations into a unified token format.
## Description
Multimodal data preparation addresses the challenge of unifying diverse data types (images, videos, multi-turn conversations) into a single training pipeline. The core idea is lazy loading: rather than preprocessing all data upfront, each sample is loaded and transformed on-demand during training. This enables training on datasets that are too large to fit in memory.
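The lazy-loading idea can be sketched as a map-style dataset: only the lightweight annotation records live in memory, and all heavy decoding/transformation happens in `__getitem__` when a sample is requested. This is a minimal illustration; the class and field names are assumptions, not InternVL's actual API.

```python
import json

class LazyMultimodalDataset:
    """Map-style dataset: annotations are read once up front, but each
    sample's heavy preprocessing runs only when the sample is requested."""

    def __init__(self, annotation_path, preprocess):
        # Only the small JSONL annotation records are held in memory.
        with open(annotation_path) as f:
            self.records = [json.loads(line) for line in f]
        self.preprocess = preprocess  # heavy transform, applied lazily

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        # Expensive work (decode image/video, tile, tokenize) happens
        # here, on demand, not at construction time.
        return self.preprocess(self.records[idx])
```

In a real pipeline `preprocess` would decode media and tokenize text; a data loader then pulls samples as training proceeds, so the full dataset is never materialized.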
The approach handles:
- Image data: Static images processed through dynamic resolution tiling
- Video data: Video frames sampled at configurable rates and processed as multi-image sequences
- Multi-turn conversations: Formatted using conversation templates specific to the LLM backend
- Mixed datasets: Multiple datasets with different characteristics are combined via a meta-file that specifies dataset paths, weights, and augmentation settings
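The video frame sampling mentioned above can be sketched as evenly spaced index selection, clamped to a frame budget. The function name and defaults here are illustrative assumptions, not InternVL's actual API.

```python
def sample_frame_indices(total_frames, min_frames=8, max_frames=32):
    """Pick evenly spaced frame indices over a clip, with the count
    clamped to [min_frames, max_frames]. Illustrative sketch only."""
    if total_frames <= 0:
        return []
    num = max(min_frames, min(max_frames, total_frames))
    # Evenly spaced positions across the clip; indices may repeat when
    # the clip has fewer frames than min_frames.
    return [round(i * (total_frames - 1) / max(num - 1, 1)) for i in range(num)]
```

The sampled frames are then processed exactly like a multi-image input, which is what lets video reuse the image pipeline.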
The key innovation is dynamic image resolution: rather than resizing every image to a fixed resolution, each image is split into 1 to N aspect-ratio-aware tiles, preserving fine-grained visual detail in high-resolution or unusually shaped images.
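The tile-grid selection behind dynamic resolution can be sketched as picking the (cols, rows) grid whose aspect ratio best matches the input image, within a tile budget. This is a sketch of the idea, not InternVL's exact selection rule.

```python
def find_best_grid(width, height, min_tiles=1, max_tiles=12):
    """Choose the (cols, rows) tile grid whose aspect ratio is closest
    to the image's, with cols * rows in [min_tiles, max_tiles].
    Illustrative sketch; tie-breaking rules may differ in practice."""
    target = width / height
    candidates = [
        (c, r)
        for c in range(1, max_tiles + 1)
        for r in range(1, max_tiles + 1)
        if min_tiles <= c * r <= max_tiles
    ]
    # Closest aspect ratio wins; ties break toward more tiles,
    # keeping more visual detail.
    return min(candidates, key=lambda g: (abs(g[0] / g[1] - target), -g[0] * g[1]))
```

Each cell of the chosen grid is then cropped and resized to the vision encoder's native input size, so a wide image yields a wide grid of tiles instead of being squashed.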
## Usage
Use this principle when building training pipelines for vision-language models that must handle heterogeneous multimodal data. It is the appropriate strategy when:
- Training data includes both images and videos
- Datasets vary in format and require different preprocessing
- Memory constraints require lazy loading rather than full dataset materialization
- Dynamic resolution (multi-tile) processing is needed for high-fidelity image understanding
## Theoretical Basis
The multimodal data preparation pipeline implements a lazy evaluation pattern common in large-scale data processing:
```python
# Pseudo-code: lazy multimodal data loading
for sample in dataset:
    if sample.has_image:
        tiles = dynamic_tile(sample.image, min_tiles=1, max_tiles=12)
        pixel_values = transform(tiles)
        image_tokens = [IMG_START] + [IMG_CONTEXT] * num_tokens_per_tile * len(tiles) + [IMG_END]
    elif sample.has_video:
        frames = sample_frames(sample.video, min_frames=8, max_frames=32)
        pixel_values = transform(frames)
        image_tokens = [IMG_START] + [IMG_CONTEXT] * num_tokens_per_tile * len(frames) + [IMG_END]
    else:
        image_tokens = []  # text-only sample: nothing to splice in

    text_tokens = format_conversation(sample.conversations, template=llm_template)
    input_ids = tokenize(text_tokens.replace(IMG_PLACEHOLDER, image_tokens))
    labels = mask_human_turns(input_ids)  # compute loss only on assistant turns
```
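The placeholder-expansion step in the pseudo-code can be made concrete: a single `<image>` placeholder in the conversation text is replaced by the start/context/end tokens that the vision encoder's output will later occupy. The token strings and per-tile count below are illustrative defaults, not necessarily InternVL's.

```python
IMG_START, IMG_CONTEXT, IMG_END = "<img>", "<IMG_CONTEXT>", "</img>"

def expand_image_placeholder(text, num_tiles, tokens_per_tile=256,
                             placeholder="<image>"):
    """Replace the placeholder with IMG_START + N context tokens + IMG_END,
    where N = tokens_per_tile * num_tiles. Token strings and counts are
    illustrative assumptions."""
    image_tokens = IMG_START + IMG_CONTEXT * (tokens_per_tile * num_tiles) + IMG_END
    return text.replace(placeholder, image_tokens, 1)
```

After tokenization, the positions holding `IMG_CONTEXT` are where projected vision features are injected, which is why the count must match the number of tiles exactly.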
The dataset meta-file defines a mixture of datasets with weighted sampling:
```json
{
  "dataset_name": {
    "root": "/path/to/images",
    "annotation": "/path/to/annotations.jsonl",
    "data_augment": false,
    "repeat_time": 1,
    "length": 100000
  }
}
```
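Given such a meta-file, per-dataset sampling weights can be derived from each dataset's `length` scaled by `repeat_time` (values greater than 1 oversample a dataset). This is a sketch of one reasonable weighting scheme, not a documented InternVL formula.

```python
def mixture_weights(meta):
    """Sampling probability per dataset, proportional to
    length * repeat_time. Sketch of a weighting scheme."""
    effective = {name: cfg["length"] * cfg["repeat_time"]
                 for name, cfg in meta.items()}
    total = sum(effective.values())
    return {name: n / total for name, n in effective.items()}
```

With these probabilities, a sampler can draw each training batch's datasets in proportion to their effective size, so small datasets can be boosted simply by raising `repeat_time` in the meta-file.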