Principle: OpenGVLab InternVL Vision-Language Model Loading
| Knowledge Sources | |
|---|---|
| Domains | Vision_Language, Model_Architecture, Deep_Learning |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A model initialization strategy for composite vision-language models that loads and connects a vision encoder, MLP projector, and language model from pretrained checkpoints.
Description
Vision-language model loading is the process of instantiating a multimodal model from pretrained weights. For models with composite architectures (separate vision encoder, projector, and language model), this involves either:
- Path A (Unified): Loading a complete model from a single checkpoint directory containing all three components
- Path B (Assembly): Loading separate pretrained components (e.g., InternViT + InternLM2) and assembling them with a randomly-initialized projector
Path A is used when fine-tuning a previously trained model. Path B is used for initial pretraining (Stage 1), where the MLP projector must be trained from scratch to bridge the vision and language representations.
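The two paths can be sketched as follows. This is an illustrative sketch only: the real InternVL loading goes through pretrained checkpoint files and model classes, whereas here "checkpoints" are plain dicts of stand-in weights and all function names are hypothetical.

```python
import random

def load_unified(checkpoint):
    """Path A: all three submodules come from one checkpoint."""
    return {k: checkpoint[k]
            for k in ("vision_model", "mlp_projector", "language_model")}

def assemble(vit_weights, llm_weights, proj_dim=8):
    """Path B: pretrained ViT + LLM, with a freshly initialized projector."""
    projector = [random.gauss(0.0, 0.02) for _ in range(proj_dim)]
    return {
        "vision_model": vit_weights,    # pretrained (e.g. InternViT)
        "mlp_projector": projector,     # random init, trained in Stage 1
        "language_model": llm_weights,  # pretrained (e.g. InternLM2)
    }
```

The essential difference is only where the projector weights come from: Path A restores them from the checkpoint, Path B draws them fresh, which is why Stage 1 must train the projector before the model is usable.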
After loading, the model's submodules can be selectively frozen or unfrozen to control which parameters are updated during training.
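The selective freeze/unfreeze step can be sketched with toy parameter containers; in PyTorch the same effect comes from toggling `requires_grad` on each submodule's parameters. The class and function names below are stand-ins, not the InternVL API.

```python
class Param:
    """Stand-in for a trainable tensor carrying a requires_grad flag."""
    def __init__(self):
        self.requires_grad = True

class Module:
    """Stand-in for a submodule holding a bag of parameters."""
    def __init__(self, n_params):
        self._params = [Param() for _ in range(n_params)]

    def parameters(self):
        return self._params

def set_trainable(module, trainable):
    # Mirrors the PyTorch pattern of setting requires_grad per parameter
    for p in module.parameters():
        p.requires_grad = trainable

# Stage-1-style setup: freeze encoder and LLM, train only the projector
vision, projector, llm = Module(4), Module(2), Module(8)
set_trainable(vision, False)
set_trainable(llm, False)
set_trainable(projector, True)
```

Frozen parameters receive no gradient updates, so Stage 1 optimizes only the projector while the pretrained vision and language weights stay fixed.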
Usage
Use this principle when initializing an InternVL model for any training or inference task. Choose Path A when starting from an existing InternVL checkpoint, and Path B when assembling a new model from separately pretrained vision and language components.
Theoretical Basis
The composite architecture consists of three submodules:
# Pseudo-code: Composite VLM architecture
class VisionLanguageModel:
    vision_model: ViT    # Encodes images into visual features
    mlp_projector: MLP   # Maps visual features into the LLM embedding space
    language_model: LLM  # Generates text conditioned on visual + text tokens

    def forward(self, pixel_values, input_ids):
        # 1. Extract visual features
        vit_embeds = self.vision_model(pixel_values)       # [B, N_patches, D_vit]
        # 2. Pixel-shuffle downsampling (4:1): merge 2x2 patch groups
        vit_embeds = pixel_shuffle(vit_embeds, scale=0.5)  # [B, N_patches/4, 4*D_vit]
        # 3. Project into the LLM embedding space
        vit_embeds = self.mlp_projector(vit_embeds)        # [B, N_patches/4, D_llm]
        # 4. Replace image placeholder tokens with visual embeddings
        input_embeds = self.language_model.embed_tokens(input_ids)
        input_embeds[image_positions] = vit_embeds
        # 5. Forward through the LLM
        return self.language_model(inputs_embeds=input_embeds)
The MLP projector is a 2-layer MLP applied after pixel-shuffle downsampling, which reduces the number of visual tokens by 4x (each 2x2 group of patch features is concatenated into one token, so the channel dimension grows 4x) before the projection to the LLM hidden dimension.
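The token-merging arithmetic of pixel shuffle can be illustrated on a plain nested-list feature grid. The real implementation operates on batched tensors with view/permute operations, but the effect is the same: each 2x2 block of patch features becomes one token with 4x the channels. This is a minimal sketch, not the InternVL function.

```python
def pixel_shuffle(grid, scale=0.5):
    """Downsample an [H][W][C] feature grid: merge each 2x2 block of
    patch features into one token by concatenating their channels,
    so H*W tokens become H*W/4 tokens of dimension 4*C."""
    step = int(1 / scale)  # 2 for scale=0.5
    h, w = len(grid), len(grid[0])
    out = []
    for i in range(0, h, step):
        row = []
        for j in range(0, w, step):
            merged = []  # concatenated channels of the step x step block
            for di in range(step):
                for dj in range(step):
                    merged.extend(grid[i + di][j + dj])
            row.append(merged)
        out.append(row)
    return out
```

For a 4x4 grid of 4-dimensional features this yields a 2x2 grid of 16-dimensional tokens: 16 tokens become 4 (the 4:1 reduction), matching the shapes annotated in the pseudo-code above.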