
Principle:OpenGVLab InternVL Vision Language Model Loading

From Leeroopedia


Knowledge Sources
Domains Vision_Language, Model_Architecture, Deep_Learning
Last Updated 2026-02-07 00:00 GMT

Overview

A model initialization strategy for composite vision-language models that loads and connects a vision encoder, MLP projector, and language model from pretrained checkpoints.

Description

Vision-language model loading is the process of instantiating a multimodal model from pretrained weights. For models with composite architectures (separate vision encoder, projector, and language model), this involves either:

  • Path A (Unified): Loading a complete model from a single checkpoint directory containing all three components
  • Path B (Assembly): Loading separate pretrained components (e.g., InternViT + InternLM2) and assembling them with a randomly initialized projector

Path A is used for fine-tuning previously trained models. Path B is used for initial pretraining (Stage 1) where the MLP projector must be trained from scratch to bridge the vision and language representations.
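The two paths can be sketched as follows. This is a minimal illustration of the control flow only; the function names (`load_unified`, `load_assembled`) and the dict-of-weights representation are assumptions for the sketch, not the real InternVL API, which loads checkpoints via framework utilities such as `from_pretrained`.

```python
def load_unified(checkpoint):
    """Path A: all three submodules come from one checkpoint directory."""
    return {
        "vision_model": checkpoint["vision_model"],
        "mlp_projector": checkpoint["mlp_projector"],
        "language_model": checkpoint["language_model"],
    }

def load_assembled(vit_weights, llm_weights, init_projector):
    """Path B: pretrained encoder + LLM, projector initialized from scratch."""
    return {
        "vision_model": vit_weights,
        "mlp_projector": init_projector(),  # random init; trained in Stage 1
        "language_model": llm_weights,
    }
```

The key difference is visible in the projector slot: Path A restores trained projector weights, while Path B discards any prior projector and starts from a fresh random initialization.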

After loading, the model's submodules can be selectively frozen or unfrozen to control which parameters are updated during training.
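Selective freezing can be sketched as below. The sketch models parameters as dicts with a `requires_grad` flag (mirroring the PyTorch convention); the helper names and the Stage-1 policy shown (train only the projector) are illustrative assumptions.

```python
def set_trainable(params, trainable):
    """Mark a submodule's parameters as trainable or frozen."""
    for p in params:
        p["requires_grad"] = trainable

def freeze_for_stage1(model):
    """Stage-1 pretraining: update only the projector that bridges
    vision and language representations; keep both pretrained towers fixed."""
    set_trainable(model["vision_model"], False)
    set_trainable(model["mlp_projector"], True)
    set_trainable(model["language_model"], False)
```

In a real PyTorch model the same policy is applied by setting `p.requires_grad = False` on `model.vision_model.parameters()` and `model.language_model.parameters()`.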

Usage

Use this principle when initializing an InternVL model for any training or inference task. Choose Path A when starting from an existing InternVL checkpoint, and Path B when assembling a new model from separately pretrained vision and language components.

Theoretical Basis

The composite architecture consists of three submodules:

# Pseudo-code: Composite VLM architecture
class VisionLanguageModel:
    vision_model: ViT         # Encodes images to visual features
    mlp_projector: MLP        # Maps visual features to LLM embedding space
    language_model: LLM       # Generates text conditioned on visual + text tokens

    def forward(self, pixel_values, input_ids):
        # 1. Extract visual features
        vit_embeds = self.vision_model(pixel_values)  # [B, N_patches, D_vit]

        # 2. Pixel-shuffle downsampling (4 tokens -> 1), then project to LLM space
        vit_embeds = pixel_shuffle(vit_embeds, scale=0.5)  # [B, N/4, 4*D_vit]
        vit_embeds = self.mlp_projector(vit_embeds)        # [B, N/4, D_llm]

        # 3. Replace image placeholder tokens with visual embeddings
        input_embeds = self.language_model.embed_tokens(input_ids)
        input_embeds[image_positions] = vit_embeds

        # 4. Forward through LLM
        return self.language_model(inputs_embeds=input_embeds)

The projector is a two-layer MLP applied after pixel-shuffle downsampling. The pixel shuffle (scale factor 0.5) halves each spatial dimension of the visual token grid, reducing the token count 4x while quadrupling the channel dimension, so the projector maps 4*D_vit channels to the LLM embedding dimension.
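A minimal NumPy sketch of the pixel-shuffle step, assuming visual tokens arranged as an [N, H, W, C] grid (the reshape/transpose sequence mirrors the view/permute pattern used in InternVL-style implementations; exact details may differ):

```python
import numpy as np

def pixel_shuffle(x, scale_factor=0.5):
    """Merge each 2x2 block of visual tokens into one token with 4x channels."""
    n, h, w, c = x.shape
    # [N, H, W, C] -> [N, H, W*s, C/s]
    x = x.reshape(n, h, int(w * scale_factor), int(c / scale_factor))
    # -> [N, W*s, H, C/s]
    x = x.transpose(0, 2, 1, 3)
    # -> [N, W*s, H*s, C/(s*s)]
    x = x.reshape(n, int(w * scale_factor), int(h * scale_factor),
                  int(c / (scale_factor * scale_factor)))
    # -> [N, H*s, W*s, C/(s*s)]
    x = x.transpose(0, 2, 1, 3)
    return x

# A 4x4 grid of 8-dim tokens becomes a 2x2 grid of 32-dim tokens:
# 16 tokens -> 4 tokens (4:1), total element count unchanged.
```

The output then passes through the two-layer MLP (in InternVL this is typically a LayerNorm followed by Linear, GELU, and a second Linear) to reach the LLM embedding size.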

Related Pages

Implemented By

Uses Heuristic
