Principle: OpenGVLab InternVL Vision-Language Model Loading
| Knowledge Sources | |
|---|---|
| Domains | Vision_Language, Model_Architecture, Deep_Learning |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A model initialization strategy for composite vision-language models that loads and connects a vision encoder, MLP projector, and language model from pretrained checkpoints.
Description
Vision-language model loading is the process of instantiating a multimodal model from pretrained weights. For models with composite architectures (separate vision encoder, projector, and language model), this involves either:
- Path A (Unified): Loading a complete model from a single checkpoint directory containing all three components
- Path B (Assembly): Loading separate pretrained components (e.g., InternViT + InternLM2) and assembling them with a randomly-initialized projector
Path A is used when fine-tuning a previously trained model. Path B is used for initial pretraining (Stage 1), where the MLP projector must be trained from scratch to bridge the vision and language representations.
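The two paths can be sketched as follows. This is an illustrative sketch only: the real InternVL loading goes through pretrained checkpoint files and model classes, whereas here "checkpoints" are plain dicts of stand-in weights and all function names are hypothetical.

```python
import random

def load_unified(checkpoint):
    """Path A: all three submodules come from one checkpoint."""
    return {k: checkpoint[k]
            for k in ("vision_model", "mlp_projector", "language_model")}

def assemble(vit_weights, llm_weights, proj_dim=8):
    """Path B: pretrained ViT + LLM, with a freshly initialized projector."""
    projector = [random.gauss(0.0, 0.02) for _ in range(proj_dim)]
    return {
        "vision_model": vit_weights,    # pretrained (e.g. InternViT)
        "mlp_projector": projector,     # random init, trained in Stage 1
        "language_model": llm_weights,  # pretrained (e.g. InternLM2)
    }
```

The essential difference is only where the projector weights come from: Path A restores them from the checkpoint, Path B draws them fresh, which is why Stage 1 must train the projector before the model is usable.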
After loading, the model's submodules can be selectively frozen or unfrozen to control which parameters are updated during training.
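The selective freeze/unfreeze step can be sketched with toy parameter containers; in PyTorch the same effect comes from toggling `requires_grad` on each submodule's parameters. The class and function names below are stand-ins, not the InternVL API.

```python
class Param:
    """Stand-in for a trainable tensor carrying a requires_grad flag."""
    def __init__(self):
        self.requires_grad = True

class Module:
    """Stand-in for a submodule holding a bag of parameters."""
    def __init__(self, n_params):
        self._params = [Param() for _ in range(n_params)]

    def parameters(self):
        return self._params

def set_trainable(module, trainable):
    # Mirrors the PyTorch pattern of setting requires_grad per parameter
    for p in module.parameters():
        p.requires_grad = trainable

# Stage-1-style setup: freeze encoder and LLM, train only the projector
vision, projector, llm = Module(4), Module(2), Module(8)
set_trainable(vision, False)
set_trainable(llm, False)
set_trainable(projector, True)
```

Frozen parameters receive no gradient updates, so Stage 1 optimizes only the projector while the pretrained vision and language weights stay fixed.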
Usage
Use this principle when initializing an InternVL model for any training or inference task. Choose Path A when starting from an existing InternVL checkpoint, and Path B when assembling a new model from separately pretrained vision and language components.
Theoretical Basis
The composite architecture consists of three submodules:
# Pseudo-code: Composite VLM architecture
class VisionLanguageModel:
    vision_model: ViT    # Encodes images into visual features
    mlp_projector: MLP   # Maps visual features into the LLM embedding space
    language_model: LLM  # Generates text conditioned on visual + text tokens

    def forward(self, pixel_values, input_ids):
        # 1. Extract visual features
        vit_embeds = self.vision_model(pixel_values)       # [B, N_patches, D_vit]
        # 2. Pixel-shuffle downsampling (4:1): merge 2x2 patch groups
        vit_embeds = pixel_shuffle(vit_embeds, scale=0.5)  # [B, N_patches/4, 4*D_vit]
        # 3. Project into the LLM embedding space
        vit_embeds = self.mlp_projector(vit_embeds)        # [B, N_patches/4, D_llm]
        # 4. Replace image placeholder tokens with visual embeddings
        input_embeds = self.language_model.embed_tokens(input_ids)
        input_embeds[image_positions] = vit_embeds
        # 5. Forward through the LLM
        return self.language_model(inputs_embeds=input_embeds)
The MLP projector is a 2-layer MLP applied after pixel-shuffle downsampling, which reduces the number of visual tokens by 4x (each 2x2 group of patch features is concatenated into one token, so the channel dimension grows 4x) before the projection to the LLM hidden dimension.
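The token-merging arithmetic of pixel shuffle can be illustrated on a plain nested-list feature grid. The real implementation operates on batched tensors with view/permute operations, but the effect is the same: each 2x2 block of patch features becomes one token with 4x the channels. This is a minimal sketch, not the InternVL function.

```python
def pixel_shuffle(grid, scale=0.5):
    """Downsample an [H][W][C] feature grid: merge each 2x2 block of
    patch features into one token by concatenating their channels,
    so H*W tokens become H*W/4 tokens of dimension 4*C."""
    step = int(1 / scale)  # 2 for scale=0.5
    h, w = len(grid), len(grid[0])
    out = []
    for i in range(0, h, step):
        row = []
        for j in range(0, w, step):
            merged = []  # concatenated channels of the step x step block
            for di in range(step):
                for dj in range(step):
                    merged.extend(grid[i + di][j + dj])
            row.append(merged)
        out.append(row)
    return out
```

For a 4x4 grid of 4-dimensional features this yields a 2x2 grid of 16-dimensional tokens: 16 tokens become 4 (the 4:1 reduction), matching the shapes annotated in the pseudo-code above.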