Implementation:OpenGVLab InternVL InternVLChatModel From Pretrained
| Knowledge Sources | |
|---|---|
| Domains | Vision_Language, Model_Architecture |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Concrete tool for loading the composite InternVL vision-language model (vision encoder, MLP projector, and LLM backbone) from a pretrained checkpoint.
Description
InternVLChatModel is the central multimodal model class in InternVL, extending HuggingFace's PreTrainedModel. It combines:
- vision_model (InternViT): Vision encoder processing image tiles
- mlp1 (MLP projector): 2-layer MLP, applied after pixel-shuffle downsampling, that maps vision features into the LLM embedding space
- language_model: Interchangeable LLM backbone (InternLM2, Qwen2, LLaMA, Phi-3)
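The pixel-shuffle step between vision_model and mlp1 can be sketched as follows; this is a minimal standalone version, assuming the [batch, height, width, channels] feature layout. It folds each 2x2 spatial block into the channel dimension, cutting the visual token count by 4 (1024 to 256 tokens per 448x448 tile) before projection.

```python
import torch

def pixel_shuffle(x: torch.Tensor, scale_factor: float = 0.5) -> torch.Tensor:
    # x: [batch, height, width, channels] vision features
    n, h, w, c = x.shape
    # fold pairs of columns into the channel dimension
    x = x.view(n, h, int(w * scale_factor), int(c / scale_factor))
    x = x.permute(0, 2, 1, 3).contiguous()
    # fold pairs of rows into the channel dimension
    x = x.view(n, int(w * scale_factor), int(h * scale_factor),
               int(c / (scale_factor ** 2)))
    return x.permute(0, 2, 1, 3).contiguous()

feats = torch.randn(1, 32, 32, 1024)  # 448px tile, patch size 14 -> 32x32 grid
out = pixel_shuffle(feats)            # -> [1, 16, 16, 4096]
```

The 16x16 = 256 resulting positions, each with 4x the channel width, are what mlp1 projects into the LLM embedding space.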
The model supports two loading paths:
- from_pretrained(): Loads all components from a single checkpoint
- __init__(config, vision_model, language_model): Assembles from separate components
Usage
Import this class for any InternVL training or inference task. Use from_pretrained when loading an existing InternVL checkpoint for fine-tuning or evaluation.
Code Reference
Source Location
- Repository: InternVL
- File: internvl_chat/internvl/model/internvl_chat/modeling_internvl_chat.py
- Lines: L30-398
Signature
class InternVLChatModel(PreTrainedModel):
config_class = InternVLChatConfig
main_input_name = 'pixel_values'
base_model_prefix = 'language_model'
def __init__(
self,
config: InternVLChatConfig,
vision_model=None,
language_model=None,
use_flash_attn=True,
):
"""
Args:
config: InternVLChatConfig with vision_config, llm_config, and template
vision_model: Optional pre-instantiated InternVisionModel (Path B assembly)
language_model: Optional pre-instantiated LLM (Path B assembly)
use_flash_attn: Enable Flash Attention 2 (default True)
"""
def forward(
self,
pixel_values: torch.FloatTensor,
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
image_flags: Optional[torch.LongTensor] = None,
past_key_values: Optional[List[torch.FloatTensor]] = None,
labels: Optional[torch.LongTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
statistics: Optional[torch.LongTensor] = None,
loss_weight: Optional[List] = None,
loss_reduction_all_gather: Optional[bool] = False,
) -> Union[Tuple, CausalLMOutputWithPast]:
def chat(
self,
tokenizer,
pixel_values,
question,
generation_config,
history=None,
return_history=False,
num_patches_list=None,
IMG_START_TOKEN='<img>',
IMG_END_TOKEN='</img>',
IMG_CONTEXT_TOKEN='<IMG_CONTEXT>',
verbose=False,
) -> str:
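The image-token parameters above control how chat() expands the '<image>' placeholder in the question. A minimal sketch of that expansion, assuming the default of 256 IMG_CONTEXT tokens per tile (the helper name is illustrative, not from the source):

```python
IMG_START_TOKEN = '<img>'
IMG_END_TOKEN = '</img>'
IMG_CONTEXT_TOKEN = '<IMG_CONTEXT>'
NUM_IMAGE_TOKEN = 256  # (448 / 14)**2 * 0.5**2 after pixel shuffle

def expand_image_placeholder(question: str, num_patches: int) -> str:
    # Each tile contributes NUM_IMAGE_TOKEN context tokens wrapped in
    # <img> ... </img>; the language model attends to these positions,
    # which are later overwritten with projected vision features.
    image_tokens = (IMG_START_TOKEN
                    + IMG_CONTEXT_TOKEN * (NUM_IMAGE_TOKEN * num_patches)
                    + IMG_END_TOKEN)
    return question.replace('<image>', image_tokens, 1)
```

For a two-tile image, the prompt therefore carries 512 IMG_CONTEXT tokens between the start and end markers.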
Import
from internvl.model.internvl_chat import InternVLChatModel
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_name_or_path | str | Yes | HuggingFace model ID or local path to checkpoint |
| torch_dtype | torch.dtype | No | Model precision (typically torch.bfloat16) |
| config | InternVLChatConfig | No | Configuration with vision_config, llm_config, template |
Outputs
| Name | Type | Description |
|---|---|---|
| model | InternVLChatModel | Composite model with vision_model, mlp1, and language_model submodules |
| forward() returns | CausalLMOutputWithPast | Loss and logits for training |
| chat() returns | str | Generated text response for inference |
Usage Examples
Loading for Fine-tuning (Path A)
import torch
from internvl.model.internvl_chat import InternVLChatModel
# Load complete model from checkpoint
model = InternVLChatModel.from_pretrained(
'OpenGVLab/InternVL2_5-8B',
torch_dtype=torch.bfloat16,
)
Assembly from Components (Path B)
from internvl.model.internvl_chat import InternVLChatModel, InternVLChatConfig
from internvl.model.internvl_chat.modeling_intern_vit import InternVisionModel
from transformers import AutoModelForCausalLM, AutoConfig
# Load separate components
vision_model = InternVisionModel.from_pretrained('path/to/InternViT-300M')
llm = AutoModelForCausalLM.from_pretrained('path/to/internlm2_5-7b-chat')
config = InternVLChatConfig.from_pretrained('path/to/config')
# Assemble composite model (MLP projector randomly initialized)
model = InternVLChatModel(config, vision_model=vision_model, language_model=llm)
Inference with chat()
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('OpenGVLab/InternVL2_5-8B', trust_remote_code=True)
# chat() mutates generation_config by key, so pass a plain dict
generation_config = dict(max_new_tokens=512, do_sample=False)
# Load and preprocess image
pixel_values = preprocess_image('photo.jpg') # [N_tiles, 3, 448, 448]
response = model.chat(
tokenizer=tokenizer,
pixel_values=pixel_values.to(model.device),
question='<image>\nDescribe this image in detail.',
generation_config=generation_config,
)
print(response)
Related Pages
Implements Principle
Requires Environment
Uses Heuristic