Implementation:OpenGVLab InternVL InternVLChatModel From Pretrained
| Knowledge Sources | |
|---|---|
| Domains | Vision_Language, Model_Architecture |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Concrete tool for loading the composite InternVL vision-language model (vision encoder, MLP projector, and LLM backbone) from a pretrained checkpoint.
Description
InternVLChatModel is the central multimodal model class in InternVL, extending HuggingFace's PreTrainedModel. It combines:
- vision_model (InternViT): Vision encoder processing image tiles
- mlp1 (MLP projector): 2-layer MLP, applied after pixel-shuffle downsampling, that maps vision features into the LLM embedding space
- language_model: Interchangeable LLM backbone (InternLM2, Qwen2, LLaMA, Phi-3)
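The pixel-shuffle step between vision_model and mlp1 can be sketched as follows; this is a minimal standalone version, assuming the [batch, height, width, channels] feature layout. It folds each 2x2 spatial block into the channel dimension, cutting the visual token count by 4 (1024 to 256 tokens per 448x448 tile) before projection.

```python
import torch

def pixel_shuffle(x: torch.Tensor, scale_factor: float = 0.5) -> torch.Tensor:
    # x: [batch, height, width, channels] vision features
    n, h, w, c = x.shape
    # fold pairs of columns into the channel dimension
    x = x.view(n, h, int(w * scale_factor), int(c / scale_factor))
    x = x.permute(0, 2, 1, 3).contiguous()
    # fold pairs of rows into the channel dimension
    x = x.view(n, int(w * scale_factor), int(h * scale_factor),
               int(c / (scale_factor ** 2)))
    return x.permute(0, 2, 1, 3).contiguous()

feats = torch.randn(1, 32, 32, 1024)  # 448px tile, patch size 14 -> 32x32 grid
out = pixel_shuffle(feats)            # -> [1, 16, 16, 4096]
```

The 16x16 = 256 resulting positions, each with 4x the channel width, are what mlp1 projects into the LLM embedding space.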
The model supports two loading paths:
- from_pretrained(): Loads all components from a single checkpoint
- __init__(config, vision_model, language_model): Assembles from separate components
Usage
Import this class for any InternVL training or inference task. Use from_pretrained when loading an existing InternVL checkpoint for fine-tuning or evaluation.
Code Reference
Source Location
- Repository: InternVL
- File: internvl_chat/internvl/model/internvl_chat/modeling_internvl_chat.py
- Lines: L30-398
Signature
class InternVLChatModel(PreTrainedModel):
config_class = InternVLChatConfig
main_input_name = 'pixel_values'
base_model_prefix = 'language_model'
def __init__(
self,
config: InternVLChatConfig,
vision_model=None,
language_model=None,
use_flash_attn=True,
):
"""
Args:
config: InternVLChatConfig with vision_config, llm_config, and template
vision_model: Optional pre-instantiated InternVisionModel (Path B assembly)
language_model: Optional pre-instantiated LLM (Path B assembly)
use_flash_attn: Enable Flash Attention 2 (default True)
"""
def forward(
self,
pixel_values: torch.FloatTensor,
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
image_flags: Optional[torch.LongTensor] = None,
past_key_values: Optional[List[torch.FloatTensor]] = None,
labels: Optional[torch.LongTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
statistics: Optional[torch.LongTensor] = None,
loss_weight: Optional[List] = None,
loss_reduction_all_gather: Optional[bool] = False,
) -> Union[Tuple, CausalLMOutputWithPast]:
def chat(
self,
tokenizer,
pixel_values,
question,
generation_config,
history=None,
return_history=False,
num_patches_list=None,
IMG_START_TOKEN='<img>',
IMG_END_TOKEN='</img>',
IMG_CONTEXT_TOKEN='<IMG_CONTEXT>',
verbose=False,
) -> str:
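The image-token parameters above control how chat() expands the '<image>' placeholder in the question. A minimal sketch of that expansion, assuming the default of 256 IMG_CONTEXT tokens per tile (the helper name is illustrative, not from the source):

```python
IMG_START_TOKEN = '<img>'
IMG_END_TOKEN = '</img>'
IMG_CONTEXT_TOKEN = '<IMG_CONTEXT>'
NUM_IMAGE_TOKEN = 256  # (448 / 14)**2 * 0.5**2 after pixel shuffle

def expand_image_placeholder(question: str, num_patches: int) -> str:
    # Each tile contributes NUM_IMAGE_TOKEN context tokens wrapped in
    # <img> ... </img>; the language model attends to these positions,
    # which are later overwritten with projected vision features.
    image_tokens = (IMG_START_TOKEN
                    + IMG_CONTEXT_TOKEN * (NUM_IMAGE_TOKEN * num_patches)
                    + IMG_END_TOKEN)
    return question.replace('<image>', image_tokens, 1)
```

For a two-tile image, the prompt therefore carries 512 IMG_CONTEXT tokens between the start and end markers.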
Import
from internvl.model.internvl_chat import InternVLChatModel
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_name_or_path | str | Yes | HuggingFace model ID or local path to checkpoint |
| torch_dtype | torch.dtype | No | Model precision (typically torch.bfloat16) |
| config | InternVLChatConfig | No | Configuration with vision_config, llm_config, template |
Outputs
| Name | Type | Description |
|---|---|---|
| model | InternVLChatModel | Composite model with vision_model, mlp1, and language_model submodules |
| forward() returns | CausalLMOutputWithPast | Loss and logits for training |
| chat() returns | str | Generated text response for inference |
Usage Examples
Loading for Fine-tuning (Path A)
import torch
from internvl.model.internvl_chat import InternVLChatModel
# Load complete model from checkpoint
model = InternVLChatModel.from_pretrained(
'OpenGVLab/InternVL2_5-8B',
torch_dtype=torch.bfloat16,
)
Assembly from Components (Path B)
from internvl.model.internvl_chat import InternVLChatModel, InternVLChatConfig
from internvl.model.internvl_chat.modeling_intern_vit import InternVisionModel
from transformers import AutoModelForCausalLM, AutoConfig
# Load separate components
vision_model = InternVisionModel.from_pretrained('path/to/InternViT-300M')
llm = AutoModelForCausalLM.from_pretrained('path/to/internlm2_5-7b-chat')
config = InternVLChatConfig.from_pretrained('path/to/config')
# Assemble composite model (MLP projector randomly initialized)
model = InternVLChatModel(config, vision_model=vision_model, language_model=llm)
Inference with chat()
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('OpenGVLab/InternVL2_5-8B', trust_remote_code=True)
# chat() mutates generation_config by key, so pass a plain dict
generation_config = dict(max_new_tokens=512, do_sample=False)
# Load and preprocess image
pixel_values = preprocess_image('photo.jpg') # [N_tiles, 3, 448, 448]
response = model.chat(
tokenizer=tokenizer,
pixel_values=pixel_values.to(model.device),
question='<image>\nDescribe this image in detail.',
generation_config=generation_config,
)
print(response)
Related Pages
Implements Principle
Requires Environment
Uses Heuristic