Implementation: OpenGVLab InternVL InternVisionModel From Pretrained
| Knowledge Sources | |
|---|---|
| Domains | Vision_Language, Model_Architecture |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A concrete tool, provided by the InternVL model framework, for loading the InternViT vision encoder from a pretrained checkpoint for component assembly.
Description
InternVisionModel is the vision encoder component of InternVL, based on the Vision Transformer (ViT) architecture. It processes image tiles into visual feature sequences. The model supports Flash Attention 2 and gradient checkpointing.
When used for component assembly (pretraining Path B), it is loaded separately and passed to the InternVLChatModel constructor.
Usage
Load this model when assembling InternVL from separate components during Stage 1 pretraining, or when extracting visual features independently.
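For the standalone feature-extraction case, a minimal sketch (the checkpoint path and dtype are illustrative, and the dummy input stands in for image tiles already preprocessed to the encoder's 448x448 resolution):

```python
import torch
from internvl.model.internvl_chat.modeling_intern_vit import InternVisionModel

# Illustrative checkpoint path; substitute your own.
vision_model = InternVisionModel.from_pretrained(
    './pretrained/InternViT-300M-448px',
    torch_dtype=torch.bfloat16,
).eval()

# Dummy batch of 2 preprocessed tiles, [B, 3, H, W].
pixel_values = torch.randn(2, 3, 448, 448, dtype=torch.bfloat16)

with torch.no_grad():
    out = vision_model(pixel_values=pixel_values, return_dict=True)

features = out.last_hidden_state  # [B, N_patches, D]
```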
Code Reference
Source Location
- Repository: InternVL
- File: internvl_chat/internvl/model/internvl_chat/modeling_intern_vit.py
- Lines: L364-431
Signature
class InternVisionModel(PreTrainedModel):
    main_input_name = 'pixel_values'
    _supports_flash_attn_2 = True
    supports_gradient_checkpointing = True
    config_class = InternVisionConfig
    _no_split_modules = ['InternVisionEncoderLayer']

    def __init__(self, config: InternVisionConfig):
        """
        Args:
            config: InternVisionConfig with:
                hidden_size, intermediate_size, num_hidden_layers,
                num_attention_heads, image_size, patch_size,
                drop_path_rate, etc.
        """

    def forward(
        self,
        pixel_values: Optional[torch.FloatTensor] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
        pixel_embeds: Optional[torch.FloatTensor] = None,
    ) -> Union[Tuple, BaseModelOutputWithPooling]:
        """
        Args:
            pixel_values: [B, 3, H, W] image tensors
            pixel_embeds: Optional pre-computed embeddings (skip patch embedding)
        Returns:
            BaseModelOutputWithPooling with last_hidden_state [B, N_patches, D]
        """
Import
from internvl.model.internvl_chat.modeling_intern_vit import InternVisionModel
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| vision_path | str | Yes | Path to pretrained InternViT checkpoint |
| pixel_values | torch.FloatTensor | Yes | Image tensors [B, 3, H, W] |
| config.drop_path_rate | float | No | Stochastic depth rate (default 0.0) |
Outputs
| Name | Type | Description |
|---|---|---|
| model | InternVisionModel | Vision encoder ready for assembly into InternVLChatModel |
| forward() returns | BaseModelOutputWithPooling | Visual features [B, N_patches, D] |
Usage Examples
Load for Component Assembly
import torch
from internvl.model.internvl_chat.modeling_intern_vit import InternVisionModel

# Load pretrained vision encoder
vision_model = InternVisionModel.from_pretrained(
    './pretrained/InternViT-300M-448px',
    torch_dtype=torch.bfloat16,
)

# Use for component assembly
from internvl.model.internvl_chat import InternVLChatModel, InternVLChatConfig
from transformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained('./pretrained/internlm2_5-7b-chat')
config = InternVLChatConfig.from_pretrained('./pretrained/config')
model = InternVLChatModel(config, vision_model=vision_model, language_model=llm)
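Since the class advertises Flash Attention 2 and gradient checkpointing support, both can be enabled through the standard transformers interfaces. A hedged sketch (the attn_implementation kwarg and gradient_checkpointing_enable() are generic transformers mechanisms, and Flash Attention 2 additionally requires the flash-attn package):

```python
import torch
from internvl.model.internvl_chat.modeling_intern_vit import InternVisionModel

# Request Flash Attention 2 at load time (illustrative path; needs flash-attn).
vision_model = InternVisionModel.from_pretrained(
    './pretrained/InternViT-300M-448px',
    torch_dtype=torch.bfloat16,
    attn_implementation='flash_attention_2',
)

# Trade compute for memory during Stage 1 pretraining.
vision_model.gradient_checkpointing_enable()
```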