
Implementation:OpenGVLab InternVL InternVisionModel From Pretrained

From Leeroopedia


Knowledge Sources
Domains Vision_Language, Model_Architecture
Last Updated 2026-02-07 00:00 GMT

Overview

Concrete tool, provided by the InternVL model framework, for loading the InternViT vision encoder from a pretrained checkpoint during component assembly.

Description

InternVisionModel is the vision encoder component of InternVL, based on the Vision Transformer (ViT) architecture. It processes image tiles into visual feature sequences. The model supports Flash Attention 2 and gradient checkpointing.

When used for component assembly (pretraining Path B), it is loaded separately and passed to the InternVLChatModel constructor.

Usage

Load this model when assembling InternVL from separate components during Stage 1 pretraining, or when extracting visual features independently.

Code Reference

Source Location

  • Repository: InternVL
  • File: internvl_chat/internvl/model/internvl_chat/modeling_intern_vit.py
  • Lines: L364-431

Signature

class InternVisionModel(PreTrainedModel):
    main_input_name = 'pixel_values'
    _supports_flash_attn_2 = True
    supports_gradient_checkpointing = True
    config_class = InternVisionConfig
    _no_split_modules = ['InternVisionEncoderLayer']

    def __init__(self, config: InternVisionConfig):
        """
        Args:
            config: InternVisionConfig with:
                hidden_size, intermediate_size, num_hidden_layers,
                num_attention_heads, image_size, patch_size,
                drop_path_rate, etc.
        """

    def forward(
        self,
        pixel_values: Optional[torch.FloatTensor] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
        pixel_embeds: Optional[torch.FloatTensor] = None,
    ) -> Union[Tuple, BaseModelOutputWithPooling]:
        """
        Args:
            pixel_values: [B, 3, H, W] image tensors
            pixel_embeds: Optional pre-computed embeddings (skip patch embedding)
        Returns:
            BaseModelOutputWithPooling with last_hidden_state [B, N_patches, D]
        """
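The N_patches dimension of last_hidden_state follows directly from the config. A quick sanity check, assuming a class token is prepended to the patch sequence (as in InternViT; the helper name is illustrative):

```python
def expected_seq_len(image_size: int, patch_size: int, class_token: bool = True) -> int:
    """Number of tokens in last_hidden_state for a square input image."""
    grid = image_size // patch_size          # patches per side
    return grid * grid + (1 if class_token else 0)

# InternViT-300M-448px: 448 // 14 = 32 patches per side
print(expected_seq_len(448, 14))  # -> 1025 (1024 patches + class token)
```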

Import

from internvl.model.internvl_chat.modeling_intern_vit import InternVisionModel

I/O Contract

Inputs

Name | Type | Required | Description
vision_path | str | Yes | Path to the pretrained InternViT checkpoint (passed to from_pretrained)
pixel_values | torch.FloatTensor | Yes | Image tensors [B, 3, H, W]
config.drop_path_rate | float | No | Stochastic depth rate (default 0.0)
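drop_path_rate controls stochastic depth. Following the common ViT convention (assumed here for modeling_intern_vit.py as well), the rate is spread linearly across encoder layers, so early layers drop less than late ones; a pure-Python sketch of that schedule (function name illustrative):

```python
def drop_path_schedule(drop_path_rate: float, num_layers: int) -> list[float]:
    """Per-layer stochastic-depth rates, increasing linearly to drop_path_rate."""
    if num_layers == 1:
        return [drop_path_rate]
    step = drop_path_rate / (num_layers - 1)
    return [i * step for i in range(num_layers)]

print([round(x, 3) for x in drop_path_schedule(0.1, 5)])  # -> [0.0, 0.025, 0.05, 0.075, 0.1]
```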

Outputs

Name | Type | Description
model | InternVisionModel | Vision encoder ready for assembly into InternVLChatModel
forward() return | BaseModelOutputWithPooling | Visual features, last_hidden_state [B, N_patches, D]

Usage Examples

Load for Component Assembly

import torch
from internvl.model.internvl_chat.modeling_intern_vit import InternVisionModel

# Load pretrained vision encoder
vision_model = InternVisionModel.from_pretrained(
    './pretrained/InternViT-300M-448px',
    torch_dtype=torch.bfloat16,
)

# Use for component assembly
from internvl.model.internvl_chat import InternVLChatModel, InternVLChatConfig
from transformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained('./pretrained/internlm2_5-7b-chat')
config = InternVLChatConfig.from_pretrained('./pretrained/config')
model = InternVLChatModel(config, vision_model=vision_model, language_model=llm)
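Extract Visual Features

Besides assembly, the encoder can be run standalone to extract visual features. A minimal sketch, assuming the same local checkpoint path as above (illustrative); imports and the checkpoint load are guarded so the snippet degrades to a no-op where internvl or the checkpoint is unavailable:

```python
import os

# Expected sequence length: 1 class token + (448 // 14)^2 patches
n_tokens = 1 + (448 // 14) ** 2

try:
    import torch
    from internvl.model.internvl_chat.modeling_intern_vit import InternVisionModel
    deps_ok = True
except ImportError:
    deps_ok = False  # torch / internvl not installed in this environment

CKPT = './pretrained/InternViT-300M-448px'  # illustrative local path

if deps_ok and os.path.isdir(CKPT):
    model = InternVisionModel.from_pretrained(CKPT, torch_dtype=torch.bfloat16)
    model.eval()
    # Optional, for training: model.gradient_checkpointing_enable()
    pixel_values = torch.randn(2, 3, 448, 448, dtype=torch.bfloat16)
    with torch.no_grad():
        out = model(pixel_values=pixel_values, return_dict=True)
    features = out.last_hidden_state  # [2, n_tokens, D]
    assert features.shape[:2] == (2, n_tokens)
```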

