Implementation:OpenGVLab InternVL InternViT 14B Model

Knowledge Sources	OpenGVLab_InternVL
Domains	Vision Transformer, Visual Encoder, InternVL
Last Updated	2026-02-07 14:00 GMT

Overview

This module implements the InternViT vision transformer as the visual encoder component within the InternVL-14B composite model architecture.

Description

This implementation is structurally identical to the standalone intern_vit_6b variant. It is co-located in the internvl_14b package so that the InternVL-14B composite model can import it directly. It provides:

InternVisionEmbeddings with patch embedding via Conv2d, learnable class token, and interpolatable position embeddings using bicubic interpolation for variable resolution support.

InternAttention with fused QKV projection, optional QK normalization (InternRMSNorm), and FlashAttention support. The attention module supports both a naive attention path and a flash attention path.

InternVisionEncoderLayer with pre-norm RMSNorm, learnable layer scale (ls1, ls2 parameters initialized to initializer_factor), and stochastic depth (DropPath) with linearly increasing drop rates.

InternVisionEncoder with gradient checkpointing enabled by default for training.

InternVisionModel with CLS token pooling and a resize_pos_embeddings method for adapting to different image sizes.
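The "linearly increasing drop rates" used for stochastic depth can be illustrated with a short sketch. It assumes the schedule starts at 0 for the first layer and reaches a configured maximum at the last layer, which is the common convention. The function name and the example values (4 layers, maximum rate 0.3) are hypothetical and not taken from the source file.

```python
from typing import List


def linear_drop_path_rates(max_rate: float, num_layers: int) -> List[float]:
    """Return one drop-path rate per layer, rising linearly from 0 to max_rate."""
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    if num_layers == 1:
        # Degenerate case: only the start of the schedule exists.
        return [0.0]
    step = max_rate / (num_layers - 1)
    return [step * i for i in range(num_layers)]


# Hypothetical example values: 4 encoder layers, maximum drop rate 0.3.
rates = linear_drop_path_rates(max_rate=0.3, num_layers=4)
```

Under this schedule, early layers are almost never skipped during training while deeper layers are dropped more often.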

The module defines InternRMSNorm in pure PyTorch and automatically substitutes apex's FusedRMSNorm when apex is installed.
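For reference, RMS normalization scales a feature vector by the reciprocal of its root mean square and then applies a learnable per-feature weight. The plain-Python sketch below shows that arithmetic on a single vector. It is illustrative only, not the module's tensor implementation, and the `eps` default is an assumed value.

```python
import math
from typing import List


def rms_norm(x: List[float], weight: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize x by its root mean square, then scale element-wise by weight."""
    if len(x) != len(weight):
        raise ValueError("x and weight must have the same length")
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [w * v * inv_rms for v, w in zip(x, weight)]


# With unit weights, the output vector has a root mean square close to 1.
normalized = rms_norm([3.0, 4.0], weight=[1.0, 1.0])
```

Unlike LayerNorm, this form subtracts no mean and adds no bias; only the scale of the vector is normalized.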

This version is imported by modeling_internvl.py to construct the vision encoder component of the InternVL-14B model, alongside the QLLaMA query decoder.

Usage

Use this as the vision encoder within the InternVL-14B composite model, where it provides visual features to be processed through the QLLaMA cross-attention bridge.

Code Reference

Source Location

Repository: OpenGVLab_InternVL
File: internvl_chat_llava/llava/model/multimodal_encoder/internvl_14b/modeling_intern_vit.py
Lines: 1-354

Signature

class InternVisionModel(PreTrainedModel):
    main_input_name = 'pixel_values'
    config_class = InternVisionConfig

    def __init__(self, config: InternVisionConfig):
        ...

    def resize_pos_embeddings(self, old_size, new_size, patch_size):
        ...

    def forward(self, pixel_values=None, output_hidden_states=None,
                return_dict=None, pixel_embeds=None):
        ...
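The idea behind resize_pos_embeddings can be sketched as a standalone function. This is a minimal sketch, assuming the position embedding has shape [1, 1 + N, C] with the class-token position first and a square patch grid. The function name `resize_position_embedding` is hypothetical, and the method in the source file may differ in details such as logging or in-place parameter replacement.

```python
import torch
import torch.nn.functional as F


def resize_position_embedding(pos_emb: torch.Tensor, old_size: int,
                              new_size: int, patch_size: int) -> torch.Tensor:
    """Resize a [1, 1 + N, C] position embedding (class token first) to a new image size."""
    old_grid = old_size // patch_size
    new_grid = new_size // patch_size
    embed_dim = pos_emb.shape[-1]
    # The class-token position is kept as-is; only the patch grid is interpolated.
    cls_emb, patch_emb = pos_emb[:, :1, :], pos_emb[:, 1:, :]
    # [1, N, C] -> [1, C, old_grid, old_grid] so the grid can be treated as an image.
    patch_emb = patch_emb.reshape(1, old_grid, old_grid, embed_dim).permute(0, 3, 1, 2)
    patch_emb = F.interpolate(patch_emb.float(), size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    # Back to [1, new_grid * new_grid, C], restoring the original dtype.
    patch_emb = patch_emb.to(pos_emb.dtype).reshape(1, embed_dim, -1).permute(0, 2, 1)
    return torch.cat([cls_emb, patch_emb], dim=1)


# Hypothetical example: a 16x16 patch grid (224 / 14) resized to 32x32 (448 / 14).
pos = torch.randn(1, 1 + 16 * 16, 8)
resized = resize_position_embedding(pos, old_size=224, new_size=448, patch_size=14)
```

Interpolating in float32 and casting back afterwards avoids relying on bicubic kernels for half-precision dtypes.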

Import

from internvl_chat_llava.llava.model.multimodal_encoder.internvl_14b.modeling_intern_vit import (
    InternVisionModel,
    InternVisionEmbeddings,
    InternVisionEncoder,
)

I/O Contract

Inputs

Name	Type	Required	Description
pixel_values	torch.FloatTensor [batch, 3, height, width]	Yes (or pixel_embeds)	Input images for patch embedding
pixel_embeds	torch.FloatTensor [batch, seq_len, hidden_size]	No	Pre-computed patch embeddings (bypasses embedding layer)
output_hidden_states	bool	No	Whether to return all hidden states
return_dict	bool	No	Whether to return a ModelOutput

Outputs

Name	Type	Description
last_hidden_state	torch.FloatTensor [batch, seq_len, hidden_size]	Hidden states from the final encoder layer
pooler_output	torch.FloatTensor [batch, hidden_size]	CLS token output
hidden_states	tuple(torch.FloatTensor)	All hidden states when output_hidden_states=True

Usage Examples

Basic Usage

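from internvl_chat_llava.llava.model.multimodal_encoder.internvl_14b.modeling_intern_vit import (
    InternVisionModel,
)

# Typically instantiated as part of InternVLModel;
# see modeling_internvl.py for integration context.
# Assumes `config` is a composite config exposing `vision_config`
# and `images` is a float tensor of shape [batch, 3, height, width].
vision_model = InternVisionModel(config.vision_config)
outputs = vision_model(pixel_values=images)
image_embeds = outputs.last_hidden_state  # [batch, num_patches+1, hidden_size]
pooled = outputs.pooler_output  # [batch, hidden_size], CLS token output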
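
The sequence length in the shape comment above follows from the patch grid: one token per patch plus the class token. The quick check below assumes square images whose side is a multiple of the patch size. The values 224 and 14 are illustrative; the real values should be read from the InternVisionConfig in use.

```python
def vit_sequence_length(image_size: int, patch_size: int) -> int:
    """Number of tokens for a square image: one per patch plus the class token."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    grid = image_size // patch_size
    return grid * grid + 1


# Illustrative values: a 224x224 image with 14x14 patches gives 16 * 16 + 1 tokens.
seq_len = vit_sequence_length(image_size=224, patch_size=14)
```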
