Implementation:OpenGVLab InternVL InternViT 14B Model
| Knowledge Sources | |
|---|---|
| Domains | Vision Transformer, Visual Encoder, InternVL |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
This module implements the InternViT vision transformer as the visual encoder component within the InternVL-14B composite model architecture.
Description
This is a copy of the InternViT architecture from the standalone intern_vit_6b variant, structurally identical to it and co-located in the internvl_14b package for direct integration with the InternVL-14B composite model. It provides:
- InternVisionEmbeddings: patch embedding via Conv2d, a learnable class token, and position embeddings that can be resized with bicubic interpolation to support variable resolutions.
- InternAttention: fused QKV projection, optional QK normalization (InternRMSNorm), and two attention paths, a naive one and a FlashAttention one.
- InternVisionEncoderLayer: pre-norm RMSNorm, learnable layer scale (ls1, ls2 parameters initialized to initializer_factor), and stochastic depth (DropPath) with linearly increasing drop rates.
- InternVisionEncoder: the stack of encoder layers, with gradient checkpointing enabled by default for training.
- InternVisionModel: CLS token pooling and a resize_pos_embeddings method for adapting to different image sizes.
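The "linearly increasing drop rates" used for stochastic depth can be sketched in plain Python. This is a minimal illustration, assuming the schedule ramps evenly from 0 at the first layer to a configured maximum at the last layer; the function name and the example values are hypothetical and are not taken from the module.

```python
from typing import List


def linear_drop_path_rates(max_rate: float, num_layers: int) -> List[float]:
    """Return one drop probability per layer, ramping linearly from 0 to max_rate.

    Hypothetical helper: it illustrates the schedule only, not the module's code.
    """
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    if not 0.0 <= max_rate < 1.0:
        raise ValueError("max_rate must be in [0, 1)")
    if num_layers == 1:
        # A single layer sits at the start of the ramp, so it is never dropped.
        return [0.0]
    return [max_rate * i / (num_layers - 1) for i in range(num_layers)]


# Illustrative values: five layers with a maximum drop rate of 0.1.
rates = linear_drop_path_rates(max_rate=0.1, num_layers=5)
```

Early layers are dropped rarely and deep layers more often, which is the usual motivation for a ramp rather than a constant rate.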
The module uses InternRMSNorm for normalization and automatically substitutes apex's FusedRMSNorm when apex is available.
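As a reminder of what an RMSNorm layer computes, here is a minimal plain-Python sketch. It assumes the standard RMSNorm definition (divide each element by the root-mean-square of the vector, then multiply by a learned per-element weight). The function name and the eps default are illustrative assumptions, and the real layer operates on tensors along the hidden dimension.

```python
import math
from typing import List, Sequence


def rms_norm(x: Sequence[float], weight: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Normalize x by its root-mean-square and apply a per-element weight.

    Hypothetical helper for illustration; eps guards against division by zero.
    """
    if len(x) != len(weight):
        raise ValueError("x and weight must have the same length")
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [w * v * inv_rms for v, w in zip(x, weight)]


# With unit weights, the output vector has a mean square close to 1.
normalized = rms_norm([3.0, 4.0], weight=[1.0, 1.0])
```

Unlike LayerNorm, this form does not subtract the mean and has no bias term.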
This version is imported by modeling_internvl.py to construct the vision encoder component of the InternVL-14B model, alongside the QLLaMA query decoder.
Usage
Use this as the vision encoder within the InternVL-14B composite model, where it provides visual features to be processed through the QLLaMA cross-attention bridge.
Code Reference
Source Location
- Repository: OpenGVLab_InternVL
- File: internvl_chat_llava/llava/model/multimodal_encoder/internvl_14b/modeling_intern_vit.py
- Lines: 1-354
Signature
```python
class InternVisionModel(PreTrainedModel):
    main_input_name = 'pixel_values'
    config_class = InternVisionConfig

    def __init__(self, config: InternVisionConfig):
        ...

    def resize_pos_embeddings(self, old_size, new_size, patch_size):
        ...

    def forward(self, pixel_values=None, output_hidden_states=None,
                return_dict=None, pixel_embeds=None):
        ...
```
Import
```python
from internvl_chat_llava.llava.model.multimodal_encoder.internvl_14b.modeling_intern_vit import (
    InternVisionModel,
    InternVisionEmbeddings,
    InternVisionEncoder,
)
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| pixel_values | torch.FloatTensor [batch, 3, height, width] | Yes (or pixel_embeds) | Input images for patch embedding |
| pixel_embeds | torch.FloatTensor [batch, seq_len, hidden_size] | No | Pre-computed patch embeddings (bypasses embedding layer) |
| output_hidden_states | bool | No | Whether to return all hidden states |
| return_dict | bool | No | Whether to return a ModelOutput |
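The relationship between pixel_values and pixel_embeds in the table can be sketched as a small shape check. This is a hedged illustration of the contract as documented here, not the module's code: the helper name is hypothetical, and giving pixel_embeds precedence when both inputs are supplied is an assumption based on the note that it bypasses the embedding layer.

```python
from typing import Optional, Sequence


def select_encoder_input(
    pixel_values_shape: Optional[Sequence[int]],
    pixel_embeds_shape: Optional[Sequence[int]],
) -> str:
    """Decide which input feeds the encoder, based only on the given shapes.

    Hypothetical helper mirroring the Inputs table above.
    """
    if pixel_values_shape is None and pixel_embeds_shape is None:
        raise ValueError("provide either pixel_values or pixel_embeds")
    if pixel_embeds_shape is not None:
        # Assumption: precomputed embeddings win and skip the embedding layer.
        if len(pixel_embeds_shape) != 3:
            raise ValueError("pixel_embeds must be [batch, seq_len, hidden_size]")
        return "pixel_embeds"
    if len(pixel_values_shape) != 4 or pixel_values_shape[1] != 3:
        raise ValueError("pixel_values must be [batch, 3, height, width]")
    return "pixel_values"


# Illustrative shapes: a batch of two 224x224 RGB images.
chosen = select_encoder_input((2, 3, 224, 224), None)
```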
Outputs
| Name | Type | Description |
|---|---|---|
| last_hidden_state | torch.FloatTensor [batch, seq_len, hidden_size] | Hidden states from the final encoder layer |
| pooler_output | torch.FloatTensor [batch, hidden_size] | CLS token output |
| hidden_states | tuple(torch.FloatTensor) | All hidden states when output_hidden_states=True |
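The output shapes follow from simple arithmetic on the image and patch sizes. The sketch below assumes non-overlapping square patches (a Conv2d whose stride equals its kernel size, with no padding) plus a single class token. The helper names, the patch size of 14, and the hidden size of 3200 are illustrative assumptions rather than values stated in this document.

```python
from typing import Dict, Tuple


def encoder_sequence_length(height: int, width: int, patch_size: int) -> int:
    """Number of tokens: one per patch, plus one class token."""
    # Floor division: an unpadded strided convolution drops any remainder pixels.
    return (height // patch_size) * (width // patch_size) + 1


def output_shapes(batch: int, height: int, width: int,
                  patch_size: int, hidden_size: int) -> Dict[str, Tuple[int, ...]]:
    """Shapes of the two main outputs listed in the Outputs table."""
    seq_len = encoder_sequence_length(height, width, patch_size)
    return {
        "last_hidden_state": (batch, seq_len, hidden_size),
        # CLS pooling keeps only the token at sequence index 0.
        "pooler_output": (batch, hidden_size),
    }


# Illustrative values only: doubling the resolution roughly quadruples seq_len.
low_res = encoder_sequence_length(224, 224, patch_size=14)
high_res = encoder_sequence_length(448, 448, patch_size=14)
shapes = output_shapes(batch=2, height=224, width=224, patch_size=14, hidden_size=3200)
```

Because the token count depends on the image size, the position embeddings must be resized whenever the resolution changes, which is the role of resize_pos_embeddings.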
Usage Examples
Basic Usage
```python
from internvl_chat_llava.llava.model.multimodal_encoder.internvl_14b.modeling_intern_vit import (
    InternVisionModel
)

# Typically instantiated as part of InternVLModel
# See modeling_internvl.py for integration context
# `config` and `images` are assumed to be defined by the surrounding setup:
# `images` is a float tensor of shape [batch, 3, height, width]
vision_model = InternVisionModel(config.vision_config)
outputs = vision_model(pixel_values=images)
image_embeds = outputs.last_hidden_state  # [batch, num_patches+1, hidden_size]
```