Implementation:OpenGVLab InternVL InternViT 6B Model
| Knowledge Sources | |
|---|---|
| Domains | Vision Transformer, Visual Encoder, LLaVA |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
This module implements the InternViT-6B vision transformer model as a standalone visual encoder for use within the LLaVA multimodal framework.
Description
The InternViT-6B implementation provides a high-capacity vision transformer with several distinguishing features:
- InternVisionEmbeddings converts images to patch tokens via a Conv2d projection, prepends a learnable class token, and adds position embeddings that are resized with bicubic interpolation (via _get_pos_embed) to support variable input resolutions at inference time.
- InternAttention implements multi-head self-attention with a fused QKV projection (a single linear layer whose output width is 3x embed_dim), optional QK normalization using InternRMSNorm for training stability, and FlashAttention support (when available) for memory-efficient attention. The _naive_attn path computes standard scaled dot-product attention, while _flash_attn calls the FlashAttention kernel after rearranging tensors with einops.
- InternMLP provides a standard two-layer feed-forward network with a configurable activation (selected via ACT2FN).
- InternVisionEncoderLayer combines attention and MLP with pre-norm (RMSNorm), learnable layer scale parameters (ls1, ls2), and stochastic depth (DropPath) whose drop rate increases linearly across layers.
- InternVisionEncoder stacks the encoder layers, with gradient checkpointing enabled by default to save memory during training.
- InternVisionModel wraps the embeddings and the encoder and provides CLS token pooling. It also includes a resize_pos_embeddings method that adapts the position embeddings to a new image size via bicubic interpolation.
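The fused QKV projection and optional QK normalization described for InternAttention can be sketched as follows. This is a minimal illustration of the naive (non-flash) path only, not the repository code: the class name FusedQKVAttention, the helper rms_normalize, and the constructor arguments are hypothetical, and the QK normalization here is weight-free for brevity.

```python
import torch
import torch.nn as nn


def rms_normalize(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Weight-free RMS normalization over the last dimension (illustrative)."""
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)


class FusedQKVAttention(nn.Module):
    """Hypothetical stand-in for a naive attention path; not the repo class."""

    def __init__(self, embed_dim: int, num_heads: int, qk_norm: bool = False):
        super().__init__()
        if embed_dim % num_heads != 0:
            raise ValueError("embed_dim must be divisible by num_heads")
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qk_norm = qk_norm
        # One linear layer produces Q, K and V together: width 3 * embed_dim.
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)
        self.proj = nn.Linear(embed_dim, embed_dim)

    def _norm_heads(self, t: torch.Tensor) -> torch.Tensor:
        # [b, heads, n, head_dim] -> normalize over full embed_dim -> same shape
        b, h, n, d = t.shape
        t = rms_normalize(t.transpose(1, 2).reshape(b, n, h * d))
        return t.reshape(b, n, h, d).transpose(1, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, c = x.shape
        # [b, n, 3*c] -> [3, b, heads, n, head_dim], then split into Q, K, V
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4).unbind(0)
        if self.qk_norm:
            q, k = self._norm_heads(q), self._norm_heads(k)
        attn = (q * self.scale) @ k.transpose(-2, -1)  # [b, heads, n, n]
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, c)
        return self.proj(out)


# Tiny demo with toy sizes (the real model uses far larger dimensions).
demo_attn = FusedQKVAttention(embed_dim=64, num_heads=4, qk_norm=True)
demo_out = demo_attn(torch.randn(2, 5, 64))
print(demo_out.shape)
```

Fusing the three projections into one matrix multiply avoids three separate kernel launches per layer; the output is split back into Q, K, and V by a reshape rather than by extra computation.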
The module attempts to use FusedRMSNorm from apex when available, falling back to a pure PyTorch InternRMSNorm implementation.
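The fallback pattern can be illustrated with a short sketch. It assumes the usual RMSNorm formulation (scale by the reciprocal root mean square of the last dimension, then multiply by a learned weight); the names SimpleRMSNorm and RMSNormImpl are hypothetical and the repository's InternRMSNorm may differ in detail.

```python
import torch
import torch.nn as nn


class SimpleRMSNorm(nn.Module):
    """Pure-PyTorch RMSNorm sketch; hypothetical name, not the repo class."""

    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        in_dtype = x.dtype
        # Compute statistics in float32 for numerical stability.
        x = x.to(torch.float32)
        mean_square = x.pow(2).mean(dim=-1, keepdim=True)
        x = x * torch.rsqrt(mean_square + self.eps)
        return self.weight * x.to(in_dtype)


# Prefer the fused apex kernel when it can be imported, else fall back.
try:
    from apex.normalization import FusedRMSNorm as RMSNormImpl
except ImportError:
    RMSNormImpl = SimpleRMSNorm

norm = RMSNormImpl(8)
print(type(norm).__name__, norm(torch.randn(2, 4, 8)).shape)
```

Unlike LayerNorm, RMSNorm does not subtract the mean and has no bias term, which is why a fused kernel and a few lines of PyTorch can be swapped for each other.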
Usage
Use this module as the primary visual backbone in the LLaVA-InternVL configuration. It supplies the 6B-parameter InternViT features that distinguish InternVL from standard CLIP-based LLaVA models.
Code Reference
Source Location
- Repository: OpenGVLab_InternVL
- File: internvl_chat_llava/llava/model/multimodal_encoder/intern_vit_6b/modeling_intern_vit.py
- Lines: 1-354
Signature
class InternVisionModel(PreTrainedModel):
    main_input_name = 'pixel_values'
    config_class = InternVisionConfig

    def __init__(self, config: InternVisionConfig):
        ...

    def resize_pos_embeddings(self, old_size, new_size, patch_size):
        ...

    def forward(self, pixel_values=None, output_hidden_states=None,
                return_dict=None, pixel_embeds=None):
        ...
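As a rough illustration of what a method with the resize_pos_embeddings signature has to do, the sketch below resizes a position-embedding tensor with bicubic interpolation. It is a standalone function rather than the repository method: the name interpolate_pos_embed is hypothetical, and it assumes the embedding has shape [1, 1 + grid * grid, dim] with the class token stored first.

```python
import torch
import torch.nn.functional as F


def interpolate_pos_embed(pos_emb: torch.Tensor, old_size: int,
                          new_size: int, patch_size: int) -> torch.Tensor:
    """Resize [1, 1 + old_grid**2, dim] embeddings to the new patch grid."""
    old_grid = old_size // patch_size
    new_grid = new_size // patch_size
    dim = pos_emb.shape[-1]
    # The class-token embedding has no spatial position, so it is kept as-is.
    cls_emb, patch_emb = pos_emb[:, :1, :], pos_emb[:, 1:, :]
    # [1, old_grid*old_grid, dim] -> [1, dim, old_grid, old_grid]
    patch_emb = patch_emb.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_emb = F.interpolate(patch_emb.float(), size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    # [1, dim, new_grid, new_grid] -> [1, new_grid*new_grid, dim]
    patch_emb = patch_emb.flatten(2).transpose(1, 2).to(pos_emb.dtype)
    return torch.cat([cls_emb, patch_emb], dim=1)


# Toy example: 224px -> 448px with 14px patches, small embedding width.
old = torch.randn(1, 1 + 16 * 16, 32)
new = interpolate_pos_embed(old, old_size=224, new_size=448, patch_size=14)
print(new.shape)
```

The patch embeddings are treated as a 2D image with dim channels, which is what lets a standard image-resizing routine adapt them to a different grid.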
Import
from internvl_chat_llava.llava.model.multimodal_encoder.intern_vit_6b.modeling_intern_vit import (
    InternVisionModel,
    InternVisionEncoder,
    InternAttention,
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| pixel_values | torch.FloatTensor [batch, 3, height, width] | Yes (or pixel_embeds) | Input images for patch embedding |
| pixel_embeds | torch.FloatTensor [batch, seq_len, hidden_size] | No | Pre-computed patch embeddings (bypasses embedding layer) |
| output_hidden_states | bool | No | Whether to return all hidden states from each layer |
| return_dict | bool | No | Whether to return a ModelOutput instead of a tuple |
Outputs
| Name | Type | Description |
|---|---|---|
| last_hidden_state | torch.FloatTensor [batch, seq_len, hidden_size] | Hidden states from the last encoder layer |
| pooler_output | torch.FloatTensor [batch, hidden_size] | CLS token output (first token) |
| hidden_states | tuple(torch.FloatTensor) | All hidden states when output_hidden_states=True |
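The seq_len in the shapes above follows from the patch grid. The helper below is a small worked example, assuming non-overlapping square patches (Conv2d stride equal to patch_size) and a single prepended class token; the function name vit_sequence_length is hypothetical.

```python
def vit_sequence_length(image_size: int, patch_size: int) -> int:
    """Number of tokens: one per patch plus one class token."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    grid = image_size // patch_size
    return grid * grid + 1


# Values used in the usage example on this page.
print(vit_sequence_length(224, 14))  # 16 * 16 + 1 = 257
print(vit_sequence_length(448, 14))  # 32 * 32 + 1 = 1025

# Per-head width for hidden_size=3200 and num_attention_heads=25.
print(3200 // 25)  # 128
```

Doubling the input resolution quadruples the number of patch tokens, which is why the position embeddings must be resized before feeding larger images.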
Usage Examples
Basic Usage
import torch

from internvl_chat_llava.llava.model.multimodal_encoder.intern_vit_6b.modeling_intern_vit import (
    InternVisionModel,
)
from internvl_chat_llava.llava.model.multimodal_encoder.intern_vit_6b.configuration_intern_vit import (
    InternVisionConfig,
)

config = InternVisionConfig(
    hidden_size=3200, num_attention_heads=25, num_hidden_layers=48,
    image_size=224, patch_size=14, use_flash_attn=True,
)
model = InternVisionModel(config)  # weights are randomly initialized here

# Placeholder batch; replace with preprocessed images of shape [batch, 3, 224, 224]
images = torch.randn(1, 3, 224, 224)

# Extract visual features
outputs = model(pixel_values=images)
visual_features = outputs.last_hidden_state  # [batch, num_patches+1, 3200]

# Resize position embeddings for a different input resolution
model.resize_pos_embeddings(old_size=224, new_size=448, patch_size=14)