Implementation:OpenGVLab InternVL InternViT 6B Model

Knowledge Sources	OpenGVLab_InternVL
Domains	Vision Transformer, Visual Encoder, LLaVA
Last Updated	2026-02-07 14:00 GMT

Overview

This module implements the InternViT-6B vision transformer model as a standalone visual encoder for use within the LLaVA multimodal framework.

Description

The InternViT-6B implementation is a high-capacity vision transformer built from the following components:

InternVisionEmbeddings converts images to patch tokens via a Conv2d projection, prepends a learnable class token, and adds position embeddings. The position embeddings are resized by bicubic interpolation (via _get_pos_embed), so input resolutions other than the configured one are supported at inference time.
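
As a rough illustration of why the position embeddings need resizing, the sketch below counts the tokens produced for a given resolution. It assumes square images, a non-overlapping patch grid whose side divides the image size, and a single class token; the helper names patch_grid and num_tokens are hypothetical and not part of the module.

```python
def patch_grid(image_size: int, patch_size: int) -> int:
    """Patches along one side of a square image (non-overlapping patches assumed)."""
    if image_size % patch_size != 0:
        # Assumption made for this sketch only, to keep the arithmetic exact.
        raise ValueError("image_size is assumed to be a multiple of patch_size")
    return image_size // patch_size


def num_tokens(image_size: int, patch_size: int) -> int:
    """Sequence length seen by the encoder: grid * grid patch tokens plus one class token."""
    grid = patch_grid(image_size, patch_size)
    return grid * grid + 1
```

With patch size 14, moving from 224 to 448 pixels grows the grid from 16 x 16 to 32 x 32, so the patch position embeddings have to be interpolated from 256 to 1024 entries.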

InternAttention implements multi-head self-attention with a fused QKV projection (a single linear layer whose output width is 3 × embed_dim), optional QK normalization with InternRMSNorm for training stability, and FlashAttention support, used when the package is available, for memory-efficient attention. The _naive_attn path computes standard scaled dot-product attention, while _flash_attn calls the FlashAttention kernel after rearranging tensors with einops.
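
To make the fused projection concrete, the following pure-Python sketch splits one token's fused output of width 3 × embed_dim into per-head query, key, and value vectors. The [q | k | v] ordering, with heads stored contiguously inside each third, is an assumption about the layout, and split_fused_qkv and attention_scale are hypothetical helpers; the module itself works on batched tensors.

```python
from typing import List, Tuple

Heads = List[List[float]]  # one vector of length head_dim per attention head


def split_fused_qkv(
    fused: List[float], embed_dim: int, num_heads: int
) -> Tuple[Heads, Heads, Heads]:
    """Split a fused QKV vector into per-head q, k, v (assumed [q | k | v] layout)."""
    if len(fused) != 3 * embed_dim:
        raise ValueError("fused vector must have length 3 * embed_dim")
    if embed_dim % num_heads != 0:
        raise ValueError("embed_dim must be divisible by num_heads")
    head_dim = embed_dim // num_heads

    def to_heads(chunk: List[float]) -> Heads:
        # Cut one embed_dim-wide third into num_heads slices of head_dim each.
        return [chunk[h * head_dim:(h + 1) * head_dim] for h in range(num_heads)]

    q = to_heads(fused[0:embed_dim])
    k = to_heads(fused[embed_dim:2 * embed_dim])
    v = to_heads(fused[2 * embed_dim:3 * embed_dim])
    return q, k, v


def attention_scale(embed_dim: int, num_heads: int) -> float:
    """Standard scaled dot-product factor: 1 / sqrt(head_dim)."""
    return (embed_dim // num_heads) ** -0.5
```

For a configuration with hidden size 3200 and 25 heads, each head has dimension 128, and the fused layer emits 9600 values per token.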

InternMLP provides a standard two-layer feed-forward network with a configurable activation (looked up in ACT2FN).

InternVisionEncoderLayer combines attention and MLP with pre-norm (RMSNorm), learnable layer scale parameters (ls1, ls2), and stochastic depth (DropPath) with linearly increasing drop rates across layers.
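
The linearly increasing drop rates can be sketched as follows: the first layer gets rate 0 and the last layer gets the configured maximum. The function name linear_drop_path_rates and the example values are illustrative assumptions; the module may compute the schedule differently (for instance with a tensor linspace).

```python
from typing import List


def linear_drop_path_rates(max_rate: float, num_layers: int) -> List[float]:
    """Per-layer stochastic-depth rates, evenly spaced from 0.0 up to max_rate."""
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    if num_layers == 1:
        # A single layer sits at the start of the ramp, so it is never dropped.
        return [0.0]
    step = max_rate / (num_layers - 1)
    return [step * i for i in range(num_layers)]
```

Deeper layers are dropped more often than shallow ones during training, which regularizes the later blocks most strongly while leaving the early feature extraction largely intact.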

InternVisionEncoder stacks encoder layers with gradient checkpointing enabled by default for memory efficiency during training.

InternVisionModel wraps embeddings and encoder, providing CLS token pooling. It includes a resize_pos_embeddings method for adapting position embeddings to new image sizes via bicubic interpolation.

The module attempts to use FusedRMSNorm from apex when available, falling back to a pure PyTorch InternRMSNorm implementation.
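
A minimal sketch of this fallback pattern, together with the RMSNorm formula on plain Python lists, is shown below. The flag HAS_FUSED_RMSNORM and the function rms_norm are hypothetical names for illustration; the module's InternRMSNorm is a PyTorch layer, and the default eps here is an assumed value.

```python
import math
from typing import List, Sequence

try:
    # Optional dependency: only importable when NVIDIA apex is installed.
    from apex.normalization import FusedRMSNorm  # noqa: F401
    HAS_FUSED_RMSNORM = True
except Exception:
    # apex missing or failing to load: use the plain implementation below.
    HAS_FUSED_RMSNORM = False


def rms_norm(x: Sequence[float], weight: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Compute y_i = weight_i * x_i / sqrt(mean(x^2) + eps)."""
    if len(x) == 0 or len(x) != len(weight):
        raise ValueError("x and weight must be non-empty and of equal length")
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [w * v * inv_rms for v, w in zip(x, weight)]
```

Unlike LayerNorm, this normalization does not subtract the mean and has no bias term; it only rescales each vector to unit root-mean-square before applying the learned weight.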

Usage

Use this as the primary visual backbone in the LLaVA-InternVL configuration, providing the 6B-parameter InternViT features that distinguish InternVL from standard CLIP-based LLaVA models.

Code Reference

Source Location

Repository: OpenGVLab_InternVL
File: internvl_chat_llava/llava/model/multimodal_encoder/intern_vit_6b/modeling_intern_vit.py
Lines: 1-354

Signature

class InternVisionModel(PreTrainedModel):
    main_input_name = 'pixel_values'
    config_class = InternVisionConfig

    def __init__(self, config: InternVisionConfig):
        ...

    def resize_pos_embeddings(self, old_size, new_size, patch_size):
        ...

    def forward(self, pixel_values=None, output_hidden_states=None,
                return_dict=None, pixel_embeds=None):
        ...

Import

from internvl_chat_llava.llava.model.multimodal_encoder.intern_vit_6b.modeling_intern_vit import (
    InternVisionModel,
    InternVisionEncoder,
    InternAttention,
)

I/O Contract

Inputs

Name	Type	Required	Description
pixel_values	torch.FloatTensor [batch, 3, height, width]	Yes (or pixel_embeds)	Input images for patch embedding
pixel_embeds	torch.FloatTensor [batch, seq_len, hidden_size]	No	Pre-computed patch embeddings (bypasses embedding layer)
output_hidden_states	bool	No	Whether to return all hidden states from each layer
return_dict	bool	No	Whether to return a ModelOutput instead of a tuple

Outputs

Name	Type	Description
last_hidden_state	torch.FloatTensor [batch, seq_len, hidden_size]	Hidden states from the last encoder layer
pooler_output	torch.FloatTensor [batch, hidden_size]	CLS token output (first token)
hidden_states	tuple(torch.FloatTensor)	All hidden states when output_hidden_states=True

Usage Examples

Basic Usage

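import torch

from internvl_chat_llava.llava.model.multimodal_encoder.intern_vit_6b.modeling_intern_vit import (
    InternVisionModel
)
from internvl_chat_llava.llava.model.multimodal_encoder.intern_vit_6b.configuration_intern_vit import (
    InternVisionConfig
)

config = InternVisionConfig(
    hidden_size=3200, num_attention_heads=25, num_hidden_layers=48,
    image_size=224, patch_size=14,
    use_flash_attn=True,  # takes effect only when FlashAttention is available
)
# Randomly initialized here; load pretrained weights for real use
model = InternVisionModel(config)

# Placeholder batch standing in for preprocessed images: [batch, 3, height, width].
# FlashAttention kernels generally expect fp16/bf16 tensors on a CUDA device.
images = torch.randn(2, 3, 224, 224)

# Extract visual features
outputs = model(pixel_values=images)
visual_features = outputs.last_hidden_state   # [batch, num_patches+1, 3200]
cls_features = outputs.pooler_output          # [batch, 3200]

# Adapt position embeddings to a different image resolution
model.resize_pos_embeddings(old_size=224, new_size=448, patch_size=14)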
