
Implementation:OpenGVLab InternVL InternViT 6B Model

From Leeroopedia


Knowledge Sources
Domains: Vision Transformer, Visual Encoder, LLaVA
Last Updated: 2026-02-07 14:00 GMT

Overview

This module implements the InternViT-6B vision transformer model as a standalone visual encoder for use within the LLaVA multimodal framework.

Description

The InternViT-6B implementation provides a high-capacity vision transformer with several distinguishing features:

InternVisionEmbeddings converts images to patch tokens via a Conv2d projection, prepends a learnable class token, and adds interpolatable position embeddings using bicubic interpolation (via _get_pos_embed) to support variable input resolutions at inference time.
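The token bookkeeping behind that interpolation can be sketched in plain Python. This is a dependency-free illustration rather than the module's code: the helper names `grid_side` and `num_tokens` are hypothetical, and the sketch assumes square images whose side is a multiple of the patch size, with a convolution stride equal to the patch size. It also assumes the common ViT recipe in which only the 2D grid part of the position table is interpolated and the class-token entry is carried over unchanged.

```python
def grid_side(image_size: int, patch_size: int) -> int:
    """Number of patches along one side of a square image (hypothetical helper)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    return image_size // patch_size


def num_tokens(image_size: int, patch_size: int) -> int:
    """Patch tokens plus the single prepended class token."""
    side = grid_side(image_size, patch_size)
    return side * side + 1


# Position embeddings learned at 224 px cover a 16x16 grid plus the class token.
# Feeding 448 px images needs a 32x32 grid, so the grid part of the table has
# to be interpolated from 16x16 to 32x32 (assumed: class-token entry untouched).
print(grid_side(224, 14), num_tokens(224, 14))  # 16 257
print(grid_side(448, 14), num_tokens(448, 14))  # 32 1025
```

The sequence length seen by the encoder therefore grows quadratically with the image side, which is why resolution changes require a resized position table.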

InternAttention implements multi-head self-attention with a fused QKV linear projection (single linear layer outputting 3x embed_dim), optional QK normalization using InternRMSNorm for training stability, and support for FlashAttention (when available) for memory-efficient attention computation. The _naive_attn path handles standard scaled dot-product attention, while _flash_attn uses the FlashAttention kernel with einops rearrangement.
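As a rough illustration of the fused projection and the naive attention path, the following plain-Python sketch splits one token's fused output into per-head query, key, and value vectors and then applies scaled dot-product attention for a single head. The function names are hypothetical, the assumed memory layout is query block, then key block, then value block, and the sketch leaves out QK normalization, dropout, and the FlashAttention kernel.

```python
import math


def split_fused_qkv(fused, num_heads):
    """Split one token's fused projection (length 3 * embed_dim) into
    per-head query, key and value vectors (assumed layout: q | k | v)."""
    embed_dim = len(fused) // 3
    head_dim = embed_dim // num_heads
    q, k, v = (fused[i * embed_dim:(i + 1) * embed_dim] for i in range(3))

    def by_head(vec):
        return [vec[h * head_dim:(h + 1) * head_dim] for h in range(num_heads)]

    return by_head(q), by_head(k), by_head(v)


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def naive_attention(queries, keys, values):
    """Scaled dot-product attention for one head.

    queries, keys, values: lists of per-token vectors (one list per token).
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs


# embed_dim = 4 split across two heads of size 2
fused = [float(i) for i in range(12)]
q_heads, k_heads, v_heads = split_fused_qkv(fused, num_heads=2)
print(q_heads)  # [[0.0, 1.0], [2.0, 3.0]]
```

A single fused projection replaces three separate matrix multiplications with one, which is the usual motivation for this layout.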

InternMLP provides a standard two-layer feed-forward network with a configurable activation function (selected via ACT2FN).

InternVisionEncoderLayer combines attention and MLP with pre-norm (RMSNorm), learnable layer scale parameters (ls1, ls2), and stochastic depth (DropPath) with linearly increasing drop rates across layers.
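The per-layer drop rates and the residual update can be sketched without any framework. The helper names below are hypothetical, and the sketch makes two simplifying assumptions: the drop rates are evenly spaced from 0 at the first layer to a maximum at the last, and a dropped branch simply contributes nothing (it omits the 1/keep_prob rescaling that DropPath implementations typically apply to kept samples during training).

```python
def drop_path_rates(max_rate, num_layers):
    """Evenly spaced drop rates from 0 (first layer) to max_rate (last layer)."""
    if num_layers == 1:
        return [0.0]
    step = max_rate / (num_layers - 1)
    return [step * i for i in range(num_layers)]


def residual_branch(x, sublayer, norm, layer_scale, keep=True):
    """Pre-norm residual update for one token vector:

        x + layer_scale * sublayer(norm(x))

    keep=False models stochastic depth dropping the branch for this sample,
    in which case the input passes through unchanged.
    """
    if not keep:
        return list(x)
    branch = sublayer(norm(x))
    return [xi + s * bi for xi, s, bi in zip(x, layer_scale, branch)]


rates = drop_path_rates(0.1, 5)  # rises linearly from 0.0 to 0.1

# Toy sublayer and identity norm, just to show the data flow of one branch.
updated = residual_branch(
    [1.0, 2.0],
    sublayer=lambda v: [10.0, 10.0],
    norm=lambda v: v,
    layer_scale=[0.1, 0.1],
)
print(updated)  # [2.0, 3.0]
```

An encoder layer applies this pattern twice, once with the attention sublayer scaled by ls1 and once with the MLP sublayer scaled by ls2.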

InternVisionEncoder stacks encoder layers with gradient checkpointing enabled by default for memory efficiency during training.

InternVisionModel wraps embeddings and encoder, providing CLS token pooling. It includes a resize_pos_embeddings method for adapting position embeddings to new image sizes via bicubic interpolation.

The module attempts to use FusedRMSNorm from apex when available, falling back to a pure PyTorch InternRMSNorm implementation.
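For reference, the computation that an RMSNorm layer performs can be written in a few lines of plain Python. This is a minimal sketch of the general technique, not the module's InternRMSNorm code; the function name `rms_norm` and the default `eps` value are assumptions made for illustration.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then by a learned weight.

    Unlike LayerNorm, there is no mean subtraction and no bias term.
    """
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [w * v * inv_rms for w, v in zip(weight, x)]


normalized = rms_norm([3.0, 4.0], weight=[1.0, 1.0])
# With unit weights, the mean square of the output is very close to 1.
print(sum(v * v for v in normalized) / len(normalized))
```

A fused kernel computes the same quantity in a single pass over the data, which is the reason to prefer it when it is installed.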

Usage

Use this module as the primary visual backbone in the LLaVA-InternVL configuration. It supplies the 6B-parameter InternViT features that distinguish InternVL from standard CLIP-based LLaVA models.

Code Reference

Source Location

Signature

class InternVisionModel(PreTrainedModel):
    main_input_name = 'pixel_values'
    config_class = InternVisionConfig

    def __init__(self, config: InternVisionConfig):
        ...

    def resize_pos_embeddings(self, old_size, new_size, patch_size):
        ...

    def forward(self, pixel_values=None, output_hidden_states=None,
                return_dict=None, pixel_embeds=None):
        ...

Import

from internvl_chat_llava.llava.model.multimodal_encoder.intern_vit_6b.modeling_intern_vit import (
    InternVisionModel,
    InternVisionEncoder,
    InternAttention,
)

I/O Contract

Inputs

Name | Type | Required | Description
pixel_values | torch.FloatTensor [batch, 3, height, width] | Yes (or pixel_embeds) | Input images for patch embedding
pixel_embeds | torch.FloatTensor [batch, seq_len, hidden_size] | No | Pre-computed patch embeddings (bypasses embedding layer)
output_hidden_states | bool | No | Whether to return all hidden states from each layer
return_dict | bool | No | Whether to return a ModelOutput instead of a tuple

Outputs

Name | Type | Description
last_hidden_state | torch.FloatTensor [batch, seq_len, hidden_size] | Hidden states from the last encoder layer
pooler_output | torch.FloatTensor [batch, hidden_size] | CLS token output (first token)
hidden_states | tuple(torch.FloatTensor) | All hidden states when output_hidden_states=True

Usage Examples

Basic Usage

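import torch

from internvl_chat_llava.llava.model.multimodal_encoder.intern_vit_6b.modeling_intern_vit import (
    InternVisionModel
)
from internvl_chat_llava.llava.model.multimodal_encoder.intern_vit_6b.configuration_intern_vit import (
    InternVisionConfig
)

# use_flash_attn only takes effect when FlashAttention is installed; that path
# expects fp16/bf16 tensors on a CUDA device.
config = InternVisionConfig(
    hidden_size=3200, num_attention_heads=25, num_hidden_layers=48,
    image_size=224, patch_size=14, use_flash_attn=True,
)
model = InternVisionModel(config)

# Placeholder batch of two RGB images; use preprocessed images in practice
images = torch.randn(2, 3, 224, 224)

# Extract visual features
outputs = model(pixel_values=images)
visual_features = outputs.last_hidden_state   # [batch, num_patches+1, 3200]
cls_features = outputs.pooler_output          # [batch, 3200], the CLS token

# Resize position embeddings before feeding 448x448 images
model.resize_pos_embeddings(old_size=224, new_size=448, patch_size=14)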
