

Implementation:OpenGVLab InternVL InternViT 14B Model

From Leeroopedia
Revision as of 16:14, 16 February 2026 by Admin (auto-imported from implementations/OpenGVLab_InternVL_InternViT_14B_Model.md)


Knowledge Sources
Domains: Vision Transformer, Visual Encoder, InternVL
Last Updated: 2026-02-07 14:00 GMT

Overview

This module implements the InternViT vision transformer as the visual encoder component within the InternVL-14B composite model architecture.

Description

This is the same InternViT architecture as the standalone intern_vit_6b variant, co-located within the internvl_14b package so that the InternVL-14B composite model can import it directly. It is structurally identical to the intern_vit_6b version and provides:

InternVisionEmbeddings with patch embedding via Conv2d, learnable class token, and interpolatable position embeddings using bicubic interpolation for variable resolution support.

InternAttention with fused QKV projection, optional QK normalization (InternRMSNorm), and FlashAttention support. The attention module supports both a naive attention path and a flash attention path.

InternVisionEncoderLayer with pre-norm RMSNorm, learnable layer scale (ls1, ls2 parameters initialized to initializer_factor), and stochastic depth (DropPath) with linearly increasing drop rates.

InternVisionEncoder with gradient checkpointing enabled by default for training.

InternVisionModel with CLS token pooling and a resize_pos_embeddings method for adapting to different image sizes.
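
Two of the ideas in the component list above, the linearly increasing stochastic-depth rates and the layer-scaled residual branch, can be illustrated without any framework. The following is a minimal pure-Python sketch, not the module's code: the helper names are hypothetical, and it assumes the common DropPath formulation in which a kept branch is rescaled by 1 / (1 - rate).

```python
import random


def drop_path_rates(max_rate: float, num_layers: int) -> list[float]:
    """Linearly spaced rates from 0 (first layer) to max_rate (last layer)."""
    if num_layers == 1:
        return [0.0]
    step = max_rate / (num_layers - 1)
    return [step * i for i in range(num_layers)]


def drop_path(branch: list[float], rate: float, training: bool,
              rng: random.Random) -> list[float]:
    """Stochastic depth for one sample: drop the whole residual branch with
    probability `rate`, otherwise rescale it by 1 / (1 - rate)."""
    if not training or rate == 0.0:
        return branch
    keep = 1.0 - rate
    if rng.random() >= keep:
        return [0.0] * len(branch)
    return [v / keep for v in branch]


def residual_with_layer_scale(x, branch, scale, rate, training, rng):
    """Compute x + drop_path(scale * branch), the pre-norm residual form.

    `scale` plays the role of a learnable per-channel vector such as ls1 or ls2.
    """
    scaled = [s * b for s, b in zip(scale, branch)]
    dropped = drop_path(scaled, rate, training, rng)
    return [xi + di for xi, di in zip(x, dropped)]


# Illustrative values only: 5 layers with a maximum drop rate of 0.1.
rates = drop_path_rates(max_rate=0.1, num_layers=5)
print(rates)
```

Deeper layers receive higher drop rates, so early layers are almost always active during training while later ones are regularized more strongly. At inference time (`training=False`) the branch is always kept and never rescaled.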

The module defines InternRMSNorm for normalization and automatically substitutes apex's FusedRMSNorm for it when apex is available.
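
The optional-apex pattern can be sketched as follows. This is an illustration under assumptions rather than the module's code: `RMSNorm` is a generic pure-PyTorch RMS normalization, and `NormLayer` is a hypothetical alias introduced here to show the selection logic.

```python
import torch
from torch import nn


class RMSNorm(nn.Module):
    """Pure-PyTorch RMS normalization over the last dimension."""

    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        input_dtype = x.dtype
        x = x.to(torch.float32)  # compute the statistics in float32
        variance = x.pow(2).mean(dim=-1, keepdim=True)
        x = x * torch.rsqrt(variance + self.eps)
        return self.weight * x.to(input_dtype)


# Prefer the fused apex kernel when apex is importable; otherwise keep the
# pure-PyTorch class. NormLayer is a hypothetical alias for this sketch.
try:
    from apex.normalization import FusedRMSNorm
    NormLayer = FusedRMSNorm
except ImportError:
    NormLayer = RMSNorm
```

Resolving the class once at import time keeps the rest of the model code unaware of which implementation is active.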

This version is imported by modeling_internvl.py to construct the vision encoder component of the InternVL-14B model, alongside the QLLaMA query decoder.
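
As a rough illustration of the naive attention path described above for InternAttention, the sketch below combines a fused QKV projection with optional QK normalization. It is not the module's implementation: `NaiveAttention` and `rms_normalize` are hypothetical names, the normalization is a weight-free stand-in for InternRMSNorm, and dropout and the FlashAttention path are omitted.

```python
import torch
from torch import nn


def rms_normalize(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Weight-free RMS normalization over the last dimension (stand-in)."""
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)


class NaiveAttention(nn.Module):
    """Multi-head self-attention with one fused QKV projection."""

    def __init__(self, dim: int, num_heads: int, qk_norm: bool = True):
        super().__init__()
        if dim % num_heads != 0:
            raise ValueError("dim must be divisible by num_heads")
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qk_norm = qk_norm
        self.qkv = nn.Linear(dim, 3 * dim)  # one matmul produces Q, K and V
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)  # each [B, N, C]
        if self.qk_norm:
            # Assumption: normalize across the full hidden dimension
            # before splitting into heads.
            q, k = rms_normalize(q), rms_normalize(k)
        # [B, N, C] -> [B, heads, N, head_dim]
        q, k, v = (t.reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


attn = NaiveAttention(dim=64, num_heads=4)
out = attn(torch.randn(2, 10, 64))
print(out.shape)  # torch.Size([2, 10, 64])
```

Normalizing queries and keys bounds the magnitude of the attention logits, which is a common way to stabilize training in very large vision transformers.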

Usage

Use this as the vision encoder within the InternVL-14B composite model, where it provides visual features to be processed through the QLLaMA cross-attention bridge.

Code Reference

Source Location

Signature

class InternVisionModel(PreTrainedModel):
    main_input_name = 'pixel_values'
    config_class = InternVisionConfig

    def __init__(self, config: InternVisionConfig):
        ...

    def resize_pos_embeddings(self, old_size, new_size, patch_size):
        ...

    def forward(self, pixel_values=None, output_hidden_states=None,
                return_dict=None, pixel_embeds=None):
        ...
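
To give a sense of what a method like resize_pos_embeddings has to do, here is a minimal sketch of bicubic position-embedding resizing. The function `resize_pos_embed` is hypothetical and only assumes a position table of shape [1, 1 + grid², dim] with the class-token entry first; the actual method may differ in details such as in-place updates of the parameter.

```python
import torch
import torch.nn.functional as F


def resize_pos_embed(pos_embed: torch.Tensor, old_size: int, new_size: int,
                     patch_size: int) -> torch.Tensor:
    """Resize a [1, 1 + old_grid**2, dim] position table to a new image size."""
    old_grid = old_size // patch_size
    new_grid = new_size // patch_size
    # The class-token position is kept as-is; only patch positions are resized.
    cls_pos, patch_pos = pos_embed[:, :1], pos_embed[:, 1:]
    dim = patch_pos.shape[-1]
    # [1, N, dim] -> [1, dim, old_grid, old_grid] so the grid is treated as an image
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos.float(), size=(new_grid, new_grid),
                              mode="bicubic", align_corners=False)
    # Back to [1, new_grid**2, dim], restoring the original dtype
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
    return torch.cat([cls_pos, patch_pos.to(pos_embed.dtype)], dim=1)


# Illustrative sizes: 224 / 14 = 16 patches per side, resized to 448 / 14 = 32.
pos = torch.randn(1, 1 + 16 * 16, 32)
resized = resize_pos_embed(pos, old_size=224, new_size=448, patch_size=14)
print(resized.shape)  # torch.Size([1, 1025, 32])
```

Interpolating in float32 and casting back avoids relying on half-precision support in the interpolation kernel.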

Import

from internvl_chat_llava.llava.model.multimodal_encoder.internvl_14b.modeling_intern_vit import (
    InternVisionModel,
    InternVisionEmbeddings,
    InternVisionEncoder,
)

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| pixel_values | torch.FloatTensor [batch, 3, height, width] | Yes (or pixel_embeds) | Input images for patch embedding |
| pixel_embeds | torch.FloatTensor [batch, seq_len, hidden_size] | No | Pre-computed patch embeddings (bypasses embedding layer) |
| output_hidden_states | bool | No | Whether to return all hidden states |
| return_dict | bool | No | Whether to return a ModelOutput |

Outputs

| Name | Type | Description |
|---|---|---|
| last_hidden_state | torch.FloatTensor [batch, seq_len, hidden_size] | Hidden states from the final encoder layer |
| pooler_output | torch.FloatTensor [batch, hidden_size] | CLS token output |
| hidden_states | tuple(torch.FloatTensor) | All hidden states when output_hidden_states=True |
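
The shapes in the contract follow from simple arithmetic on the image and patch sizes. The helper below is a hypothetical sketch that assumes non-overlapping square patches, a single class token, and image dimensions divisible by the patch size; the numbers in the usage line are illustrative and are not read from any configuration.

```python
def vit_output_shapes(batch: int, height: int, width: int,
                      patch_size: int, hidden_size: int) -> dict:
    """Expected output shapes for a ViT with one CLS token."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of patch_size")
    num_patches = (height // patch_size) * (width // patch_size)
    seq_len = num_patches + 1  # +1 for the class token
    return {
        "last_hidden_state": (batch, seq_len, hidden_size),
        "pooler_output": (batch, hidden_size),
    }


shapes = vit_output_shapes(batch=2, height=224, width=224,
                           patch_size=14, hidden_size=3200)
print(shapes["last_hidden_state"])  # (2, 257, 3200)
```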

Usage Examples

Basic Usage

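from internvl_chat_llava.llava.model.multimodal_encoder.internvl_14b.modeling_intern_vit import (
    InternVisionModel
)

# Typically instantiated as part of InternVLModel;
# see modeling_internvl.py for integration context.
# Assumed to exist already: `config` (a composite config exposing `vision_config`)
# and `images` (a float tensor of shape [batch, 3, height, width]).
vision_model = InternVisionModel(config.vision_config)
outputs = vision_model(pixel_values=images)
image_embeds = outputs.last_hidden_state  # [batch, num_patches+1, hidden_size]
cls_embed = outputs.pooler_output         # [batch, hidden_size]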
