Implementation:OpenGVLab InternVL InternVisionConfig

From Leeroopedia


Knowledge Sources
Domains: Model Configuration, Vision Encoder, InternViT
Last Updated: 2026-02-07 14:00 GMT

Overview

Defines the InternVisionConfig configuration class for the InternViT-6B vision encoder, the visual backbone of the InternVL multimodal architecture.

Description

InternVisionConfig extends HuggingFace's PretrainedConfig with vision-specific parameters:

  • Image and patch settings -- num_channels (3), patch_size (14), and image_size (224).
  • Model dimensions -- hidden_size (3200), num_attention_heads (25), intermediate_size (12800), and num_hidden_layers (48).
  • Attention features -- qkv_bias (False), qk_normalization (True), and use_flash_attn (True) for enabling flash attention.
  • Normalization -- norm_type ("rms_norm") and layer_norm_eps (1e-6).
  • Regularization -- dropout (0.0), drop_path_rate (0.0), and attention_dropout (0.0).
  • Initialization -- initializer_range (0.02) and initializer_factor (0.1) for layer scale.
  • Activation -- hidden_act ("gelu").
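
The defaults above imply a few derived quantities that are useful for sanity-checking a config. The sketch below computes them with plain arithmetic. It assumes a standard ViT block layout (query, key, value, and output projections in attention, plus a two-layer MLP); that layout is an assumption for illustration and is not read from the InternVL source. The parameter count is a rough estimate that ignores embeddings, norms, biases, and layer scale.

```python
# Default values as listed in the parameter groups above.
image_size = 224
patch_size = 14
hidden_size = 3200
num_attention_heads = 25
intermediate_size = 12800
num_hidden_layers = 48

# Patch grid: the image is cut into non-overlapping patch_size x patch_size tiles.
patches_per_side = image_size // patch_size
num_patches = patches_per_side ** 2

# Per-head width: hidden_size must divide evenly across the heads.
assert hidden_size % num_attention_heads == 0
head_dim = hidden_size // num_attention_heads

# Rough weight count per block (assumed layout, see lead-in):
#   attention: Q, K, V, and output projections -> 4 * hidden_size^2
#   MLP: up- and down-projection -> 2 * hidden_size * intermediate_size
attn_params = 4 * hidden_size ** 2
mlp_params = 2 * hidden_size * intermediate_size
approx_total = num_hidden_layers * (attn_params + mlp_params)

print(f"patch grid: {patches_per_side} x {patches_per_side} = {num_patches} patches")
print(f"head dim: {head_dim}")
print(f"approx. block weights: {approx_total / 1e9:.2f}B")
```

Under these assumptions the estimate lands a little under six billion weights, which is consistent with the "6B" in the model name.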

The model_type is set to "intern_vit_6b". The overridden from_pretrained class method handles nested vision_config dictionaries that may appear within larger InternVL model configs, extracting the vision sub-config automatically.
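
The nested-config handling can be pictured with plain dictionaries. This is a minimal sketch of the idea, not the library's implementation: `select_vision_config` is a hypothetical helper name, and the composite dictionary (including its `model_type` and the `image_size` of 448) is made up for illustration.

```python
from typing import Any, Dict


def select_vision_config(config_dict: Dict[str, Any]) -> Dict[str, Any]:
    """Return the nested vision sub-config if present, else the dict unchanged.

    Hypothetical helper: mirrors the behaviour described above, where a
    'vision_config' entry inside a larger config is unwrapped automatically.
    """
    if "vision_config" in config_dict:
        return config_dict["vision_config"]
    return config_dict


# Illustrative composite config with the vision settings nested one level down.
chat_config = {
    "model_type": "internvl_chat",  # assumed value, for illustration only
    "vision_config": {
        "model_type": "intern_vit_6b",
        "image_size": 448,
        "patch_size": 14,
    },
}

# Illustrative standalone vision config with no nesting.
standalone_config = {
    "model_type": "intern_vit_6b",
    "image_size": 224,
    "patch_size": 14,
}

nested = select_vision_config(chat_config)
flat = select_vision_config(standalone_config)

print(nested["model_type"], nested["image_size"])
print(flat["model_type"], flat["image_size"])
```

Either input yields a dictionary holding only vision-encoder fields, so the same loading path works for a standalone vision checkpoint and for a full InternVL checkpoint.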

Usage

Use this configuration class when instantiating the InternViT-6B vision encoder within InternVL. It is typically loaded from a pretrained model directory or composed into a larger InternVL model config.

Code Reference

Source Location

Signature

class InternVisionConfig(PretrainedConfig):
    model_type = 'intern_vit_6b'

    def __init__(self, num_channels=3, patch_size=14, image_size=224,
                 qkv_bias=False, hidden_size=3200,
                 num_attention_heads=25, intermediate_size=12800,
                 qk_normalization=True, num_hidden_layers=48,
                 use_flash_attn=True, hidden_act='gelu',
                 norm_type='rms_norm', layer_norm_eps=1e-6,
                 dropout=0.0, drop_path_rate=0.0,
                 attention_dropout=0.0, initializer_range=0.02,
                 initializer_factor=0.1, **kwargs): ...

    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path, **kwargs): ...

Import

from internvl.model.internvl_chat.configuration_intern_vit import InternVisionConfig

I/O Contract

Inputs

Name | Type | Required | Description
hidden_size | int | No | Dimensionality of encoder layers (default: 3200)
num_attention_heads | int | No | Number of attention heads (default: 25)
num_hidden_layers | int | No | Number of transformer layers (default: 48)
image_size | int | No | Input image resolution (default: 224)
patch_size | int | No | Patch resolution (default: 14)
use_flash_attn | bool | No | Enable flash attention (default: True)
qk_normalization | bool | No | Enable QK normalization (default: True)

Outputs

Name | Type | Description
config | InternVisionConfig | Configuration object for the InternViT-6B vision encoder

Usage Examples

Basic Usage

from internvl.model.internvl_chat.configuration_intern_vit import InternVisionConfig

# Create a config with the default InternViT-6B settings
default_config = InternVisionConfig()

# Load from a pretrained model directory (handles a nested vision_config);
# "path/to/model" is a placeholder
pretrained_config = InternVisionConfig.from_pretrained("path/to/model")
