Implementation:OpenGVLab InternVL InternVisionConfig
| Knowledge Sources | |
|---|---|
| Domains | Model Configuration, Vision Encoder, InternViT |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Defines the InternVisionConfig configuration class for the InternViT-6B vision encoder, the visual backbone of the InternVL multimodal architecture.
Description
InternVisionConfig extends HuggingFace's PretrainedConfig with vision-specific parameters:
- Image and patch settings -- num_channels (3), patch_size (14), and image_size (224).
- Model dimensions -- hidden_size (3200), num_attention_heads (25), intermediate_size (12800), and num_hidden_layers (48).
- Attention features -- qkv_bias (False), qk_normalization (True), and use_flash_attn (True), which enables flash attention.
- Normalization -- norm_type ("rms_norm") and layer_norm_eps (1e-6).
- Regularization -- dropout (0.0), drop_path_rate (0.0), and attention_dropout (0.0).
- Initialization -- initializer_range (0.02) and initializer_factor (0.1) for layer scale.
- Activation -- hidden_act ("gelu").
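The default image, patch, and dimension settings listed above imply a few derived quantities. The following is a minimal sketch of that arithmetic using only the documented defaults. The variable names `patches_per_side`, `num_patches`, and `head_dim` are illustrative and are not attributes of InternVisionConfig, and how the encoder itself consumes these values is outside the scope of this page.

```python
# Defaults copied from the documented constructor signature.
image_size = 224
patch_size = 14
hidden_size = 3200
num_attention_heads = 25

# The image side must divide evenly into patches,
# and the hidden size must divide evenly across heads.
assert image_size % patch_size == 0
assert hidden_size % num_attention_heads == 0

# Illustrative derived values (hypothetical names, not config attributes).
patches_per_side = image_size // patch_size      # patches along one image side
num_patches = patches_per_side ** 2              # patches in the full grid
head_dim = hidden_size // num_attention_heads    # channels per attention head

print(patches_per_side, num_patches, head_dim)   # 16 256 128
```

The same two divisibility checks are a reasonable sanity test when overriding image_size, patch_size, hidden_size, or num_attention_heads.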
The model_type is set to "intern_vit_6b". The overridden from_pretrained class method handles nested vision_config dictionaries that may appear within larger InternVL model configs, extracting the vision sub-config automatically.
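The nested-config handling described above can be illustrated with plain dictionaries. This is a sketch of the idea only, not the repository's implementation: `extract_vision_config` is a hypothetical helper, and the parent `model_type` value and the dictionary contents are placeholders chosen for the example.

```python
from typing import Any, Dict


def extract_vision_config(config_dict: Dict[str, Any]) -> Dict[str, Any]:
    """Return the nested vision sub-config if present, else the dict unchanged.

    Hypothetical helper that mirrors the behavior described for
    from_pretrained; it is not part of the InternVL code base.
    """
    if "vision_config" in config_dict:
        return config_dict["vision_config"]
    return config_dict


# Placeholder for a larger multimodal config that embeds the vision settings.
full_config = {
    "model_type": "hypothetical_parent_model",
    "vision_config": {"model_type": "intern_vit_6b", "hidden_size": 3200},
}

# Placeholder for a standalone vision config with no nesting.
standalone_config = {"model_type": "intern_vit_6b", "hidden_size": 3200}

# Both inputs resolve to the same vision settings.
nested_result = extract_vision_config(full_config)
flat_result = extract_vision_config(standalone_config)
print(nested_result == flat_result)
```

With this approach, the same loading call works whether the path points at a standalone vision encoder checkpoint or at a full InternVL checkpoint.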
Usage
Use this configuration class when instantiating the InternViT-6B vision encoder within InternVL. It is typically loaded from a pretrained model directory or composed into a larger InternVL model config.
Code Reference
Source Location
- Repository: OpenGVLab_InternVL
- File: internvl_chat/internvl/model/internvl_chat/configuration_intern_vit.py
- Lines: 1-120
Signature
class InternVisionConfig(PretrainedConfig):
    model_type = 'intern_vit_6b'

    def __init__(self, num_channels=3, patch_size=14, image_size=224,
                 qkv_bias=False, hidden_size=3200,
                 num_attention_heads=25, intermediate_size=12800,
                 qk_normalization=True, num_hidden_layers=48,
                 use_flash_attn=True, hidden_act='gelu',
                 norm_type='rms_norm', layer_norm_eps=1e-6,
                 dropout=0.0, drop_path_rate=0.0,
                 attention_dropout=0.0, initializer_range=0.02,
                 initializer_factor=0.1, **kwargs): ...

    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path, **kwargs): ...
Import
from internvl.model.internvl_chat.configuration_intern_vit import InternVisionConfig
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| hidden_size | int | No | Dimensionality of encoder layers (default: 3200) |
| num_attention_heads | int | No | Number of attention heads (default: 25) |
| num_hidden_layers | int | No | Number of transformer layers (default: 48) |
| image_size | int | No | Input image resolution (default: 224) |
| patch_size | int | No | Patch resolution (default: 14) |
| use_flash_attn | bool | No | Enable flash attention (default: True) |
| qk_normalization | bool | No | Enable QK normalization (default: True) |
Outputs
| Name | Type | Description |
|---|---|---|
| config | InternVisionConfig | Configuration object for InternViT-6B vision encoder |
Usage Examples
Basic Usage
from internvl.model.internvl_chat.configuration_intern_vit import InternVisionConfig
# Create default InternViT-6B config
config = InternVisionConfig()
# Load from pretrained (handles nested vision_config)
config = InternVisionConfig.from_pretrained("path/to/model")