Implementation:OpenGVLab InternVL InternVisionConfig LLaVA
| Knowledge Sources | |
|---|---|
| Domains | Vision Encoder, Configuration, LLaVA |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
HuggingFace configuration class for the InternViT-6B vision encoder used within the LLaVA multimodal encoder module.
Description
InternVisionConfig extends PretrainedConfig, sets model_type to intern_vit_6b, and defines the architectural parameters of the InternViT-6B vision encoder. Key parameters include hidden_size (default 3200), num_attention_heads (default 25), intermediate_size (default 12800), num_hidden_layers (default 48), patch_size (default 14), and image_size (default 224). Three flags control attention: qkv_bias (disabled by default), qk_normalization (enabled by default), and use_flash_attn (enabled by default). Dropout is configurable at three levels (dropout, drop_path_rate, and attention_dropout), all defaulting to 0.0. The overridden from_pretrained() classmethod handles nested configs: when the loaded configuration dictionary contains a vision_config key, it uses that sub-dictionary, so the config can be loaded from composite model checkpoints as well as standalone ones.
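The nested-config handling described above can be sketched in plain Python. This is a minimal illustration rather than the repository's code: `select_vision_config` is a hypothetical helper, and the sample dictionaries (including the composite `model_type` value) are made up for the example. In the real class the dictionary would come from the checkpoint's configuration file.

```python
from typing import Any, Dict


def select_vision_config(config_dict: Dict[str, Any]) -> Dict[str, Any]:
    """Return the nested vision sub-config if present, else the dict unchanged.

    Hypothetical helper that mirrors the described behavior; not a library API.
    """
    if "vision_config" in config_dict:
        return config_dict["vision_config"]
    return config_dict


# Standalone vision checkpoint: parameters sit at the top level.
standalone = {"model_type": "intern_vit_6b", "hidden_size": 3200}

# Composite checkpoint (illustrative shape): vision parameters are nested.
composite = {
    "model_type": "some_composite_model",  # made-up value for the example
    "vision_config": {"model_type": "intern_vit_6b", "hidden_size": 3200},
}

standalone_cfg = select_vision_config(standalone)
composite_cfg = select_vision_config(composite)
```

Both calls yield the same vision parameters, which is why the same configuration class can serve either kind of checkpoint.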
Usage
Use this configuration class when loading or creating an InternViT-6B vision encoder within the LLaVA framework. CLIPVisionTower uses it for delayed loading, and InternVisionModel uses it for model instantiation.
Code Reference
Source Location
- Repository: OpenGVLab_InternVL
- File: internvl_chat_llava/llava/model/multimodal_encoder/intern_vit_6b/configuration_intern_vit.py
- Lines: 1-117
Signature
class InternVisionConfig(PretrainedConfig):
    model_type = 'intern_vit_6b'

    def __init__(self, num_channels=3, patch_size=14, image_size=224,
                 qkv_bias=False, hidden_size=3200, num_attention_heads=25,
                 intermediate_size=12800, qk_normalization=True,
                 num_hidden_layers=48, use_flash_attn=True, hidden_act='gelu',
                 layer_norm_eps=1e-6, dropout=0.0, drop_path_rate=0.0,
                 attention_dropout=0.0, initializer_range=0.02,
                 initializer_factor=0.1, **kwargs): ...

    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path, **kwargs): ...
Import
from llava.model.multimodal_encoder.intern_vit_6b.configuration_intern_vit import InternVisionConfig
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| hidden_size | int | No | Dimensionality of encoder layers (default: 3200) |
| num_attention_heads | int | No | Number of attention heads (default: 25) |
| num_hidden_layers | int | No | Number of transformer layers (default: 48) |
| image_size | int | No | Input image resolution (default: 224) |
| patch_size | int | No | Size of each patch (default: 14) |
| use_flash_attn | bool | No | Whether to use flash attention (default: True) |
| qk_normalization | bool | No | Whether to normalize queries/keys (default: True) |
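The defaults in the table imply a few derived quantities that are useful when sanity-checking a custom configuration. The sketch below computes them with plain arithmetic. `derived_shapes` is a hypothetical helper, not part of the configuration class, and it assumes non-overlapping square patches (stride equal to patch_size).

```python
from typing import Dict


def derived_shapes(hidden_size: int = 3200,
                   num_attention_heads: int = 25,
                   image_size: int = 224,
                   patch_size: int = 14) -> Dict[str, int]:
    """Compute per-head width and patch counts from config values.

    Hypothetical helper; assumes non-overlapping square patches.
    """
    # The hidden size must split evenly across attention heads.
    if hidden_size % num_attention_heads != 0:
        raise ValueError("hidden_size must be divisible by num_attention_heads")
    # The image must tile exactly into patches under this assumption.
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")

    patches_per_side = image_size // patch_size
    return {
        "head_dim": hidden_size // num_attention_heads,
        "patches_per_side": patches_per_side,
        "num_patches": patches_per_side ** 2,
    }


default_shapes = derived_shapes()               # table defaults
larger_shapes = derived_shapes(image_size=448)  # higher input resolution
```

With the defaults, each head is 3200 / 25 = 128 wide and a 224-pixel image yields 16 x 16 = 256 patches. Doubling image_size to 448 quadruples the patch count to 1024.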
Outputs
| Name | Type | Description |
|---|---|---|
| config | InternVisionConfig | Configured InternViT-6B configuration instance |
Usage Examples
Basic Usage
from llava.model.multimodal_encoder.intern_vit_6b.configuration_intern_vit import InternVisionConfig
# Load from pretrained path
config = InternVisionConfig.from_pretrained("path/to/InternViT-6B")
# Or create with custom settings
config = InternVisionConfig(image_size=448, hidden_size=3200)