Implementation:OpenGVLab InternVL InternVisionConfig LLaVA

From Leeroopedia


Knowledge Sources
Domains: Vision Encoder, Configuration, LLaVA
Last Updated: 2026-02-07 14:00 GMT

Overview

HuggingFace configuration class for the InternViT-6B vision encoder used within the LLaVA multimodal encoder module.

Description

InternVisionConfig extends PretrainedConfig, sets model_type to intern_vit_6b, and defines the architectural parameters of the InternViT-6B vision encoder. The key parameters are hidden_size (default 3200), num_attention_heads (default 25), intermediate_size (default 12800), num_hidden_layers (default 48), patch_size (default 14), and image_size (default 224). Three flags control the attention layers: qkv_bias (default False), qk_normalization (default True), and use_flash_attn (default True). Dropout is configurable through dropout, drop_path_rate, and attention_dropout, all of which default to 0.0. The overridden from_pretrained() classmethod handles nested configs: when the loaded configuration contains a vision_config sub-dictionary, that sub-dictionary is used, so the config can be loaded directly from a composite model checkpoint.
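The defaults listed in this description imply a few derived quantities that the page does not state. The minimal sketch below works them out with plain arithmetic. `derived_shapes` is a hypothetical helper written for illustration and is not part of InternVisionConfig. It assumes the attention heads split hidden_size evenly and that the image is tiled into non-overlapping square patches; any extra class token is not counted.

```python
def derived_shapes(hidden_size=3200, num_attention_heads=25,
                   image_size=224, patch_size=14):
    """Return (head_dim, patches_per_side, num_patches) for a ViT-style encoder.

    Hypothetical helper: the default values mirror the documented
    InternVisionConfig defaults, but this function is not part of the library.
    """
    if hidden_size % num_attention_heads != 0:
        raise ValueError("hidden_size must be divisible by num_attention_heads")
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    head_dim = hidden_size // num_attention_heads   # width of each attention head
    patches_per_side = image_size // patch_size     # patches along one image edge
    return head_dim, patches_per_side, patches_per_side ** 2


print(derived_shapes())                # (128, 16, 256)
print(derived_shapes(image_size=448))  # (128, 32, 1024)
```

Under these assumptions, doubling image_size from 224 to 448 quadruples the number of patches while leaving the per-head width unchanged.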

Usage

Use this configuration class when loading or creating an InternViT-6B vision encoder within the LLaVA framework. It is used by the CLIPVisionTower for delayed loading and by InternVisionModel for model instantiation.

Code Reference

Source Location

Signature

class InternVisionConfig(PretrainedConfig):
    model_type = 'intern_vit_6b'
    def __init__(self, num_channels=3, patch_size=14, image_size=224,
                 qkv_bias=False, hidden_size=3200, num_attention_heads=25,
                 intermediate_size=12800, qk_normalization=True,
                 num_hidden_layers=48, use_flash_attn=True, hidden_act='gelu',
                 layer_norm_eps=1e-6, dropout=0.0, drop_path_rate=0.0,
                 attention_dropout=0.0, initializer_range=0.02,
                 initializer_factor=0.1, **kwargs): ...
    @classmethod
    def from_pretrained(cls, pretrained_model_name_or_path, **kwargs): ...
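The nested-config handling attributed to from_pretrained() can be illustrated with plain dictionaries. This is a sketch of the documented behavior, not the library's code: `select_vision_config` is a hypothetical helper, and the composite dictionary (including its `llm_config` key and all values) is made-up example data.

```python
def select_vision_config(config_dict):
    """Return the nested 'vision_config' dict if present, else the dict itself.

    Hypothetical helper mirroring the extraction step described for
    InternVisionConfig.from_pretrained(); it is not the actual implementation.
    """
    if "vision_config" in config_dict:
        return config_dict["vision_config"]
    return config_dict


# Made-up composite checkpoint config (illustrative values only).
composite = {
    "model_type": "composite_example",
    "vision_config": {"model_type": "intern_vit_6b", "image_size": 448},
    "llm_config": {"hidden_size": 4096},
}
# Made-up standalone vision config.
standalone = {"model_type": "intern_vit_6b", "image_size": 224}

print(select_vision_config(composite)["image_size"])
print(select_vision_config(standalone)["image_size"])
```

The same call works for both layouts, which is why a caller would not need to know whether a checkpoint stores the vision settings at the top level or under vision_config.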

Import

from llava.model.multimodal_encoder.intern_vit_6b.configuration_intern_vit import InternVisionConfig

I/O Contract

Inputs

| Name | Type | Required | Description |
| hidden_size | int | No | Dimensionality of encoder layers (default: 3200) |
| num_attention_heads | int | No | Number of attention heads (default: 25) |
| num_hidden_layers | int | No | Number of transformer layers (default: 48) |
| image_size | int | No | Input image resolution (default: 224) |
| patch_size | int | No | Size of each patch (default: 14) |
| use_flash_attn | bool | No | Whether to use flash attention (default: True) |
| qk_normalization | bool | No | Whether to normalize queries/keys (default: True) |

Outputs

| Name | Type | Description |
| config | InternVisionConfig | Configured InternViT-6B configuration instance |

Usage Examples

Basic Usage

from llava.model.multimodal_encoder.intern_vit_6b.configuration_intern_vit import InternVisionConfig

# Load from pretrained path
config = InternVisionConfig.from_pretrained("path/to/InternViT-6B")

# Or create with custom settings
config = InternVisionConfig(image_size=448, hidden_size=3200)
