Implementation:OpenGVLab InternVL InternVLConfig LLaVA
| Knowledge Sources | |
|---|---|
| Domains | Multimodal Models, Configuration, LLaVA |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Composite Hugging Face configuration for the InternVL-14B model that combines the InternViT vision and QLLaMA language configurations into a single config class.
Description
InternVLConfig extends PretrainedConfig with is_composition = True and composes two sub-configurations: InternVisionConfig for the vision encoder and LlamaConfig for the QLLaMA language backbone. Both sub-configs are built from dictionaries; an omitted dictionary defaults to empty, so that sub-config uses its own defaults. Two InternVL-specific parameters are injected into the QLLaMA config: num_query_token (default 96) and cross_attention_frequency (default 2). The top-level parameters are clip_embed_dim (default 768), attn_pool_num_heads (default 16), label_smoothing (default 0.0), initializer_range (default 0.02), use_backbone_lora and use_qllama_lora (both default 0; a non-zero value enables LoRA), and force_image_size, which overrides the image resolution when set. The top-level hidden_size is taken from the QLLaMA config. A custom to_dict() method converts both sub-configurations to plain dictionaries so that the composite config can be serialized to JSON.
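The composition pattern described above can be illustrated with a minimal, standard-library-only sketch. It is not the actual class: `SubConfigSketch` and `CompositeConfigSketch` are hypothetical stand-ins for the sub-config classes and for InternVLConfig, and the stub's default `hidden_size` of 4096 is an assumption chosen only for illustration.

```python
import copy
import json


class SubConfigSketch:
    """Hypothetical stand-in for InternVisionConfig / LlamaConfig."""

    def __init__(self, hidden_size=4096, **extra):
        # 4096 is an illustrative default, not a value from the real classes.
        self.hidden_size = hidden_size
        # Keep any additional keys as plain attributes.
        for key, value in extra.items():
            setattr(self, key, value)

    def to_dict(self):
        return copy.deepcopy(self.__dict__)


class CompositeConfigSketch:
    """Hypothetical stand-in mirroring the composition pattern."""

    model_type = "internvl"

    def __init__(self, vision_config=None, qllama_config=None,
                 num_query_token=96, cross_attention_frequency=2):
        # Missing sub-config dicts fall back to empty dicts (all defaults).
        self.vision_config = SubConfigSketch(**(vision_config or {}))
        self.qllama_config = SubConfigSketch(**(qllama_config or {}))
        # Inject the InternVL-specific parameters into the language sub-config.
        self.qllama_config.num_query_token = num_query_token
        self.qllama_config.cross_attention_frequency = cross_attention_frequency
        # The top-level hidden size is taken from the language sub-config.
        self.hidden_size = self.qllama_config.hidden_size

    def to_dict(self):
        # Replace sub-config objects with plain dicts so json.dumps works.
        output = copy.deepcopy(self.__dict__)
        output["vision_config"] = self.vision_config.to_dict()
        output["qllama_config"] = self.qllama_config.to_dict()
        output["model_type"] = self.model_type
        return output


config = CompositeConfigSketch(
    vision_config={"hidden_size": 3200},
    qllama_config={"hidden_size": 4096},
)
serialized = json.dumps(config.to_dict(), sort_keys=True)
```

Without the custom `to_dict()`, the nested config objects would reach `json.dumps` unconverted and raise a `TypeError`, which is why a composite config needs to flatten its sub-configs explicitly.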
Usage
Use this configuration when loading or creating an InternVL-14B composite model that pairs InternViT-6B with a QLLaMA language model, within the LLaVA multimodal encoder framework.
Code Reference
Source Location
- Repository: OpenGVLab_InternVL
- File: internvl_chat_llava/llava/model/multimodal_encoder/internvl_14b/configuration_internvl.py
- Lines: 1-108
Signature
class InternVLConfig(PretrainedConfig):
    model_type = 'internvl'
    is_composition = True

    def __init__(self, vision_config=None, qllama_config=None,
                 clip_embed_dim=768, attn_pool_num_heads=16,
                 num_query_token=96, label_smoothing=0.0,
                 cross_attention_frequency=2, use_backbone_lora=0,
                 use_qllama_lora=0, force_image_size=None,
                 initializer_range=0.02, **kwargs): ...

    def to_dict(self): ...
Import
from llava.model.multimodal_encoder.internvl_14b.configuration_internvl import InternVLConfig
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| vision_config | dict | No | Configuration dict for InternVisionConfig (defaults to empty) |
| qllama_config | dict | No | Configuration dict for LlamaConfig (defaults to empty) |
| num_query_token | int | No | Number of query tokens (default: 96) |
| cross_attention_frequency | int | No | Frequency of cross-attention layers (default: 2) |
| use_backbone_lora | int | No | LoRA rank for backbone (0 = disabled) |
| use_qllama_lora | int | No | LoRA rank for QLLaMA (0 = disabled) |
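| force_image_size | int or None | No | Override image resolution if set (default: None) |
| clip_embed_dim | int | No | CLIP embedding dimension (default: 768) |
| attn_pool_num_heads | int | No | Number of attention-pooling heads (default: 16) |
| label_smoothing | float | No | Label smoothing factor (default: 0.0) |
| initializer_range | float | No | Weight initialization range (default: 0.02) |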
Outputs
| Name | Type | Description |
|---|---|---|
| config | InternVLConfig | Composite configuration with vision_config and qllama_config sub-configs |
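The use_backbone_lora and use_qllama_lora inputs double as an on/off switch and a LoRA rank. The sketch below shows one way downstream code might read them; `lora_settings` is a hypothetical helper written for illustration and is not part of the InternVL code base.

```python
def lora_settings(use_backbone_lora=0, use_qllama_lora=0):
    """Hypothetical helper: map the integer flags to per-module LoRA settings."""
    settings = {}
    for name, rank in (("backbone", use_backbone_lora),
                       ("qllama", use_qllama_lora)):
        # Zero disables LoRA; any non-zero value is read as the LoRA rank.
        settings[name] = {
            "enabled": rank != 0,
            "rank": rank if rank != 0 else None,
        }
    return settings


defaults = lora_settings()                    # both flags left at 0
tuned = lora_settings(use_backbone_lora=16)   # LoRA on the backbone only
```

Encoding the rank in the flag keeps the config flat, at the cost of making 0 a reserved value that cannot be used as a rank.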
Usage Examples
Basic Usage
from llava.model.multimodal_encoder.internvl_14b.configuration_internvl import InternVLConfig
# Load from pretrained
config = InternVLConfig.from_pretrained("path/to/InternVL-14B")
# Access sub-configs
print(config.vision_config.hidden_size) # 3200
print(config.qllama_config.hidden_size) # LLaMA hidden size
print(config.num_query_token) # 96