Implementation:OpenGVLab InternVL InternVLConfig LLaVA

Knowledge Sources	OpenGVLab_InternVL
Domains	Multimodal Models, Configuration, LLaVA
Last Updated	2026-02-07 14:00 GMT

Overview

Composite Hugging Face configuration class for the InternVL-14B model. It combines the InternViT vision configuration and the QLLaMA language configuration in a single config object.

Description

InternVLConfig extends PretrainedConfig with is_composition = True and composes two sub-configurations: InternVisionConfig for the vision encoder and LlamaConfig for the QLLaMA language backbone. Both sub-configs are built from dictionaries (vision_config and qllama_config), and two InternVL-specific parameters are injected into the QLLaMA config: num_query_token (default 96) and cross_attention_frequency (default 2). The top-level parameters are clip_embed_dim (default 768), attn_pool_num_heads (default 16), label_smoothing (default 0.0), initializer_range (default 0.02), use_backbone_lora and use_qllama_lora (both default 0; a non-zero value enables LoRA), and force_image_size (default None), which overrides the image resolution when set. The top-level hidden_size is taken from the QLLaMA config. A custom to_dict() method converts both sub-configurations to dictionaries so that the composite config can be serialized to JSON.
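The composition pattern described above can be sketched in plain Python. This is a simplified stand-in, not the actual InternVLConfig source: SubConfig and CompositeConfigSketch are hypothetical names, only a subset of the parameters is modeled, and the hidden size of 4096 is an assumed value chosen for illustration.

```python
import copy
import json


class SubConfig:
    """Hypothetical stand-in for InternVisionConfig / LlamaConfig."""

    def __init__(self, **kwargs):
        # Store every keyword as an attribute, as a plain config object would.
        self.__dict__.update(kwargs)

    def to_dict(self):
        return copy.deepcopy(self.__dict__)


class CompositeConfigSketch:
    """Illustrative only; not the actual InternVLConfig implementation."""

    def __init__(self, vision_config=None, qllama_config=None,
                 num_query_token=96, cross_attention_frequency=2):
        # Missing sub-config dicts fall back to empty dicts.
        self.vision_config = SubConfig(**dict(vision_config or {}))
        self.qllama_config = SubConfig(**dict(qllama_config or {}))

        # Inject the InternVL-specific parameters into the language sub-config.
        self.qllama_config.num_query_token = num_query_token
        self.qllama_config.cross_attention_frequency = cross_attention_frequency

        self.num_query_token = num_query_token
        self.cross_attention_frequency = cross_attention_frequency
        # Derive hidden_size from the language sub-config
        # (None in this sketch if the dict did not provide one).
        self.hidden_size = getattr(self.qllama_config, "hidden_size", None)

    def to_dict(self):
        # Replace the nested config objects with plain dicts so that the
        # result contains only JSON-serializable values.
        output = copy.deepcopy(self.__dict__)
        output["vision_config"] = self.vision_config.to_dict()
        output["qllama_config"] = self.qllama_config.to_dict()
        return output


config = CompositeConfigSketch(
    vision_config={"hidden_size": 3200},
    qllama_config={"hidden_size": 4096},  # assumed value for illustration
)
serialized = json.dumps(config.to_dict(), sort_keys=True)
```

Without the to_dict() override, the nested sub-config objects would remain in the dictionary and json.dumps would raise a TypeError, which is the reason a composite config needs its own serialization step.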

Usage

Use this configuration when loading or creating an InternVL-14B composite model that pairs InternViT-6B with a QLLaMA language model, within the LLaVA multimodal encoder framework.

Code Reference

Source Location

Repository: OpenGVLab_InternVL
File: internvl_chat_llava/llava/model/multimodal_encoder/internvl_14b/configuration_internvl.py
Lines: 1-108

Signature

class InternVLConfig(PretrainedConfig):
    model_type = 'internvl'
    is_composition = True
    def __init__(self, vision_config=None, qllama_config=None,
                 clip_embed_dim=768, attn_pool_num_heads=16,
                 num_query_token=96, label_smoothing=0.0,
                 cross_attention_frequency=2, use_backbone_lora=0,
                 use_qllama_lora=0, force_image_size=None,
                 initializer_range=0.02, **kwargs): ...
    def to_dict(self): ...

Import

from llava.model.multimodal_encoder.internvl_14b.configuration_internvl import InternVLConfig

I/O Contract

Inputs

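Name	Type	Required	Description
vision_config	dict	No	Configuration dict for InternVisionConfig (defaults to empty)
qllama_config	dict	No	Configuration dict for LlamaConfig (defaults to empty)
clip_embed_dim	int	No	CLIP embedding dimension (default: 768)
attn_pool_num_heads	int	No	Number of attention-pooling heads (default: 16)
num_query_token	int	No	Number of query tokens (default: 96)
label_smoothing	float	No	Label smoothing factor (default: 0.0)
cross_attention_frequency	int	No	Frequency of cross-attention layers (default: 2)
use_backbone_lora	int	No	LoRA rank for backbone (default: 0 = disabled)
use_qllama_lora	int	No	LoRA rank for QLLaMA (default: 0 = disabled)
force_image_size	int or None	No	Override image resolution if set (default: None)
initializer_range	float	No	Weight initializer range (default: 0.02)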
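When preparing constructor arguments, it can help to keep the documented defaults in one place. The following is a minimal sketch under stated assumptions: DOCUMENTED_DEFAULTS, build_kwargs, and lora_flags are hypothetical helpers that are not part of the InternVL codebase, and rejecting unknown names is a choice of this sketch (the documented constructor itself accepts **kwargs).

```python
# Defaults copied from the InternVLConfig signature documented on this page.
# DOCUMENTED_DEFAULTS, build_kwargs, and lora_flags are hypothetical helpers.
DOCUMENTED_DEFAULTS = {
    "clip_embed_dim": 768,
    "attn_pool_num_heads": 16,
    "num_query_token": 96,
    "label_smoothing": 0.0,
    "cross_attention_frequency": 2,
    "use_backbone_lora": 0,
    "use_qllama_lora": 0,
    "force_image_size": None,
    "initializer_range": 0.02,
}


def build_kwargs(**overrides):
    """Merge overrides onto the documented defaults, rejecting unknown names."""
    unknown = sorted(set(overrides) - set(DOCUMENTED_DEFAULTS))
    if unknown:
        raise KeyError(f"not a documented top-level parameter: {unknown}")
    return {**DOCUMENTED_DEFAULTS, **overrides}


def lora_flags(kwargs):
    """Report which LoRA options are enabled (a non-zero value means enabled)."""
    return {
        "backbone": kwargs["use_backbone_lora"] != 0,
        "qllama": kwargs["use_qllama_lora"] != 0,
    }


# The rank 16 and image size 224 are arbitrary illustrative values.
kwargs = build_kwargs(use_backbone_lora=16, force_image_size=224)
flags = lora_flags(kwargs)
```

Given the documented signature, the resulting dictionary could then be passed along with the two sub-config dicts, for example InternVLConfig(vision_config={...}, qllama_config={...}, **kwargs).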

Outputs

Name	Type	Description
config	InternVLConfig	Composite configuration with vision_config and qllama_config sub-configs

Usage Examples

Basic Usage

from llava.model.multimodal_encoder.internvl_14b.configuration_internvl import InternVLConfig

# Load the configuration from a pretrained checkpoint (placeholder path)
config = InternVLConfig.from_pretrained("path/to/InternVL-14B")

# Access the sub-configs and the top-level parameters
print(config.vision_config.hidden_size)  # 3200
print(config.qllama_config.hidden_size)  # LLaMA hidden size
print(config.num_query_token)  # 96

Related Pages

Principle:OpenGVLab_InternVL_Vision_Encoder_Configuration
