
Implementation:OpenGVLab InternVL InternVLConfig LLaVA

From Leeroopedia


Knowledge Sources
Domains Multimodal Models, Configuration, LLaVA
Last Updated 2026-02-07 14:00 GMT

Overview

Composite HuggingFace configuration for the InternVL-14B model that combines InternViT vision and QLLaMA language configurations into a single unified config class.

Description

InternVLConfig extends PretrainedConfig with is_composition = True and composes two sub-configurations: InternVisionConfig for the vision encoder and LlamaConfig for the QLLaMA language backbone. Each sub-config is built from a dictionary; when no dictionary is passed, an empty one is used and the sub-config falls back to its own defaults. Two InternVL-specific parameters are then injected into the QLLaMA config: num_query_token (default 96) and cross_attention_frequency (default 2). The top-level hidden_size is copied from the QLLaMA config.

Additional top-level parameters are clip_embed_dim (default 768), attn_pool_num_heads (default 16), label_smoothing (default 0.0), initializer_range (default 0.02), use_backbone_lora and use_qllama_lora (both default 0; a non-zero value enables LoRA), and force_image_size, which overrides the image resolution when set.

A custom to_dict() method converts both sub-configurations to dictionaries so that the composite config serializes cleanly to JSON.
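The composition pattern described above can be sketched without any external dependencies. This is a simplified stand-in, not the actual InternVLConfig: SubConfig and CompositeConfigSketch are hypothetical names, and the fallback hidden_size of 4096 is an assumption made only so the example runs on its own.

```python
import copy
import json


class SubConfig:
    """Hypothetical stand-in for InternVisionConfig / LlamaConfig."""

    def __init__(self, **kwargs):
        # Assumed default for illustration; the real defaults live in the sub-config classes.
        self.hidden_size = kwargs.pop("hidden_size", 4096)
        for key, value in kwargs.items():
            setattr(self, key, value)

    def to_dict(self):
        return copy.deepcopy(self.__dict__)


class CompositeConfigSketch:
    """Simplified illustration of a composite config with two sub-configs."""

    model_type = "internvl"

    def __init__(self, vision_config=None, qllama_config=None,
                 num_query_token=96, cross_attention_frequency=2):
        # A missing sub-config dict falls back to an empty dict (sub-config defaults).
        self.vision_config = SubConfig(**(vision_config or {}))
        self.qllama_config = SubConfig(**(qllama_config or {}))
        # InternVL-specific values are injected into the language sub-config.
        self.qllama_config.num_query_token = num_query_token
        self.qllama_config.cross_attention_frequency = cross_attention_frequency
        # The top-level hidden size mirrors the language backbone.
        self.hidden_size = self.qllama_config.hidden_size
        self.num_query_token = num_query_token
        self.cross_attention_frequency = cross_attention_frequency

    def to_dict(self):
        # Replace the nested config objects with plain dicts so json.dumps works.
        output = copy.deepcopy(self.__dict__)
        output["vision_config"] = self.vision_config.to_dict()
        output["qllama_config"] = self.qllama_config.to_dict()
        output["model_type"] = self.__class__.model_type
        return output


config = CompositeConfigSketch(vision_config={"hidden_size": 3200})
serialized = json.dumps(config.to_dict(), indent=2)
```

Without the to_dict() override, the nested config objects would remain in the dictionary and json.dumps would fail on them, which is why the composite class converts each sub-config explicitly.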

Usage

Use this configuration when loading or creating an InternVL-14B composite model that pairs InternViT-6B with a QLLaMA language model, within the LLaVA multimodal encoder framework.

Code Reference

Source Location

Signature

class InternVLConfig(PretrainedConfig):
    model_type = 'internvl'
    is_composition = True
    def __init__(self, vision_config=None, qllama_config=None,
                 clip_embed_dim=768, attn_pool_num_heads=16,
                 num_query_token=96, label_smoothing=0.0,
                 cross_attention_frequency=2, use_backbone_lora=0,
                 use_qllama_lora=0, force_image_size=None,
                 initializer_range=0.02, **kwargs): ...
    def to_dict(self): ...

Import

from llava.model.multimodal_encoder.internvl_14b.configuration_internvl import InternVLConfig

I/O Contract

Inputs

Name | Type | Required | Description
vision_config | dict | No | Configuration dict for InternVisionConfig (defaults to empty)
qllama_config | dict | No | Configuration dict for LlamaConfig (defaults to empty)
num_query_token | int | No | Number of query tokens (default: 96)
cross_attention_frequency | int | No | Frequency of cross-attention layers (default: 2)
use_backbone_lora | int | No | LoRA rank for backbone (0 = disabled)
use_qllama_lora | int | No | LoRA rank for QLLaMA (0 = disabled)
force_image_size | int or None | No | Override image resolution if set
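As an illustration of how the integer LoRA flags and force_image_size might be interpreted by code that consumes the config, here is a minimal sketch. The helper resolve_options, the RuntimeOptions dataclass, and the default_image_size of 224 are all hypothetical and are not part of InternVLConfig; the sketch only encodes the conventions listed in the inputs (0 disables LoRA, None leaves the resolution unchanged).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RuntimeOptions:
    """Hypothetical container for the resolved settings."""
    backbone_lora_rank: Optional[int]  # None means LoRA is disabled
    qllama_lora_rank: Optional[int]    # None means LoRA is disabled
    image_size: int


def resolve_options(use_backbone_lora: int = 0,
                    use_qllama_lora: int = 0,
                    force_image_size: Optional[int] = None,
                    default_image_size: int = 224) -> RuntimeOptions:
    """Interpret the flag conventions: 0 = no LoRA, None = keep the default resolution."""
    if use_backbone_lora < 0 or use_qllama_lora < 0:
        raise ValueError("LoRA ranks must be non-negative")
    # default_image_size is an assumed placeholder for the vision sub-config's value.
    image_size = default_image_size if force_image_size is None else force_image_size
    return RuntimeOptions(
        backbone_lora_rank=use_backbone_lora or None,
        qllama_lora_rank=use_qllama_lora or None,
        image_size=image_size,
    )


defaults = resolve_options()
tuned = resolve_options(use_backbone_lora=16, force_image_size=448)
```

With the defaults, both ranks resolve to None and the image size stays at the assumed default; in the second call only the backbone receives a LoRA rank and the resolution is overridden.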

Outputs

Name | Type | Description
config | InternVLConfig | Composite configuration with vision_config and qllama_config sub-configs

Usage Examples

Basic Usage

from llava.model.multimodal_encoder.internvl_14b.configuration_internvl import InternVLConfig

# Load from pretrained
config = InternVLConfig.from_pretrained("path/to/InternVL-14B")

# Access sub-configs
print(config.vision_config.hidden_size)  # 3200
print(config.qllama_config.hidden_size)  # LLaMA hidden size
print(config.num_query_token)  # 96
