Implementation:OpenGVLab InternVL EvaCLIP Configuration
| Knowledge Sources | |
|---|---|
| Domains | Model Configuration, Vision Transformer, CLIP |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
This module defines the HuggingFace-compatible configuration classes for the EVA-CLIP model, including separate text, vision, and composite configurations.
Description
The module provides three configuration classes adapted from HuggingFace's CLIP configuration:
- EvaCLIPTextConfig stores parameters for the text encoder, including vocab_size (default 49408), hidden_size (default 512), intermediate_size (default 2048), num_hidden_layers (default 12), num_attention_heads (default 8), max_position_embeddings (default 77), and the EVA-specific options q_bias, k_bias, v_bias (all default True) and post_layernorm (default False).
- EvaCLIPVisionConfig stores parameters for the vision encoder, including hidden_size (default 768), image_size (default 224), patch_size (default 32), num_channels (default 3), and the same q/k/v bias and post-layernorm options.
- EvaCLIPConfig is the composite configuration that holds both the text and vision sub-configs plus shared parameters such as projection_dim (default 512) and logit_scale_init_value (default 2.6592). It supports construction from sub-configs via the from_text_vision_configs class method and handles backward-compatible merging of the text_config_dict and vision_config_dict parameters.
All classes extend PretrainedConfig and support loading from pretrained model repositories.
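The composite layout described above can be sketched without the real classes. The snippet below is a minimal illustration, not the module's implementation: `TextConfigSketch`, `VisionConfigSketch`, and `CompositeConfigSketch` are hypothetical dataclass stand-ins that only reuse the parameter names and defaults listed in this section, and they do not extend PretrainedConfig.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class TextConfigSketch:
    """Hypothetical stand-in for EvaCLIPTextConfig (defaults taken from this page)."""
    vocab_size: int = 49408
    hidden_size: int = 512
    intermediate_size: int = 2048
    num_hidden_layers: int = 12
    num_attention_heads: int = 8
    max_position_embeddings: int = 77
    q_bias: bool = True
    k_bias: bool = True
    v_bias: bool = True
    post_layernorm: bool = False


@dataclass
class VisionConfigSketch:
    """Hypothetical stand-in for EvaCLIPVisionConfig (defaults taken from this page)."""
    hidden_size: int = 768
    image_size: int = 224
    patch_size: int = 32
    num_channels: int = 3
    q_bias: bool = True
    k_bias: bool = True
    v_bias: bool = True
    post_layernorm: bool = False


@dataclass
class CompositeConfigSketch:
    """Hypothetical stand-in for EvaCLIPConfig: two sub-configs plus shared fields."""
    text_config: TextConfigSketch = field(default_factory=TextConfigSketch)
    vision_config: VisionConfigSketch = field(default_factory=VisionConfigSketch)
    projection_dim: int = 512
    logit_scale_init_value: float = 2.6592

    @classmethod
    def from_text_vision_configs(cls, text_config, vision_config, **kwargs):
        # Mirrors the idea of building the composite from two existing sub-configs.
        return cls(text_config=text_config, vision_config=vision_config, **kwargs)


composite = CompositeConfigSketch.from_text_vision_configs(
    TextConfigSketch(hidden_size=768),
    VisionConfigSketch(hidden_size=1024, image_size=336, patch_size=14),
)

# asdict() recurses into nested dataclasses, giving one nested dict per sub-config.
nested = asdict(composite)
```

The nested dictionary keeps each encoder's settings under its own key, which is the property that lets a composite configuration be serialized and split back into its text and vision parts.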
Usage
Use these configuration classes when instantiating EVA-CLIP models as visual encoders in the LLaVA framework, or when loading pretrained EVA-CLIP checkpoints.
Code Reference
Source Location
- Repository: OpenGVLab_InternVL
- File: internvl_chat_llava/llava/model/multimodal_encoder/eva_clip/configuration_evaclip.py
- Lines: 1-425
Signature
class EvaCLIPTextConfig(PretrainedConfig):
    model_type = "clip_text_model"

    def __init__(self, vocab_size=49408, hidden_size=512, intermediate_size=2048,
                 projection_dim=512, num_hidden_layers=12, num_attention_heads=8,
                 max_position_embeddings=77, hidden_act="gelu", ...):
        ...

class EvaCLIPVisionConfig(PretrainedConfig):
    model_type = "clip_vision_model"

    def __init__(self, hidden_size=768, intermediate_size=3072, projection_dim=512,
                 num_hidden_layers=12, num_attention_heads=12, image_size=224,
                 patch_size=32, ...):
        ...

class EvaCLIPConfig(PretrainedConfig):
    model_type = "clip"
    is_composition = True

    def __init__(self, text_config=None, vision_config=None,
                 projection_dim=512, logit_scale_init_value=2.6592, **kwargs):
        ...
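The default logit_scale_init_value of 2.6592 in the signature matches, to four decimal places, ln(1/0.07), where 0.07 is the initial softmax temperature used by the original CLIP. The short check below illustrates that relationship. It assumes EVA-CLIP follows the usual CLIP convention of storing the scale in log space and exponentiating it before multiplying the image-text similarities; this page does not confirm how the model consumes the value.

```python
import math

# Default from the EvaCLIPConfig signature above.
logit_scale_init_value = 2.6592

# Assumed CLIP convention: the stored value is a log-scale, so the effective
# multiplier applied to cosine similarities is exp(logit_scale_init_value).
initial_scale = math.exp(logit_scale_init_value)

# Temperature from the original CLIP setup; its reciprocal is the multiplier.
clip_temperature = 0.07
expected_scale = 1.0 / clip_temperature

print(f"exp({logit_scale_init_value}) = {initial_scale:.4f}")
print(f"1 / {clip_temperature} = {expected_scale:.4f}")
```

Both printed values are close to 14.29, so under that assumption the default corresponds to an initial temperature of roughly 0.07.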
Import
from internvl_chat_llava.llava.model.multimodal_encoder.eva_clip.configuration_evaclip import (
    EvaCLIPTextConfig,
    EvaCLIPVisionConfig,
    EvaCLIPConfig,
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| hidden_size | int | No | Dimensionality of encoder layers (default 512 for text, 768 for vision) |
| num_hidden_layers | int | No | Number of transformer layers (default 12) |
| num_attention_heads | int | No | Number of attention heads per layer (default 8 for text, 12 for vision) |
| image_size | int | No | Input image resolution (vision config, default 224) |
| patch_size | int | No | Patch size for vision embedding (default 32) |
| projection_dim | int | No | Dimension of shared projection space (default 512) |
| text_config | dict | No | Dict to initialize EvaCLIPTextConfig |
| vision_config | dict | No | Dict to initialize EvaCLIPVisionConfig |
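Several of the inputs above determine derived sizes that are useful to sanity-check before building a model. The sketch below computes the patch count and per-head dimension, assuming the usual Vision Transformer conventions (square images split into non-overlapping square patches, and hidden_size divided evenly across attention heads). The helper names `num_patches` and `head_dim` are hypothetical and are not part of the configuration module.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping patches for a square image (assumed ViT layout)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def head_dim(hidden_size: int, num_attention_heads: int) -> int:
    """Per-head width, assuming hidden_size is split evenly across heads."""
    if hidden_size % num_attention_heads != 0:
        raise ValueError("hidden_size must be divisible by num_attention_heads")
    return hidden_size // num_attention_heads


# Vision defaults from the table: image_size=224, patch_size=32.
default_patches = num_patches(224, 32)   # 7 x 7 grid

# Values used in the usage example on this page: image_size=336, patch_size=14.
large_patches = num_patches(336, 14)     # 24 x 24 grid

# Default head widths for the text (512 / 8) and vision (768 / 12) encoders.
text_head_dim = head_dim(512, 8)
vision_head_dim = head_dim(768, 12)
```

With the defaults, the vision encoder sees 49 patches and both encoders use 64-dimensional heads; moving to image_size=336 with patch_size=14 raises the patch count to 576. Whether an extra class token is added to that sequence depends on the model implementation, which this page does not specify.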
Outputs
| Name | Type | Description |
|---|---|---|
| config | EvaCLIPConfig/EvaCLIPTextConfig/EvaCLIPVisionConfig | Configuration object storing all model hyperparameters |
Usage Examples
Basic Usage
from internvl_chat_llava.llava.model.multimodal_encoder.eva_clip.configuration_evaclip import (
    EvaCLIPVisionConfig,
    EvaCLIPTextConfig,
    EvaCLIPConfig,
)
# Create individual configs
vision_config = EvaCLIPVisionConfig(hidden_size=1024, image_size=336, patch_size=14)
text_config = EvaCLIPTextConfig(hidden_size=768, num_hidden_layers=12)
# Compose into full config
config = EvaCLIPConfig.from_text_vision_configs(text_config, vision_config)
# Or load from pretrained
config = EvaCLIPConfig.from_pretrained("QuanSun/EVA02_CLIP_E_psz14_plus_s9B")