Implementation:OpenGVLab InternVL EvaCLIP Configuration

From Leeroopedia


Knowledge Sources
Domains: Model Configuration, Vision Transformer, CLIP
Last Updated: 2026-02-07 14:00 GMT

Overview

This module defines the HuggingFace-compatible configuration classes for the EVA-CLIP model, including separate text, vision, and composite configurations.

Description

The module provides three configuration classes adapted from HuggingFace's CLIP configuration:

EvaCLIPTextConfig stores parameters for the text encoder, including vocab_size (default 49408), hidden_size (default 512), intermediate_size (default 2048), num_hidden_layers (default 12), num_attention_heads (default 8), and max_position_embeddings (default 77), plus EVA-specific options: q_bias, k_bias, and v_bias (all default True) and post_layernorm (default False).
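As a quick illustration of how two of these defaults relate, the width of each attention head is hidden_size divided by num_attention_heads. The helper below is a hypothetical stand-in, not part of the module, and it assumes the even multi-head split that CLIP-style encoders normally use.

```python
def head_dim(hidden_size: int, num_attention_heads: int) -> int:
    """Return the per-head width, assuming hidden_size is split evenly across heads."""
    if hidden_size % num_attention_heads != 0:
        raise ValueError(
            f"hidden_size={hidden_size} is not divisible by "
            f"num_attention_heads={num_attention_heads}"
        )
    return hidden_size // num_attention_heads


# Text-encoder defaults documented above: hidden_size=512, num_attention_heads=8.
text_head_dim = head_dim(512, 8)
print(text_head_dim)  # 64
```

A check like this can catch an inconsistent pair of overrides before a model is built from the configuration.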

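EvaCLIPVisionConfig stores parameters for the vision encoder, including hidden_size (default 768), intermediate_size (default 3072), num_hidden_layers (default 12), num_attention_heads (default 12), image_size (default 224), patch_size (default 32), and num_channels (default 3), plus the same q_bias, k_bias, v_bias, and post_layernorm options as the text config.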
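To make the role of image_size and patch_size concrete, the sketch below computes how many patches a square image is cut into. The function name patch_grid is hypothetical and the code assumes non-overlapping square patches; CLIP-style vision transformers typically prepend one class token as well, which would add one to the sequence length.

```python
from typing import Tuple


def patch_grid(image_size: int, patch_size: int) -> Tuple[int, int]:
    """Return (patches_per_side, total_patches) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError(
            f"image_size={image_size} is not a multiple of patch_size={patch_size}"
        )
    side = image_size // patch_size
    return side, side * side


# Defaults documented above: 224-pixel images, 32-pixel patches.
print(patch_grid(224, 32))  # (7, 49)

# The values used in the Basic Usage example: 336-pixel images, 14-pixel patches.
print(patch_grid(336, 14))  # (24, 576)
```

A smaller patch size sharply increases the number of tokens the vision encoder processes, which is worth keeping in mind when overriding these two values.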

EvaCLIPConfig is the composite configuration that holds both text and vision sub-configs plus shared parameters like projection_dim (default 512) and logit_scale_init_value (default 2.6592). It supports construction from sub-configs via the from_text_vision_configs class method and handles backward-compatible merging of text_config_dict and vision_config_dict parameters.
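The default logit_scale_init_value of 2.6592 is easier to read once exponentiated. The sketch below assumes EVA-CLIP follows the original CLIP convention, in which the stored value is a log-scale and the model multiplies cosine similarities by its exponential; the variable names are illustrative only.

```python
import math

logit_scale_init_value = 2.6592  # default documented above

# Assumption: the model applies exp() to the stored value, as in CLIP.
scale = math.exp(logit_scale_init_value)  # multiplier on cosine similarities
temperature = 1.0 / scale                 # equivalent softmax temperature

print(f"scale={scale:.4f}, temperature={temperature:.4f}")
```

Under that assumption the default corresponds to a temperature of about 0.07, the starting value used by CLIP.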

All classes extend PretrainedConfig and support loading from pretrained model repositories.
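The backward-compatible merging of text_config_dict and vision_config_dict mentioned for EvaCLIPConfig can be pictured as overlaying a legacy dict on the newer sub-config dict. The following is a minimal standard-library sketch, not the module's code: merge_legacy_config is a hypothetical name, and letting the legacy value win on conflict is an assumption modeled on the upstream Hugging Face CLIP configuration.

```python
from typing import Any, Dict, List, Optional, Tuple


def merge_legacy_config(
    config: Optional[Dict[str, Any]],
    legacy_dict: Optional[Dict[str, Any]],
) -> Tuple[Dict[str, Any], List[str]]:
    """Overlay legacy_dict on config; return the merged dict and conflicting keys."""
    merged = dict(config or {})
    conflicts: List[str] = []
    for key, value in (legacy_dict or {}).items():
        if key in merged and merged[key] != value:
            conflicts.append(key)  # a real implementation might log a warning here
        merged[key] = value  # assumption: the legacy value takes precedence
    return merged, conflicts


merged, conflicts = merge_legacy_config(
    {"hidden_size": 512, "num_hidden_layers": 12},  # text_config
    {"hidden_size": 768},                           # text_config_dict (legacy)
)
print(merged)     # {'hidden_size': 768, 'num_hidden_layers': 12}
print(conflicts)  # ['hidden_size']
```

Returning the conflicting keys keeps the sketch easy to test; the precedence rule in the actual module should be confirmed against its source.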

Usage

Use these configuration classes when instantiating EVA-CLIP models as visual encoders in the LLaVA framework, or when loading pretrained EVA-CLIP checkpoints.

Code Reference

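Source Location

internvl_chat_llava/llava/model/multimodal_encoder/eva_clip/configuration_evaclip.py (path implied by the import statement)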

Signature

class EvaCLIPTextConfig(PretrainedConfig):
    model_type = "clip_text_model"
    def __init__(self, vocab_size=49408, hidden_size=512, intermediate_size=2048,
                 projection_dim=512, num_hidden_layers=12, num_attention_heads=8,
                 max_position_embeddings=77, hidden_act="gelu", ...):
        ...

class EvaCLIPVisionConfig(PretrainedConfig):
    model_type = "clip_vision_model"
    def __init__(self, hidden_size=768, intermediate_size=3072, projection_dim=512,
                 num_hidden_layers=12, num_attention_heads=12, image_size=224,
                 patch_size=32, ...):
        ...

class EvaCLIPConfig(PretrainedConfig):
    model_type = "clip"
    is_composition = True
    def __init__(self, text_config=None, vision_config=None,
                 projection_dim=512, logit_scale_init_value=2.6592, **kwargs):
        ...

Import

from internvl_chat_llava.llava.model.multimodal_encoder.eva_clip.configuration_evaclip import (
    EvaCLIPTextConfig,
    EvaCLIPVisionConfig,
    EvaCLIPConfig,
)

I/O Contract

Inputs

Name | Type | Required | Description
hidden_size | int | No | Dimensionality of the encoder layers (default 512 for text, 768 for vision)
num_hidden_layers | int | No | Number of transformer layers (default 12)
num_attention_heads | int | No | Number of attention heads per layer (default 8 for text, 12 for vision)
image_size | int | No | Input image resolution (vision config, default 224)
patch_size | int | No | Patch size for the vision embedding (vision config, default 32)
projection_dim | int | No | Dimension of the shared projection space (default 512)
text_config | dict | No | Dict used to initialize EvaCLIPTextConfig
vision_config | dict | No | Dict used to initialize EvaCLIPVisionConfig

Outputs

Name | Type | Description
config | EvaCLIPConfig / EvaCLIPTextConfig / EvaCLIPVisionConfig | Configuration object storing all model hyperparameters

Usage Examples

Basic Usage

from internvl_chat_llava.llava.model.multimodal_encoder.eva_clip.configuration_evaclip import (
    EvaCLIPVisionConfig, EvaCLIPTextConfig, EvaCLIPConfig
)

# Create individual configs
vision_config = EvaCLIPVisionConfig(hidden_size=1024, image_size=336, patch_size=14)
text_config = EvaCLIPTextConfig(hidden_size=768, num_hidden_layers=12)

# Compose into full config
config = EvaCLIPConfig.from_text_vision_configs(text_config, vision_config)

# Or load from pretrained
config = EvaCLIPConfig.from_pretrained("QuanSun/EVA02_CLIP_E_psz14_plus_s9B")
