Implementation:OpenGVLab InternVL EvaCLIP Model

Knowledge Sources	OpenGVLab_InternVL
Domains	Vision Transformer, Contrastive Learning, CLIP
Last Updated	2026-02-07 14:00 GMT

Overview

This module provides the complete PyTorch implementation of the EVA-CLIP model, including the vision encoder, the text encoder, and the contrastive learning components, adapted for use as a multimodal encoder in the LLaVA framework.

Description

The implementation contains the full EVA-CLIP architecture with the following key components:

Vision pipeline: EvaCLIPVisionEmbeddings converts images to patch embeddings via a Conv2d layer, prepends a learnable class token, and adds positional embeddings. EvaCLIPAttention implements multi-head self-attention with configurable q/k/v bias flags (standard CLIP, by contrast, always uses bias). EvaCLIPMLP provides the feed-forward layers. EvaCLIPEncoderLayer supports both pre-layernorm and post-layernorm configurations via the post_layernorm flag. EvaCLIPEncoder stacks these layers and supports gradient checkpointing. EvaCLIPVisionTransformer and EvaCLIPVisionModel wrap the encoder and pool its output by taking the CLS token.
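
The embedding step described above can be sketched in plain PyTorch. This is a minimal illustration, not the repository's EvaCLIPVisionEmbeddings: the class name PatchEmbeddingsSketch, its default sizes, and the attribute names are assumptions chosen to keep the example small.

```python
import torch
import torch.nn as nn


class PatchEmbeddingsSketch(nn.Module):
    """Hypothetical stand-in for the vision embedding step (not the repository class)."""

    def __init__(self, hidden_size=64, image_size=28, patch_size=14, num_channels=3):
        super().__init__()
        # Non-overlapping patches: kernel size equals stride.
        self.patch_embedding = nn.Conv2d(
            num_channels, hidden_size, kernel_size=patch_size, stride=patch_size
        )
        # Learnable class token that is prepended to the patch sequence.
        self.class_embedding = nn.Parameter(torch.randn(hidden_size))
        self.num_patches = (image_size // patch_size) ** 2
        # One position per patch, plus one for the class token.
        self.position_embedding = nn.Embedding(self.num_patches + 1, hidden_size)

    def forward(self, pixel_values):
        batch = pixel_values.shape[0]
        patches = self.patch_embedding(pixel_values)     # [B, hidden, H/P, W/P]
        patches = patches.flatten(2).transpose(1, 2)     # [B, num_patches, hidden]
        cls = self.class_embedding.expand(batch, 1, -1)  # [B, 1, hidden]
        tokens = torch.cat([cls, patches], dim=1)        # [B, num_patches + 1, hidden]
        position_ids = torch.arange(self.num_patches + 1, device=tokens.device)
        return tokens + self.position_embedding(position_ids)


embeddings = PatchEmbeddingsSketch()
out = embeddings(torch.randn(2, 3, 28, 28))  # 2 images, 2x2 patches each
```

With a 28x28 input and 14x14 patches, each image yields 4 patch tokens plus the class token, so the sequence length is 5.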

Text pipeline: EvaCLIPTextEmbeddings handles token and position embeddings. EvaCLIPTextAttention provides text-specific attention (structurally similar to vision attention). EvaCLIPTextTransformer adds a causal attention mask and pools via the EOS token position. EvaCLIPTextModel wraps the text transformer.
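
The two text-specific steps, causal masking and EOS pooling, can be sketched as follows. The helper names causal_mask and pool_eos are hypothetical. The sketch also assumes the convention of the original CLIP tokenizer, where the EOS token has the largest ID in each sequence so that argmax locates it; whether this module locates EOS the same way is an assumption.

```python
import torch


def causal_mask(seq_len, dtype=torch.float32):
    # Additive mask: 0 on and below the diagonal, a large negative value above it,
    # so each position can only attend to itself and earlier positions.
    mask = torch.full((seq_len, seq_len), torch.finfo(dtype).min, dtype=dtype)
    return torch.triu(mask, diagonal=1)


def pool_eos(hidden_states, input_ids):
    # Assumption: the EOS token has the highest ID in each sequence,
    # so argmax over the token IDs gives its position.
    eos_positions = input_ids.argmax(dim=-1)
    batch_index = torch.arange(hidden_states.shape[0], device=hidden_states.device)
    return hidden_states[batch_index, eos_positions]  # [batch, hidden_size]


mask = causal_mask(3)
input_ids = torch.tensor([[5, 9, 0, 0], [3, 4, 9, 0]])  # 9 plays the role of EOS
hidden = torch.arange(16, dtype=torch.float32).reshape(2, 4, 2)
pooled = pool_eos(hidden, input_ids)
```

Here the first sequence is pooled at position 1 and the second at position 2, the locations of the stand-in EOS token.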

Combined model: EvaCLIPModel combines both towers with visual_projection and text_projection linear layers, a learnable logit_scale parameter, and computes contrastive loss via symmetric cross-entropy over the similarity matrix.
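
The symmetric cross-entropy over the similarity matrix can be sketched like this. It follows the standard CLIP formulation rather than reproducing this file's code; the function name clip_contrastive_loss is hypothetical, and the logit_scale value of 2.6592 (approximately ln(1/0.07), the initial value used by the original CLIP) is an assumption for the example.

```python
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_embeds, text_embeds, logit_scale):
    # Normalize so the dot product is a cosine similarity.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # logit_scale is stored in log space, so exponentiate before scaling.
    logits_per_text = logit_scale.exp() * (text_embeds @ image_embeds.t())
    logits_per_image = logits_per_text.t()

    # The i-th image matches the i-th text, so targets are the diagonal indices.
    targets = torch.arange(logits_per_text.shape[0], device=logits_per_text.device)
    text_loss = F.cross_entropy(logits_per_text, targets)
    image_loss = F.cross_entropy(logits_per_image, targets)
    return (text_loss + image_loss) / 2, logits_per_image


torch.manual_seed(0)
image_embeds = torch.randn(4, 8)    # stand-in projected image embeddings
text_embeds = torch.randn(4, 8)     # stand-in projected text embeddings
logit_scale = torch.tensor(2.6592)  # assumed initial value, ~ln(1/0.07)
loss, logits_per_image = clip_contrastive_loss(image_embeds, text_embeds, logit_scale)
```

Averaging the two directions makes the loss invariant to swapping the image and text inputs.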

Projection variants: EvaCLIPTextModelWithProjection and EvaCLIPVisionModelWithProjection provide standalone models with projection heads for extracting embeddings.

All models inherit from EvaCLIPPreTrainedModel, which handles weight initialization with architecture-specific strategies.

Usage

Use EVA-CLIP as an alternative vision encoder to standard CLIP or InternViT in the LLaVA framework, particularly when stronger visual features from EVA-02 pretraining are desired.

Code Reference

Source Location

Repository: OpenGVLab_InternVL
File: internvl_chat_llava/llava/model/multimodal_encoder/eva_clip/modeling_evaclip.py
Lines: 1-1428

Signature

class EvaCLIPVisionModel(EvaCLIPPreTrainedModel):
    config_class = EvaCLIPVisionConfig
    def forward(self, pixel_values=None, output_attentions=None,
                output_hidden_states=None, return_dict=None):
        ...

class EvaCLIPTextModel(EvaCLIPPreTrainedModel):
    config_class = EvaCLIPTextConfig
    def forward(self, input_ids=None, attention_mask=None, position_ids=None,
                output_attentions=None, output_hidden_states=None, return_dict=None):
        ...

class EvaCLIPModel(EvaCLIPPreTrainedModel):
    config_class = EvaCLIPConfig
    def forward(self, input_ids=None, pixel_values=None, attention_mask=None,
                position_ids=None, return_loss=None, output_attentions=None,
                output_hidden_states=None, return_dict=None):
        ...

Import

from internvl_chat_llava.llava.model.multimodal_encoder.eva_clip.modeling_evaclip import (
    EvaCLIPVisionModel,
    EvaCLIPTextModel,
    EvaCLIPModel,
    EvaCLIPVisionModelWithProjection,
)

I/O Contract

Inputs

Name	Type	Required	Description
pixel_values	torch.FloatTensor [batch, channels, height, width]	Yes (vision)	Input images for the vision encoder
input_ids	torch.LongTensor [batch, seq_len]	Yes (text)	Token IDs for the text encoder
attention_mask	torch.Tensor [batch, seq_len]	No	Attention mask for text input
return_loss	bool	No	Whether to compute contrastive loss (EvaCLIPModel)
output_hidden_states	bool	No	Whether to return all hidden states
output_attentions	bool	No	Whether to return attention weights

Outputs

Name	Type	Description
last_hidden_state	torch.FloatTensor	Sequence of hidden states from the last encoder layer
pooler_output	torch.FloatTensor	Pooled output (CLS token for vision, EOS token for text)
image_embeds	torch.FloatTensor	Projected image embeddings in shared space
text_embeds	torch.FloatTensor	Projected text embeddings in shared space
logits_per_image	torch.FloatTensor	Image-text similarity scores
loss	torch.FloatTensor	Contrastive loss when return_loss=True
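
As a small illustration of how logits_per_image is typically consumed, the sketch below turns similarity scores into per-image probabilities over candidate captions. The score values are made up for the example and do not come from the model.

```python
import torch

# Hypothetical similarity scores for 2 images against 3 candidate captions.
logits_per_image = torch.tensor([[12.0, 3.0, 1.0],
                                 [2.0, 4.0, 11.0]])

# Softmax over the caption dimension gives one distribution per image.
probs = logits_per_image.softmax(dim=-1)
best_caption = probs.argmax(dim=-1)  # index of the best-matching caption per image
```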

Usage Examples

Basic Usage

import torch

from internvl_chat_llava.llava.model.multimodal_encoder.eva_clip.modeling_evaclip import (
    EvaCLIPVisionModel
)
from internvl_chat_llava.llava.model.multimodal_encoder.eva_clip.configuration_evaclip import (
    EvaCLIPVisionConfig
)

# Create vision encoder
config = EvaCLIPVisionConfig(hidden_size=1024, image_size=336, patch_size=14,
                              num_hidden_layers=24, num_attention_heads=16)
vision_model = EvaCLIPVisionModel(config)

# Stand-in batch; real inputs would be preprocessed images of shape
# [batch, channels, height, width]
images = torch.randn(2, 3, 336, 336)

# Extract visual features
outputs = vision_model(pixel_values=images)
visual_features = outputs.last_hidden_state  # [batch, num_patches+1, hidden_size]
pooled_features = outputs.pooler_output       # [batch, hidden_size]

Related Pages

Principle:OpenGVLab_InternVL_CLIP_Vision_Text_Encoding
