Implementation:OpenGVLab InternVL EvaCLIP Model
| Knowledge Sources | |
|---|---|
| Domains | Vision Transformer, Contrastive Learning, CLIP |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
This module provides the complete PyTorch implementation of the EVA-CLIP model, including vision encoder, text encoder, and contrastive learning components, adapted for use as a multimodal encoder in the LLaVA framework.
Description
The implementation contains the full EVA-CLIP architecture with the following key components:
Vision pipeline: EvaCLIPVisionEmbeddings converts images to patch embeddings via a Conv2d layer, prepends a learnable class token, and adds positional embeddings. EvaCLIPAttention implements multi-head self-attention with configurable q/k/v bias flags (unlike standard CLIP, which always uses bias). EvaCLIPMLP provides the feed-forward layers. EvaCLIPEncoderLayer supports both pre-layernorm and post-layernorm configurations via the post_layernorm flag. EvaCLIPEncoder stacks these layers and supports gradient checkpointing. EvaCLIPVisionTransformer and EvaCLIPVisionModel wrap the encoder and pool its output via the CLS token.
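The sequence length produced by the embedding step can be worked out from the image and patch sizes. The following is a minimal sketch, assuming the Conv2d patch projection uses a kernel size and stride both equal to patch_size with no padding (as in standard CLIP); the helper name vision_sequence_length is hypothetical and not part of the module.

```python
def vision_sequence_length(image_size: int, patch_size: int) -> int:
    """Tokens seen by the encoder: one per image patch plus one class token.

    Assumption: patches come from a Conv2d with kernel_size == stride ==
    patch_size and no padding, so each side yields image_size // patch_size
    patches.
    """
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side * patches_per_side
    return num_patches + 1  # +1 for the prepended class token


print(vision_sequence_length(336, 14))  # 577
print(vision_sequence_length(224, 14))  # 257
```

Under these assumptions, a 336-pixel image with 14-pixel patches yields a 24 x 24 grid, so last_hidden_state would have 577 positions per image.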
Text pipeline: EvaCLIPTextEmbeddings handles token and position embeddings. EvaCLIPTextAttention provides text-specific attention (structurally similar to vision attention). EvaCLIPTextTransformer adds a causal attention mask and pools via the EOS token position. EvaCLIPTextModel wraps the text transformer.
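The two text-specific steps named above, causal masking and EOS pooling, can be illustrated without any framework. This is a sketch under stated assumptions, not the module's code: the mask is taken to be additive (0 where attention is allowed, negative infinity for future positions), and the EOS position is located as the index of the largest token id, which is how the Hugging Face CLIP implementation does it. The function names and the token ids below are hypothetical.

```python
import math


def causal_mask(seq_len):
    """Additive mask: 0.0 where position i may attend to j (j <= i), else -inf."""
    return [[0.0 if j <= i else -math.inf for j in range(seq_len)]
            for i in range(seq_len)]


def eos_pool(hidden_states, input_ids):
    """Select, for each sequence, the hidden state at the EOS position.

    Assumption: EOS has the largest id in the vocabulary, so the first
    position holding the maximum id is the EOS token.
    """
    pooled = []
    for states, ids in zip(hidden_states, input_ids):
        eos_index = max(range(len(ids)), key=ids.__getitem__)
        pooled.append(states[eos_index])
    return pooled


# One sequence of four tokens with 1-dimensional hidden states (illustrative).
input_ids = [[49406, 320, 49407, 49407]]   # hypothetical BOS, word, EOS, padding
hidden_states = [[[0.1], [0.2], [0.3], [0.4]]]

print(causal_mask(3)[1])                    # [0.0, 0.0, -inf]
print(eos_pool(hidden_states, input_ids))   # [[0.3]]
```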
Combined model: EvaCLIPModel combines both towers with visual_projection and text_projection linear layers, a learnable logit_scale parameter, and computes contrastive loss via symmetric cross-entropy over the similarity matrix.
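The symmetric cross-entropy over the similarity matrix can be sketched in plain Python. This is an illustrative version, not the module's implementation: it assumes embeddings are L2-normalized before the dot product, that logit_scale is stored in log space and exponentiated before use, and that the two directional losses are averaged, all of which follow the standard CLIP formulation. The function names are hypothetical.

```python
import math


def _normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_contrastive_loss(image_embeds, text_embeds, logit_scale):
    """Symmetric contrastive loss for a batch of matched image-text pairs.

    Assumption: logit_scale is a log-space scalar, so exp(logit_scale)
    multiplies the cosine similarities.
    """
    images = [_normalize(v) for v in image_embeds]
    texts = [_normalize(t) for t in text_embeds]
    scale = math.exp(logit_scale)
    # logits_per_text[i][j]: similarity of text i to image j
    logits_per_text = [[scale * sum(a * b for a, b in zip(t, v)) for v in images]
                       for t in texts]
    # logits_per_image is the transpose
    logits_per_image = [list(col) for col in zip(*logits_per_text)]
    text_loss = _diagonal_cross_entropy(logits_per_text)
    image_loss = _diagonal_cross_entropy(logits_per_image)
    return (text_loss + image_loss) / 2.0


# Perfectly matched pairs give a loss near zero; swapped pairs are penalized.
imgs = [[1.0, 0.0], [0.0, 1.0]]
matched = clip_contrastive_loss(imgs, [[1.0, 0.0], [0.0, 1.0]], math.log(100.0))
swapped = clip_contrastive_loss(imgs, [[0.0, 1.0], [1.0, 0.0]], math.log(100.0))
print(matched < swapped)  # True
```

Averaging the text-to-image and image-to-text terms keeps the objective symmetric, so neither tower is favored during training.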
Projection variants: EvaCLIPTextModelWithProjection and EvaCLIPVisionModelWithProjection provide standalone models with projection heads for extracting embeddings.
All models inherit from EvaCLIPPreTrainedModel which handles weight initialization with architecture-specific strategies.
Usage
Use EVA-CLIP as an alternative vision encoder to standard CLIP or InternViT in the LLaVA framework, particularly when stronger visual features from EVA-02 pretraining are desired.
Code Reference
Source Location
- Repository: OpenGVLab_InternVL
- File: internvl_chat_llava/llava/model/multimodal_encoder/eva_clip/modeling_evaclip.py
- Lines: 1-1428
Signature
class EvaCLIPVisionModel(EvaCLIPPreTrainedModel):
    config_class = EvaCLIPVisionConfig

    def forward(self, pixel_values=None, output_attentions=None,
                output_hidden_states=None, return_dict=None):
        ...

class EvaCLIPTextModel(EvaCLIPPreTrainedModel):
    config_class = EvaCLIPTextConfig

    def forward(self, input_ids=None, attention_mask=None, position_ids=None,
                output_attentions=None, output_hidden_states=None, return_dict=None):
        ...

class EvaCLIPModel(EvaCLIPPreTrainedModel):
    config_class = EvaCLIPConfig

    def forward(self, input_ids=None, pixel_values=None, attention_mask=None,
                position_ids=None, return_loss=None, output_attentions=None,
                output_hidden_states=None, return_dict=None):
        ...
Import
from internvl_chat_llava.llava.model.multimodal_encoder.eva_clip.modeling_evaclip import (
    EvaCLIPVisionModel,
    EvaCLIPTextModel,
    EvaCLIPModel,
    EvaCLIPVisionModelWithProjection,
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| pixel_values | torch.FloatTensor [batch, channels, height, width] | Yes (vision) | Input images for the vision encoder |
| input_ids | torch.LongTensor [batch, seq_len] | Yes (text) | Token IDs for the text encoder |
| attention_mask | torch.Tensor [batch, seq_len] | No | Attention mask for text input |
| return_loss | bool | No | Whether to compute contrastive loss (EvaCLIPModel) |
| output_hidden_states | bool | No | Whether to return all hidden states |
| output_attentions | bool | No | Whether to return attention weights |
Outputs
| Name | Type | Description |
|---|---|---|
| last_hidden_state | torch.FloatTensor | Sequence of hidden states from the last encoder layer |
| pooler_output | torch.FloatTensor | Pooled output (CLS token for vision, EOS token for text) |
| image_embeds | torch.FloatTensor | Projected image embeddings in shared space |
| text_embeds | torch.FloatTensor | Projected text embeddings in shared space |
| logits_per_image | torch.FloatTensor | Image-text similarity scores |
| loss | torch.FloatTensor | Contrastive loss when return_loss=True |
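A common way to consume logits_per_image is to turn each row into a probability distribution over the candidate texts. The snippet below is a framework-free sketch, assuming logits_per_image has shape [num_images, num_texts] as in standard CLIP; the numeric values are made up for illustration and the softmax helper is hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for one image against three candidate captions.
logits_per_image = [[24.5, 19.0, 18.2]]

probs = [softmax(row) for row in logits_per_image]
best_caption = [max(range(len(p)), key=p.__getitem__) for p in probs]
print(best_caption)  # [0]
```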
Usage Examples
Basic Usage
import torch

from internvl_chat_llava.llava.model.multimodal_encoder.eva_clip.modeling_evaclip import (
    EvaCLIPVisionModel,
)
from internvl_chat_llava.llava.model.multimodal_encoder.eva_clip.configuration_evaclip import (
    EvaCLIPVisionConfig,
)

# Create a vision encoder from a config (weights are randomly initialized)
config = EvaCLIPVisionConfig(hidden_size=1024, image_size=336, patch_size=14,
                             num_hidden_layers=24, num_attention_heads=16)
vision_model = EvaCLIPVisionModel(config).eval()

# Placeholder batch of two RGB images; replace with preprocessed image tensors
images = torch.randn(2, 3, 336, 336)

# Extract visual features without tracking gradients
with torch.no_grad():
    outputs = vision_model(pixel_values=images)
visual_features = outputs.last_hidden_state  # [batch, num_patches+1, hidden_size]
pooled_features = outputs.pooler_output      # [batch, hidden_size]