Implementation:OpenGVLab InternVL EvaCLIP Model

From Leeroopedia


Knowledge Sources
Domains: Vision Transformer, Contrastive Learning, CLIP
Last Updated: 2026-02-07 14:00 GMT

Overview

This module provides the complete PyTorch implementation of the EVA-CLIP model, including vision encoder, text encoder, and contrastive learning components, adapted for use as a multimodal encoder in the LLaVA framework.

Description

The implementation contains the full EVA-CLIP architecture with the following key components:

Vision pipeline: EvaCLIPVisionEmbeddings converts images to patch embeddings via a Conv2d layer, prepends a learnable class token, and adds positional embeddings. EvaCLIPAttention implements multi-head self-attention with configurable q/k/v bias flags (unlike standard CLIP, which always uses bias). EvaCLIPMLP provides the feed-forward layers. EvaCLIPEncoderLayer supports both pre-layernorm and post-layernorm configurations via the post_layernorm flag. EvaCLIPEncoder stacks these layers and supports gradient checkpointing. EvaCLIPVisionTransformer and EvaCLIPVisionModel wrap the encoder and pool its output by taking the hidden state of the class (CLS) token.
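
The embedding step described above can be sketched as follows. This is a minimal illustration, not the EvaCLIPVisionEmbeddings source: the class name PatchEmbeddingSketch, its default sizes, and its attribute names are assumptions chosen for the example.

```python
import torch
import torch.nn as nn


class PatchEmbeddingSketch(nn.Module):
    """Illustrative stand-in for the embedding step (hypothetical class)."""

    def __init__(self, hidden_size=64, image_size=28, patch_size=14, num_channels=3):
        super().__init__()
        # Non-overlapping patches: kernel size and stride both equal patch_size.
        self.patch_embedding = nn.Conv2d(num_channels, hidden_size,
                                         kernel_size=patch_size, stride=patch_size)
        # Learnable class token, shared across the batch.
        self.class_embedding = nn.Parameter(torch.randn(hidden_size))
        self.num_patches = (image_size // patch_size) ** 2
        # One position per patch plus one for the class token.
        self.position_embedding = nn.Embedding(self.num_patches + 1, hidden_size)

    def forward(self, pixel_values):
        batch_size = pixel_values.shape[0]
        patches = self.patch_embedding(pixel_values)          # [B, hidden, grid, grid]
        patches = patches.flatten(2).transpose(1, 2)          # [B, num_patches, hidden]
        cls = self.class_embedding.expand(batch_size, 1, -1)  # [B, 1, hidden]
        tokens = torch.cat([cls, patches], dim=1)             # [B, num_patches + 1, hidden]
        positions = torch.arange(self.num_patches + 1, device=tokens.device)
        return tokens + self.position_embedding(positions)


embeddings = PatchEmbeddingSketch()
out = embeddings(torch.randn(2, 3, 28, 28))
print(out.shape)
```

With a 28x28 input and 14x14 patches there are four patches, so the sequence has five tokens once the class token is prepended.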

Text pipeline: EvaCLIPTextEmbeddings handles token and position embeddings. EvaCLIPTextAttention provides text-specific attention (structurally similar to vision attention). EvaCLIPTextTransformer adds a causal attention mask and pools via the EOS token position. EvaCLIPTextModel wraps the text transformer.
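
The causal mask and EOS pooling can be sketched as below. The helper names causal_mask and pool_at_eos are hypothetical, and the pooling relies on an assumption borrowed from CLIP-style tokenizers, namely that the EOS token has the largest id in the vocabulary, so argmax over the ids locates it. The actual EvaCLIPTextTransformer may locate the EOS position differently.

```python
import torch


def causal_mask(seq_len):
    """Additive attention mask: 0 on and below the diagonal, -inf above it."""
    mask = torch.full((seq_len, seq_len), float("-inf"))
    return torch.triu(mask, diagonal=1)


def pool_at_eos(hidden_states, input_ids):
    """Select one hidden vector per sequence at the EOS position.

    Assumption: EOS has the largest token id, so argmax finds its position.
    """
    eos_positions = input_ids.argmax(dim=-1)
    batch_indices = torch.arange(hidden_states.shape[0])
    return hidden_states[batch_indices, eos_positions]


# Made-up token ids in a CLIP-style vocabulary: 49406 = start, 49407 = EOS, 0 = padding.
input_ids = torch.tensor([[49406, 320, 1125, 49407, 0],
                          [49406, 320, 49407, 0, 0]])
hidden_states = torch.randn(2, 5, 8)   # [batch, seq_len, hidden_size]

mask = causal_mask(5)                  # [seq_len, seq_len]
pooled = pool_at_eos(hidden_states, input_ids)  # [batch, hidden_size]
```

The mask is added to the attention scores before the softmax, so each token can attend only to itself and to earlier positions.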

Combined model: EvaCLIPModel combines both towers with visual_projection and text_projection linear layers, a learnable logit_scale parameter, and computes contrastive loss via symmetric cross-entropy over the similarity matrix.
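
A minimal sketch of a symmetric cross-entropy contrastive loss, in the style commonly used for CLIP models, is shown below. The function name clip_contrastive_loss is hypothetical, and the sketch assumes the CLIP convention of storing the scale as a log value that is exponentiated before use. EvaCLIPModel may differ in detail.

```python
import torch
import torch.nn.functional as F


def clip_contrastive_loss(image_embeds, text_embeds, logit_scale):
    # Normalise so that dot products are cosine similarities.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    # Similarity matrix: rows index texts, columns index images.
    logits_per_text = logit_scale * text_embeds @ image_embeds.t()
    logits_per_image = logits_per_text.t()
    # The i-th text is paired with the i-th image, so targets lie on the diagonal.
    targets = torch.arange(logits_per_text.shape[0])
    text_loss = F.cross_entropy(logits_per_text, targets)
    image_loss = F.cross_entropy(logits_per_image, targets)
    return (text_loss + image_loss) / 2


torch.manual_seed(0)
image_embeds = torch.randn(4, 16)         # stand-in for projected image features
text_embeds = torch.randn(4, 16)          # stand-in for projected text features
logit_scale = torch.tensor(2.6592).exp()  # assumed CLIP-style initial value, ln(1/0.07)
loss = clip_contrastive_loss(image_embeds, text_embeds, logit_scale)
```

Averaging the two directions makes the loss symmetric: each image must pick out its paired text and each text must pick out its paired image.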

Projection variants: EvaCLIPTextModelWithProjection and EvaCLIPVisionModelWithProjection provide standalone models with projection heads for extracting embeddings.

All models inherit from EvaCLIPPreTrainedModel which handles weight initialization with architecture-specific strategies.

Usage

Use EVA-CLIP as an alternative vision encoder to standard CLIP or InternViT in the LLaVA framework, particularly when stronger visual features from EVA-02 pretraining are desired.

Code Reference

Source Location

Signature

class EvaCLIPVisionModel(EvaCLIPPreTrainedModel):
    config_class = EvaCLIPVisionConfig
    def forward(self, pixel_values=None, output_attentions=None,
                output_hidden_states=None, return_dict=None):
        ...

class EvaCLIPTextModel(EvaCLIPPreTrainedModel):
    config_class = EvaCLIPTextConfig
    def forward(self, input_ids=None, attention_mask=None, position_ids=None,
                output_attentions=None, output_hidden_states=None, return_dict=None):
        ...

class EvaCLIPModel(EvaCLIPPreTrainedModel):
    config_class = EvaCLIPConfig
    def forward(self, input_ids=None, pixel_values=None, attention_mask=None,
                position_ids=None, return_loss=None, output_attentions=None,
                output_hidden_states=None, return_dict=None):
        ...

Import

from internvl_chat_llava.llava.model.multimodal_encoder.eva_clip.modeling_evaclip import (
    EvaCLIPVisionModel,
    EvaCLIPTextModel,
    EvaCLIPModel,
    EvaCLIPVisionModelWithProjection,
)

I/O Contract

Inputs

Name | Type | Required | Description
pixel_values | torch.FloatTensor [batch, channels, height, width] | Yes (vision) | Input images for the vision encoder
input_ids | torch.LongTensor [batch, seq_len] | Yes (text) | Token IDs for the text encoder
attention_mask | torch.Tensor [batch, seq_len] | No | Attention mask for text input
return_loss | bool | No | Whether to compute contrastive loss (EvaCLIPModel)
output_hidden_states | bool | No | Whether to return all hidden states
output_attentions | bool | No | Whether to return attention weights

Outputs

Name | Type | Description
last_hidden_state | torch.FloatTensor | Sequence of hidden states from the last encoder layer
pooler_output | torch.FloatTensor | Pooled output (CLS token for vision, EOS token for text)
image_embeds | torch.FloatTensor | Projected image embeddings in shared space
text_embeds | torch.FloatTensor | Projected text embeddings in shared space
logits_per_image | torch.FloatTensor | Image-text similarity scores
loss | torch.FloatTensor | Contrastive loss when return_loss=True

Usage Examples

Basic Usage

import torch

from internvl_chat_llava.llava.model.multimodal_encoder.eva_clip.modeling_evaclip import (
    EvaCLIPVisionModel
)
from internvl_chat_llava.llava.model.multimodal_encoder.eva_clip.configuration_evaclip import (
    EvaCLIPVisionConfig
)

# Create vision encoder (randomly initialised here; load pretrained weights for real use)
config = EvaCLIPVisionConfig(hidden_size=1024, image_size=336, patch_size=14,
                              num_hidden_layers=24, num_attention_heads=16)
vision_model = EvaCLIPVisionModel(config).eval()

# Dummy batch standing in for preprocessed images: [batch, channels, height, width]
images = torch.randn(2, 3, 336, 336)

# Extract visual features without tracking gradients
with torch.no_grad():
    outputs = vision_model(pixel_values=images)
visual_features = outputs.last_hidden_state  # [batch, num_patches+1, hidden_size]
pooled_features = outputs.pooler_output       # [batch, hidden_size]
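
The num_patches+1 dimension in the shape comment follows directly from the configuration values. The helper below is a hypothetical illustration of that arithmetic, assuming square images, square non-overlapping patches, and a single class token. It is not part of the module.

```python
def vision_sequence_length(image_size, patch_size):
    """Tokens seen by the vision encoder: one per patch plus the class token."""
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2 + 1


seq_len = vision_sequence_length(image_size=336, patch_size=14)
print(seq_len)  # 24 * 24 patches + 1 class token = 577
```

For the configuration used in the example, last_hidden_state would therefore have shape [batch, 577, 1024].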
