Implementation:OpenGVLab InternVL InternVL 14B Model
| Knowledge Sources | |
|---|---|
| Domains | Multimodal, Vision-Language Model, Cross-Attention |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
This module implements the InternVL-14B composite vision-language model that combines InternViT with a QLLaMA query decoder using learnable query tokens and cross-attention.
Description
The module provides the complete InternVL-14B architecture through several classes:
InternVLPreTrainedModel serves as the base class, defining weight initialization for Conv2d, Embedding, Linear, InternVisionEmbeddings, and LayerNorm modules, plus gradient checkpointing support for both InternVisionModel and InternVisionEncoder.
CrossAttention implements cross-attention between query tokens and key-value pairs (vision features), with optional QKV bias and configurable output dimension. AttentiveBlock wraps CrossAttention with pre-normalization of queries, keys, and values. AttentionPoolingBlock extends AttentiveBlock to pool a sequence by using the mean as the query, compressing the sequence to a single vector.
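The mean-as-query pooling idea can be illustrated with a minimal sketch in plain Python. It is not the module's code: it assumes single-head scaled dot-product attention and omits the learned Q/K/V projections, the pre-normalization, and the output projection. The names `softmax` and `mean_query_attention_pool` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_query_attention_pool(tokens):
    """Pool a sequence of equal-length vectors into a single vector.

    The query is the mean of the tokens; keys and values are the tokens
    themselves. Learned projections and normalization are omitted.
    """
    dim = len(tokens[0])
    query = [sum(tok[i] for tok in tokens) / len(tokens) for i in range(dim)]
    # Scaled dot-product score between the single query and every key.
    scores = [
        sum(q * k for q, k in zip(query, tok)) / math.sqrt(dim) for tok in tokens
    ]
    weights = softmax(scores)
    # The weighted sum of the values is the pooled vector.
    return [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(dim)]


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled = mean_query_attention_pool(tokens)
```

Because there is only one query, the output is one vector regardless of sequence length, which is what lets the block compress a sequence.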
InternVLModel is the main composite model that assembles:
- A frozen InternViT vision encoder for visual feature extraction
- A frozen QLLaMA language model as the query decoder
- Learnable query tokens (num_query_token x text_hidden_size) that are concatenated with text embeddings
- Optional LoRA adapters on both the backbone (targeting attn.qkv, attn.proj, mlp.fc1, mlp.fc2) and QLLaMA (targeting self_attn projections and mlp projections) via wrap_backbone_lora and wrap_qllama_lora
- Optional position embedding resizing for different image sizes via force_image_size
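To give a sense of what the LoRA settings mean, here is a small arithmetic sketch. It assumes the standard LoRA formulation, in which a rank-r update B·A is scaled by lora_alpha / r; whether wrap_backbone_lora and wrap_qllama_lora use exactly this scaling is not confirmed by this page. The layer width below is a hypothetical value chosen only for illustration.

```python
def lora_added_params(d_in, d_out, r):
    """Trainable parameters added by one LoRA adapter on a d_in -> d_out linear layer."""
    # A has shape (r, d_in) and B has shape (d_out, r).
    return r * (d_in + d_out)


def lora_scaling(r, lora_alpha):
    """Factor applied to the low-rank update in standard LoRA."""
    return lora_alpha / r


# Hypothetical layer width, not taken from the InternVL configuration.
d_in, d_out = 4096, 4096
added = lora_added_params(d_in, d_out, r=128)
full = d_in * d_out
scale = lora_scaling(r=128, lora_alpha=256)
fraction = added / full
```

Under these assumptions, r=128 with lora_alpha=256 gives a scaling factor of 2.0, and the adapter trains a small fraction of the weights of the layer it wraps.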
The model exposes four entry points:
- forward: Extracts vision features, passes query tokens through QLLaMA with cross-attention to vision features
- get_image_features: Returns both raw backbone embeddings and query-processed embeddings
- get_text_features: Returns text-only hidden states from QLLaMA
- generate: Full generation pipeline concatenating query tokens with text embeddings
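The concatenation step in generate can be sketched in plain Python with nested lists standing in for tensors. This is an illustration under assumptions, not the module's code: the query tokens are assumed to be placed before the text embeddings, and the attention mask is assumed to be extended with ones for the query positions. The helper name `prepend_query_tokens` is hypothetical.

```python
def prepend_query_tokens(query_tokens, text_embeds, attention_mask):
    """Prepend shared query tokens to each sample's text embeddings.

    query_tokens:   list of num_query_token vectors, shared across the batch
    text_embeds:    per-sample list of seq_len vectors
    attention_mask: per-sample list of seq_len ints (1 = attend, 0 = padding)
    """
    num_query = len(query_tokens)
    # Copy the query vectors for every sample, then append that sample's text.
    embeds = [[list(v) for v in query_tokens] + sample for sample in text_embeds]
    # Query positions are never padding, so their mask entries are all 1.
    mask = [[1] * num_query + list(row) for row in attention_mask]
    return embeds, mask


query_tokens = [[0.1, 0.2], [0.3, 0.4]]              # num_query_token = 2
text_embeds = [[[1.0, 1.0], [2.0, 2.0], [0.0, 0.0]]]  # batch = 1, seq_len = 3
attention_mask = [[1, 1, 0]]                          # last position is padding
embeds, mask = prepend_query_tokens(query_tokens, text_embeds, attention_mask)
```

The resulting sequence length is num_query_token + seq_len, and the mask keeps the original padding positions unchanged.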
InternVL_C and InternVL_G are specialized variants for the contrastive and generative modes, respectively. Both compute image-text cosine similarity scores scaled by a learnable logit_scale.
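A minimal sketch of scaled cosine similarity, in plain Python, shows how logits_per_image and logits_per_text relate. It assumes the CLIP convention in which logit_scale is stored in log space and exponentiated before use; that detail is an assumption here, as are the helper names `l2_normalize` and `contrastive_logits`.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def contrastive_logits(image_embeds, text_embeds, logit_scale):
    """Return (logits_per_image, logits_per_text) as nested lists."""
    images = [l2_normalize(v) for v in image_embeds]
    texts = [l2_normalize(v) for v in text_embeds]
    scale = math.exp(logit_scale)  # assumption: CLIP-style log-space parameter
    # Row i, column j: scaled cosine similarity of image i and text j.
    logits_per_image = [
        [scale * sum(a * b for a, b in zip(img, txt)) for txt in texts]
        for img in images
    ]
    # The text-to-image view is the transpose of the same matrix.
    logits_per_text = [list(col) for col in zip(*logits_per_image)]
    return logits_per_image, logits_per_text


image_embeds = [[1.0, 0.0], [0.0, 2.0]]
text_embeds = [[3.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
logits_per_image, logits_per_text = contrastive_logits(
    image_embeds, text_embeds, logit_scale=0.0
)
```

Normalizing both sides first makes each dot product a cosine similarity, so only the scale controls how sharp the resulting scores are.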
Usage
Use this model when you need the InternVL-14B architecture. Instead of projecting vision features through a simple MLP, it bridges vision and language with query-based cross-attention that extracts and compresses the visual features.
Code Reference
Source Location
- Repository: OpenGVLab_InternVL
- File: internvl_chat_llava/llava/model/multimodal_encoder/internvl_14b/modeling_internvl.py
- Lines: 1-543
Signature
class InternVLModel(InternVLPreTrainedModel):
    config_class = InternVLConfig
    main_input_name = 'pixel_values'

    def __init__(self, config: InternVLConfig):
        ...

    def wrap_backbone_lora(self, r=128, lora_alpha=256, lora_dropout=0.05):
        ...

    def wrap_qllama_lora(self, r=128, lora_alpha=256, lora_dropout=0.05):
        ...

    def generate(self, pixel_values, input_ids, attention_mask,
                 generation_config=None, **generate_kwargs):
        ...

    def forward(self, pixel_values, output_hidden_states=None,
                return_dict=None):
        ...


class InternVL_C(InternVLModel):
    def forward(self, image, text):
        ...


class InternVL_G(InternVLModel):
    def forward(self, image, text):
        ...
Import
from internvl_chat_llava.llava.model.multimodal_encoder.internvl_14b.modeling_internvl import (
    InternVLModel,
    InternVL_C,
    InternVL_G,
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| pixel_values | torch.FloatTensor [batch, 3, height, width] | Yes | Input images for the vision encoder |
| input_ids | torch.LongTensor [batch, seq_len] | Yes (generate) | Text token IDs |
| attention_mask | torch.LongTensor [batch, seq_len] | Yes (generate) | Attention mask for text tokens |
| generation_config | GenerationConfig | No | HuggingFace generation configuration |
| output_hidden_states | bool | No | Whether to return all hidden states |
| return_dict | bool | No | Whether to return ModelOutput |
Outputs
| Name | Type | Description |
|---|---|---|
| vision_outputs | BaseModelOutputWithPooling | Vision encoder outputs with last_hidden_state and pooler_output |
| outputs | torch.Tensor | Query-processed hidden states from QLLaMA |
| logits_per_image | torch.FloatTensor | Image-text similarity scores (InternVL_C/G) |
| logits_per_text | torch.FloatTensor | Text-image similarity scores (InternVL_C/G) |
Usage Examples
Basic Usage
from internvl_chat_llava.llava.model.multimodal_encoder.internvl_14b.modeling_internvl import (
    InternVLModel,
)

# Load pretrained model
model = InternVLModel.from_pretrained("OpenGVLab/InternVL-14B")

# Apply LoRA for fine-tuning
model.wrap_backbone_lora(r=128, lora_alpha=256)
model.wrap_qllama_lora(r=128, lora_alpha=256)

# The calls below assume these are already prepared:
#   images:         torch.FloatTensor [batch, 3, height, width]
#   input_ids:      torch.LongTensor [batch, seq_len]
#   attention_mask: torch.LongTensor [batch, seq_len]
#   gen_config:     GenerationConfig

# Forward pass (returns vision features + query-processed features)
vision_outputs, query_outputs = model(pixel_values=images)

# Generation
outputs = model.generate(
    pixel_values=images,
    input_ids=input_ids,
    attention_mask=attention_mask,
    generation_config=gen_config,
)