Implementation:OpenGVLab InternVL InternVL 14B Model

Knowledge Sources	OpenGVLab_InternVL
Domains	Multimodal, Vision-Language Model, Cross-Attention
Last Updated	2026-02-07 14:00 GMT

Overview

This module implements the InternVL-14B composite vision-language model, which combines an InternViT vision encoder with a QLLaMA query decoder through learnable query tokens and cross-attention.

Description

The module provides the complete InternVL-14B architecture through several classes:

InternVLPreTrainedModel serves as the base class, defining weight initialization for Conv2d, Embedding, Linear, InternVisionEmbeddings, and LayerNorm modules, plus gradient checkpointing support for both InternVisionModel and InternVisionEncoder.

CrossAttention implements cross-attention between query tokens and key-value pairs (vision features), with optional QKV bias and configurable output dimension. AttentiveBlock wraps CrossAttention with pre-normalization of queries, keys, and values. AttentionPoolingBlock extends AttentiveBlock to pool a sequence by using the mean as the query, compressing the sequence to a single vector.
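The pooling idea described above can be sketched in a few lines of PyTorch. This is a minimal, single-head illustration and not the repository's AttentionPoolingBlock: the class name MeanQueryAttentionPool, the layer names, and the toy dimensions are all assumptions made for the example.

```python
import torch
import torch.nn as nn


class MeanQueryAttentionPool(nn.Module):
    """Illustrative single-head attention pooling (hypothetical class, not the repo's)."""

    def __init__(self, dim: int, out_dim: int, qkv_bias: bool = False):
        super().__init__()
        # Separate pre-normalization for queries, keys, and values.
        self.norm_q = nn.LayerNorm(dim)
        self.norm_k = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)
        self.q = nn.Linear(dim, dim, bias=qkv_bias)
        self.k = nn.Linear(dim, dim, bias=qkv_bias)
        self.v = nn.Linear(dim, dim, bias=qkv_bias)
        self.proj = nn.Linear(dim, out_dim)  # configurable output dimension
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, seq_len, dim]
        query = x.mean(dim=1, keepdim=True)             # mean token as the single query
        q = self.q(self.norm_q(query))                  # [batch, 1, dim]
        k = self.k(self.norm_k(x))                      # [batch, seq_len, dim]
        v = self.v(self.norm_v(x))                      # [batch, seq_len, dim]
        attn = (q @ k.transpose(-2, -1)) * self.scale   # [batch, 1, seq_len]
        attn = attn.softmax(dim=-1)
        pooled = attn @ v                               # [batch, 1, dim]
        return self.proj(pooled).squeeze(1)             # [batch, out_dim]


pool = MeanQueryAttentionPool(dim=32, out_dim=16)
features = torch.randn(2, 10, 32)
pooled = pool(features)  # one vector per sequence
```

Because the query has length one, the attention weights form a single distribution over the sequence, so the output is a learned weighted average rather than a plain mean.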

InternVLModel is the main composite model that assembles:

A frozen InternViT vision encoder for visual feature extraction
A frozen QLLaMA language model as the query decoder
Learnable query tokens (num_query_token x text_hidden_size) that are concatenated with text embeddings
Optional LoRA adapters on both the backbone (targeting attn.qkv, attn.proj, mlp.fc1, mlp.fc2) and QLLaMA (targeting self_attn projections and mlp projections) via wrap_backbone_lora and wrap_qllama_lora
Optional position embedding resizing for different image sizes via force_image_size
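To make the backbone LoRA targets above concrete, the following sketch shows which modules a suffix-based match would select. It assumes targets are resolved by comparing dotted module-name suffixes, which is how list-style targets are commonly handled in PEFT-style libraries; the helper select_lora_targets and the module names are hypothetical and are not taken from the repository.

```python
# Backbone targets as listed in the description above.
BACKBONE_TARGETS = ["attn.qkv", "attn.proj", "mlp.fc1", "mlp.fc2"]


def select_lora_targets(module_names, targets):
    """Return names that equal a target or end in '.<target>' (assumed matching rule)."""
    return [
        name
        for name in module_names
        if any(name == t or name.endswith("." + t) for t in targets)
    ]


# Hypothetical module names, for illustration only.
names = [
    "encoder.layers.0.attn.qkv",
    "encoder.layers.0.attn.proj",
    "encoder.layers.0.norm1",
    "encoder.layers.0.mlp.fc1",
    "encoder.layers.0.mlp.fc2",
    "embeddings.patch_embedding",
]
selected = select_lora_targets(names, BACKBONE_TARGETS)
```

Under this rule only the attention and MLP projections receive adapters; normalization layers and the patch embedding are left untouched.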

The model exposes four entry points:

forward: Extracts vision features, passes query tokens through QLLaMA with cross-attention to vision features
get_image_features: Returns both raw backbone embeddings and query-processed embeddings
get_text_features: Returns text-only hidden states from QLLaMA
generate: Full generation pipeline concatenating query tokens with text embeddings
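The concatenation step in the generate path can be sketched as follows. This is an illustration under stated assumptions, not the repository's code: the toy sizes, the choice to place query tokens before the text embeddings, and the all-ones mask for the query positions are assumptions.

```python
import torch

# Toy sizes, chosen only for illustration.
batch_size, seq_len = 2, 5
num_query_token, text_hidden_size, vocab_size = 4, 16, 100

# Learnable query tokens, shared across the batch.
query_tokens = torch.nn.Parameter(torch.zeros(1, num_query_token, text_hidden_size))
embed = torch.nn.Embedding(vocab_size, text_hidden_size)

input_ids = torch.randint(0, vocab_size, (batch_size, seq_len))
attention_mask = torch.ones(batch_size, seq_len, dtype=torch.long)

# Embed the text and prepend one copy of the query tokens per sample.
text_embeds = embed(input_ids)                              # [batch, seq_len, hidden]
queries = query_tokens.expand(batch_size, -1, -1)           # [batch, num_query, hidden]
input_embeds = torch.cat([queries, text_embeds], dim=1)     # [batch, num_query + seq_len, hidden]

# Extend the attention mask so the query positions are always attended to.
query_mask = torch.ones(batch_size, num_query_token, dtype=torch.long)
full_mask = torch.cat([query_mask, attention_mask], dim=1)  # [batch, num_query + seq_len]
```

The mask must grow by the same number of positions as the embeddings; otherwise the decoder would see mismatched sequence lengths.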

InternVL_C and InternVL_G are specialized variants for contrastive and generative modes, respectively. Both compute cosine similarity scores between image and text embeddings, scaled by a learnable logit_scale.
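A scaled cosine similarity of this kind can be sketched as below. The sketch assumes a CLIP-style convention in which logit_scale is stored in log space and exponentiated before use; the initial value, the helper name similarity_logits, and the embedding sizes are assumptions, not details confirmed by this page.

```python
import math

import torch
import torch.nn.functional as F

# Assumed CLIP-style initialization: the parameter holds log(1 / temperature).
logit_scale = torch.nn.Parameter(torch.tensor(math.log(1 / 0.07)))


def similarity_logits(image_embeds, text_embeds, logit_scale):
    """Cosine similarity between every image and every text, scaled by exp(logit_scale)."""
    image_embeds = F.normalize(image_embeds, dim=-1)  # unit-length rows
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits_per_image = logit_scale.exp() * image_embeds @ text_embeds.t()
    logits_per_text = logits_per_image.t()
    return logits_per_image, logits_per_text


image_embeds = torch.randn(3, 8)  # 3 images, toy embedding size
text_embeds = torch.randn(5, 8)   # 5 texts
logits_per_image, logits_per_text = similarity_logits(image_embeds, text_embeds, logit_scale)
```

Since both sets of embeddings are normalized, each logit is bounded in magnitude by exp(logit_scale), and logits_per_text is simply the transpose of logits_per_image.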

Usage

Use this model when you need the InternVL-14B architecture. Rather than bridging vision and language with a simple MLP projection, it uses query-based cross-attention to extract and compress visual features.

Code Reference

Source Location

Repository: OpenGVLab_InternVL
File: internvl_chat_llava/llava/model/multimodal_encoder/internvl_14b/modeling_internvl.py
Lines: 1-543

Signature

class InternVLModel(InternVLPreTrainedModel):
    config_class = InternVLConfig
    main_input_name = 'pixel_values'

    def __init__(self, config: InternVLConfig):
        ...

    def wrap_backbone_lora(self, r=128, lora_alpha=256, lora_dropout=0.05):
        ...

    def wrap_qllama_lora(self, r=128, lora_alpha=256, lora_dropout=0.05):
        ...

    def generate(self, pixel_values, input_ids, attention_mask,
                 generation_config=None, **generate_kwargs):
        ...

    def forward(self, pixel_values, output_hidden_states=None,
                return_dict=None):
        ...

class InternVL_C(InternVLModel):
    def forward(self, image, text):
        ...

class InternVL_G(InternVLModel):
    def forward(self, image, text):
        ...

Import

from internvl_chat_llava.llava.model.multimodal_encoder.internvl_14b.modeling_internvl import (
    InternVLModel,
    InternVL_C,
    InternVL_G,
)

I/O Contract

Inputs

Name	Type	Required	Description
pixel_values	torch.FloatTensor [batch, 3, height, width]	Yes	Input images for the vision encoder
input_ids	torch.LongTensor [batch, seq_len]	Yes (generate)	Text token IDs
attention_mask	torch.LongTensor [batch, seq_len]	Yes (generate)	Attention mask for text tokens
generation_config	GenerationConfig	No	HuggingFace generation configuration
output_hidden_states	bool	No	Whether to return all hidden states
return_dict	bool	No	Whether to return ModelOutput

Outputs

Name	Type	Description
vision_outputs	BaseModelOutputWithPooling	Vision encoder outputs with last_hidden_state and pooler_output
outputs	torch.Tensor	Query-processed hidden states from QLLaMA
logits_per_image	torch.FloatTensor	Image-text similarity scores (InternVL_C/G)
logits_per_text	torch.FloatTensor	Text-image similarity scores (InternVL_C/G)

Usage Examples

Basic Usage

from internvl_chat_llava.llava.model.multimodal_encoder.internvl_14b.modeling_internvl import (
    InternVLModel
)

# Load pretrained model
model = InternVLModel.from_pretrained("OpenGVLab/InternVL-14B")

# Apply LoRA for fine-tuning
model.wrap_backbone_lora(r=128, lora_alpha=256)
model.wrap_qllama_lora(r=128, lora_alpha=256)

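# Forward pass (returns vision features + query-processed features).
# `images` is assumed to be a preprocessed float tensor of shape [batch, 3, height, width].
vision_outputs, query_outputs = model(pixel_values=images)

# Generation (`input_ids`, `attention_mask`, and `gen_config` are assumed to be prepared by the caller)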
outputs = model.generate(
    pixel_values=images,
    input_ids=input_ids,
    attention_mask=attention_mask,
    generation_config=gen_config,
)
