Implementation:OpenGVLab InternVL InternVL 14B Model

From Leeroopedia


Knowledge Sources
Domains: Multimodal, Vision-Language Model, Cross-Attention
Last Updated: 2026-02-07 14:00 GMT

Overview

This module implements the InternVL-14B composite vision-language model that combines InternViT with a QLLaMA query decoder using learnable query tokens and cross-attention.

Description

The module provides the complete InternVL-14B architecture through several classes:

InternVLPreTrainedModel serves as the base class, defining weight initialization for Conv2d, Embedding, Linear, InternVisionEmbeddings, and LayerNorm modules, plus gradient checkpointing support for both InternVisionModel and InternVisionEncoder.

CrossAttention implements cross-attention between query tokens and key-value pairs (vision features), with optional QKV bias and configurable output dimension. AttentiveBlock wraps CrossAttention with pre-normalization of queries, keys, and values. AttentionPoolingBlock extends AttentiveBlock to pool a sequence by using the mean as the query, compressing the sequence to a single vector.
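The pooling idea behind AttentionPoolingBlock can be illustrated with a minimal, dependency-free sketch. It assumes a single attention head and omits the learned Q/K/V projections and the pre-normalization that the real blocks apply; the function names `softmax` and `attention_pool` are hypothetical and not part of the module.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(tokens):
    """Pool a sequence of vectors into a single vector.

    The query is the mean of the tokens; keys and values are the tokens
    themselves. Learned projections and normalization are left out.
    """
    dim = len(tokens[0])
    # Query: element-wise mean over the sequence.
    query = [sum(tok[i] for tok in tokens) / len(tokens) for i in range(dim)]
    # Scaled dot-product score between the single query and every key.
    scores = [
        sum(q * k for q, k in zip(query, tok)) / math.sqrt(dim)
        for tok in tokens
    ]
    weights = softmax(scores)
    # The attention-weighted sum of the values is the pooled vector.
    pooled = [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(dim)]
    return pooled, weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attention_pool(tokens)
```

Because the query is the sequence mean, tokens that align with the average direction receive larger weights; here the third token gets the largest share.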

InternVLModel is the main composite model that assembles:

  • A frozen InternViT vision encoder for visual feature extraction
  • A frozen QLLaMA language model as the query decoder
  • Learnable query tokens (num_query_token x text_hidden_size) that are concatenated with text embeddings
  • Optional LoRA adapters on both the backbone (targeting attn.qkv, attn.proj, mlp.fc1, mlp.fc2) and QLLaMA (targeting self_attn projections and mlp projections) via wrap_backbone_lora and wrap_qllama_lora
  • Optional position embedding resizing for different image sizes via force_image_size
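To make the LoRA option concrete, the sketch below shows the standard LoRA update on a single frozen linear layer in plain Python. It assumes the conventional formulation in which the low-rank update is scaled by lora_alpha / r; whether the module's wrappers follow exactly this convention is an assumption, and `matvec`, `lora_scaling`, and `lora_linear` are hypothetical helper names.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def lora_scaling(r, lora_alpha):
    """Scaling factor applied to the low-rank update (standard LoRA convention)."""
    return lora_alpha / r

def lora_linear(x, W, A, B, r, lora_alpha):
    """Compute W x + (lora_alpha / r) * B (A x).

    W is the frozen weight (out x in), A is (r x in), B is (out x r).
    Only A and B would be trained.
    """
    base = matvec(W, x)               # frozen path
    update = matvec(B, matvec(A, x))  # low-rank adapter path
    scale = lora_scaling(r, lora_alpha)
    return [b + scale * u for b, u in zip(base, update)]

W = [[1.0, 0.0], [0.0, 1.0]]   # toy frozen weight
A = [[1.0, 1.0]]               # rank-1 down-projection
B_zero = [[0.0], [0.0]]        # B is conventionally zero at initialization
B_one = [[1.0], [0.0]]         # a hypothetical trained B
x = [3.0, 4.0]

y_init = lora_linear(x, W, A, B_zero, r=1, lora_alpha=2)
y_trained = lora_linear(x, W, A, B_one, r=1, lora_alpha=2)
default_scale = lora_scaling(r=128, lora_alpha=256)
```

Under this convention the default arguments r=128 and lora_alpha=256 give a scaling factor of 2, and a zero-initialized B leaves the frozen layer's output unchanged at the start of fine-tuning.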

The model exposes four entry points:

  • forward: Extracts vision features, passes query tokens through QLLaMA with cross-attention to vision features
  • get_image_features: Returns both raw backbone embeddings and query-processed embeddings
  • get_text_features: Returns text-only hidden states from QLLaMA
  • generate: Full generation pipeline concatenating query tokens with text embeddings
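The concatenation step in the generate path can be sketched as follows, using nested lists in place of tensors. It assumes the query tokens are placed before the text embeddings and that the attention mask is extended with ones for the query positions; the ordering, the toy sizes, and the helper name `build_decoder_inputs` are illustrative assumptions rather than details confirmed by this page.

```python
def build_decoder_inputs(query_tokens, text_embeds, text_mask):
    """Prepend query tokens to the text embeddings for one sample.

    query_tokens: list of num_query_token vectors (each of size hidden)
    text_embeds:  list of seq_len vectors (each of size hidden)
    text_mask:    list of seq_len ints, 1 for real tokens and 0 for padding
    """
    # Concatenate along the sequence axis.
    inputs = list(query_tokens) + list(text_embeds)
    # Query positions are always attended to, so their mask entries are 1.
    mask = [1] * len(query_tokens) + list(text_mask)
    return inputs, mask

hidden = 3
num_query_token = 4  # toy value, not the model's configured value
query_tokens = [[0.0] * hidden for _ in range(num_query_token)]
text_embeds = [[1.0] * hidden, [2.0] * hidden]
text_mask = [1, 0]   # second text position is padding

inputs, mask = build_decoder_inputs(query_tokens, text_embeds, text_mask)
```

The resulting sequence has num_query_token + seq_len positions, which is the length the query decoder would see in this sketch.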

InternVL_C and InternVL_G are specialized variants for the contrastive and generative modes, respectively. Both compute image-text cosine similarity scores scaled by a learnable logit_scale.
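The similarity computation can be sketched in plain Python. This follows the common CLIP-style recipe, in which embeddings are L2-normalized and the learnable logit_scale is exponentiated before use; that the module applies exactly this recipe is an assumption, and `l2_normalize` and `similarity_logits` are hypothetical names.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def similarity_logits(image_embeds, text_embeds, logit_scale):
    """Return (logits_per_image, logits_per_text) as nested lists.

    logits_per_image[i][j] is the scaled cosine similarity between
    image i and text j; logits_per_text is its transpose.
    """
    images = [l2_normalize(v) for v in image_embeds]
    texts = [l2_normalize(v) for v in text_embeds]
    scale = math.exp(logit_scale)  # assumed: scale is stored in log space
    logits_per_image = [
        [scale * sum(a * b for a, b in zip(img, txt)) for txt in texts]
        for img in images
    ]
    logits_per_text = [list(column) for column in zip(*logits_per_image)]
    return logits_per_image, logits_per_text

image_embeds = [[1.0, 0.0], [0.0, 2.0]]
text_embeds = [[3.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
logits_per_image, logits_per_text = similarity_logits(
    image_embeds, text_embeds, logit_scale=0.0
)
```

With two images and three texts, logits_per_image has shape 2 x 3 and logits_per_text has shape 3 x 2, matching the two similarity outputs listed in the I/O contract.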

Usage

Use this model when you need the InternVL-14B architecture. Rather than a simple MLP projection, it bridges vision and language with query-based cross-attention that extracts and compresses visual features, which is intended to be a more powerful connection between the two modalities.

Code Reference

Source Location

Signature

class InternVLModel(InternVLPreTrainedModel):
    config_class = InternVLConfig
    main_input_name = 'pixel_values'

    def __init__(self, config: InternVLConfig):
        ...

    def wrap_backbone_lora(self, r=128, lora_alpha=256, lora_dropout=0.05):
        ...

    def wrap_qllama_lora(self, r=128, lora_alpha=256, lora_dropout=0.05):
        ...

    def generate(self, pixel_values, input_ids, attention_mask,
                 generation_config=None, **generate_kwargs):
        ...

    def forward(self, pixel_values, output_hidden_states=None,
                return_dict=None):
        ...

class InternVL_C(InternVLModel):
    def forward(self, image, text):
        ...

class InternVL_G(InternVLModel):
    def forward(self, image, text):
        ...

Import

from internvl_chat_llava.llava.model.multimodal_encoder.internvl_14b.modeling_internvl import (
    InternVLModel,
    InternVL_C,
    InternVL_G,
)

I/O Contract

Inputs

Name | Type | Required | Description
pixel_values | torch.FloatTensor [batch, 3, height, width] | Yes | Input images for the vision encoder
input_ids | torch.LongTensor [batch, seq_len] | Yes (generate) | Text token IDs
attention_mask | torch.LongTensor [batch, seq_len] | Yes (generate) | Attention mask for text tokens
generation_config | GenerationConfig | No | HuggingFace generation configuration
output_hidden_states | bool | No | Whether to return all hidden states
return_dict | bool | No | Whether to return ModelOutput

Outputs

Name | Type | Description
vision_outputs | BaseModelOutputWithPooling | Vision encoder outputs with last_hidden_state and pooler_output
outputs | torch.Tensor | Query-processed hidden states from QLLaMA
logits_per_image | torch.FloatTensor | Image-text similarity scores (InternVL_C/G)
logits_per_text | torch.FloatTensor | Text-image similarity scores (InternVL_C/G)

Usage Examples

Basic Usage

from internvl_chat_llava.llava.model.multimodal_encoder.internvl_14b.modeling_internvl import (
    InternVLModel
)

# Load pretrained model
model = InternVLModel.from_pretrained("OpenGVLab/InternVL-14B")

# Apply LoRA for fine-tuning
model.wrap_backbone_lora(r=128, lora_alpha=256)
model.wrap_qllama_lora(r=128, lora_alpha=256)

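# Forward pass (returns vision features + query-processed features).
# `images` is assumed to be a preprocessed float tensor of shape
# [batch, 3, height, width], prepared elsewhere.
vision_outputs, query_outputs = model(pixel_values=images)

# Generation. `input_ids` and `attention_mask` are assumed to come from the
# matching tokenizer, and `gen_config` is assumed to be a GenerationConfig.
outputs = model.generate(
    pixel_values=images,
    input_ids=input_ids,
    attention_mask=attention_mask,
    generation_config=gen_config,
)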
