Implementation:OpenGVLab InternVL InternLM2 Model

Knowledge Sources	OpenGVLab_InternVL
Domains	Language Model, Transformer Architecture, Causal LM
Last Updated	2026-02-07 14:00 GMT

Overview

Full PyTorch implementation of the InternLM2 transformer language model, providing the decoder-only causal LM backbone for InternVL multimodal chat, with support for eager and Flash Attention, Grouped Query Attention, RoPE scaling variants, and interactive chat/streaming interfaces.

Description

This module implements the complete InternLM2 decoder-only transformer architecture adapted from LLaMA with several key modifications:

Fused QKV Projection: The attention layer uses a single wqkv linear projection that packs query, key, and value tensors, enabling Grouped Query Attention (GQA) where the number of key-value heads is configurable independently of query heads.

Normalization: Uses InternLM2RMSNorm (equivalent to T5LayerNorm) with optional Apex FusedRMSNorm acceleration when available.

Positional Encoding: Implements Rotary Position Embeddings (RoPE) with three variants: base (InternLM2RotaryEmbedding), linear scaling (InternLM2LinearScalingRotaryEmbedding), and Dynamic NTK scaling (InternLM2DynamicNTKScalingRotaryEmbedding) for extended context length support.

MLP: A gated SiLU feedforward network with three weight matrices (w1, w2, w3) following the SwiGLU pattern.

Attention Backends: Provides both eager attention (InternLM2Attention) with standard softmax computation and Flash Attention 2 (InternLM2FlashAttention2) with automatic padding/unpadding for variable-length sequences.

Model Heads: Three model variants are provided: InternLM2Model (base transformer), InternLM2ForCausalLM (with language model head, chat, and stream_chat methods using the im_start/im_end template), and InternLM2ForSequenceClassification.

Usage

Use this module as the language model backbone within the InternVL multimodal architecture. The InternLM2ForCausalLM class is typically instantiated via from_pretrained and processes the combined visual and textual token embeddings to generate multimodal responses. The chat and stream_chat methods provide convenient interfaces for interactive use.

Code Reference

Source Location

Repository: OpenGVLab_InternVL
File: internvl_chat/internvl/model/internlm2/modeling_internlm2.py
Lines: 1-1429

Signature

class InternLM2Model(InternLM2PreTrainedModel):
    def forward(self, input_ids=None, attention_mask=None, position_ids=None,
                past_key_values=None, inputs_embeds=None, use_cache=None,
                output_attentions=None, output_hidden_states=None,
                return_dict=None) -> Union[Tuple, BaseModelOutputWithPast]:
        ...

class InternLM2ForCausalLM(InternLM2PreTrainedModel):
    def forward(self, input_ids=None, attention_mask=None, position_ids=None,
                past_key_values=None, inputs_embeds=None, labels=None,
                use_cache=None, output_attentions=None,
                output_hidden_states=None, return_dict=None
                ) -> Union[Tuple, CausalLMOutputWithPast]:
        ...

    def chat(self, tokenizer, query, history=[], streamer=None,
             max_new_tokens=1024, do_sample=True, temperature=0.8, top_p=0.8,
             meta_instruction='...', **kwargs):
        ...

    def stream_chat(self, tokenizer, query, history=[], max_new_tokens=1024,
                    do_sample=True, temperature=0.8, top_p=0.8, **kwargs):
        ...

class InternLM2ForSequenceClassification(InternLM2PreTrainedModel):
    def forward(self, input_ids=None, ..., labels=None, ...
                ) -> Union[Tuple, SequenceClassifierOutputWithPast]:
        ...

Import

from internvl.model.internlm2.modeling_internlm2 import InternLM2ForCausalLM

I/O Contract

Inputs

Name	Type	Required	Description
input_ids	torch.LongTensor	No	Token indices of shape (batch_size, seq_length); mutually exclusive with inputs_embeds
attention_mask	torch.Tensor	No	Mask of shape (batch_size, seq_length); 1 for non-masked, 0 for masked tokens
position_ids	torch.LongTensor	No	Position indices for RoPE computation
past_key_values	Tuple[Tuple[torch.FloatTensor]]	No	Cached key-value states for autoregressive decoding
inputs_embeds	torch.FloatTensor	No	Pre-computed input embeddings (used by multimodal pipeline)
labels	torch.LongTensor	No	Token labels for loss computation; -100 for ignored positions
use_cache	bool	No	Whether to return cached key-value states

Outputs

Name	Type	Description
loss	torch.FloatTensor	Cross-entropy loss (only when labels provided)
logits	torch.FloatTensor	Token prediction logits of shape (batch_size, seq_length, vocab_size)
past_key_values	Tuple	Cached key-value states for next decoding step
hidden_states	Tuple[torch.FloatTensor]	Hidden states from all layers (when output_hidden_states=True)
attentions	Tuple[torch.FloatTensor]	Attention weights from all layers (when output_attentions=True)

Usage Examples

Basic Usage

from transformers import AutoTokenizer
from internvl.model.internlm2.modeling_internlm2 import InternLM2ForCausalLM

model = InternLM2ForCausalLM.from_pretrained("path/to/internlm2")
tokenizer = AutoTokenizer.from_pretrained("path/to/internlm2")

# Interactive chat
response, history = model.chat(tokenizer, "What is machine learning?")
print(response)

# Streaming chat
for response, history in model.stream_chat(tokenizer, "Tell me about AI"):
    print(response)

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment