Implementation:OpenGVLab InternVL InternLM2 Model
| Knowledge Sources | |
|---|---|
| Domains | Language Model, Transformer Architecture, Causal LM |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Full PyTorch implementation of the InternLM2 transformer language model, providing the decoder-only causal LM backbone for InternVL multimodal chat, with support for eager and Flash Attention, Grouped Query Attention, RoPE scaling variants, and interactive chat/streaming interfaces.
Description
This module implements the complete InternLM2 decoder-only transformer architecture adapted from LLaMA with several key modifications:
Fused QKV Projection: The attention layer uses a single wqkv linear projection that packs query, key, and value tensors, enabling Grouped Query Attention (GQA) where the number of key-value heads is configurable independently of query heads.
Normalization: Uses InternLM2RMSNorm (equivalent to T5LayerNorm) with optional Apex FusedRMSNorm acceleration when available.
Positional Encoding: Implements Rotary Position Embeddings (RoPE) with three variants: base (InternLM2RotaryEmbedding), linear scaling (InternLM2LinearScalingRotaryEmbedding), and Dynamic NTK scaling (InternLM2DynamicNTKScalingRotaryEmbedding) for extended context length support.
MLP: A gated SiLU feedforward network with three weight matrices (w1, w2, w3) following the SwiGLU pattern.
Attention Backends: Provides both eager attention (InternLM2Attention) with standard softmax computation and Flash Attention 2 (InternLM2FlashAttention2) with automatic padding/unpadding for variable-length sequences.
Model Heads: Three model variants are provided: InternLM2Model (base transformer), InternLM2ForCausalLM (with language model head, chat, and stream_chat methods using the im_start/im_end template), and InternLM2ForSequenceClassification.
Usage
Use this module as the language model backbone within the InternVL multimodal architecture. The InternLM2ForCausalLM class is typically instantiated via from_pretrained and processes the combined visual and textual token embeddings to generate multimodal responses. The chat and stream_chat methods provide convenient interfaces for interactive use.
Code Reference
Source Location
- Repository: OpenGVLab_InternVL
- File: internvl_chat/internvl/model/internlm2/modeling_internlm2.py
- Lines: 1-1429
Signature
class InternLM2Model(InternLM2PreTrainedModel):
def forward(self, input_ids=None, attention_mask=None, position_ids=None,
past_key_values=None, inputs_embeds=None, use_cache=None,
output_attentions=None, output_hidden_states=None,
return_dict=None) -> Union[Tuple, BaseModelOutputWithPast]:
...
class InternLM2ForCausalLM(InternLM2PreTrainedModel):
def forward(self, input_ids=None, attention_mask=None, position_ids=None,
past_key_values=None, inputs_embeds=None, labels=None,
use_cache=None, output_attentions=None,
output_hidden_states=None, return_dict=None
) -> Union[Tuple, CausalLMOutputWithPast]:
...
def chat(self, tokenizer, query, history=[], streamer=None,
max_new_tokens=1024, do_sample=True, temperature=0.8, top_p=0.8,
meta_instruction='...', **kwargs):
...
def stream_chat(self, tokenizer, query, history=[], max_new_tokens=1024,
do_sample=True, temperature=0.8, top_p=0.8, **kwargs):
...
class InternLM2ForSequenceClassification(InternLM2PreTrainedModel):
def forward(self, input_ids=None, ..., labels=None, ...
) -> Union[Tuple, SequenceClassifierOutputWithPast]:
...
Import
from internvl.model.internlm2.modeling_internlm2 import InternLM2ForCausalLM
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| input_ids | torch.LongTensor | No | Token indices of shape (batch_size, seq_length); mutually exclusive with inputs_embeds |
| attention_mask | torch.Tensor | No | Mask of shape (batch_size, seq_length); 1 for non-masked, 0 for masked tokens |
| position_ids | torch.LongTensor | No | Position indices for RoPE computation |
| past_key_values | Tuple[Tuple[torch.FloatTensor]] | No | Cached key-value states for autoregressive decoding |
| inputs_embeds | torch.FloatTensor | No | Pre-computed input embeddings (used by multimodal pipeline) |
| labels | torch.LongTensor | No | Token labels for loss computation; -100 for ignored positions |
| use_cache | bool | No | Whether to return cached key-value states |
Outputs
| Name | Type | Description |
|---|---|---|
| loss | torch.FloatTensor | Cross-entropy loss (only when labels provided) |
| logits | torch.FloatTensor | Token prediction logits of shape (batch_size, seq_length, vocab_size) |
| past_key_values | Tuple | Cached key-value states for next decoding step |
| hidden_states | Tuple[torch.FloatTensor] | Hidden states from all layers (when output_hidden_states=True) |
| attentions | Tuple[torch.FloatTensor] | Attention weights from all layers (when output_attentions=True) |
Usage Examples
Basic Usage
from transformers import AutoTokenizer
from internvl.model.internlm2.modeling_internlm2 import InternLM2ForCausalLM
model = InternLM2ForCausalLM.from_pretrained("path/to/internlm2")
tokenizer = AutoTokenizer.from_pretrained("path/to/internlm2")
# Interactive chat
response, history = model.chat(tokenizer, "What is machine learning?")
print(response)
# Streaming chat
for response, history in model.stream_chat(tokenizer, "Tell me about AI"):
print(response)