Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:OpenGVLab InternVL InternLM2 Model

From Leeroopedia
Revision as of 16:14, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/OpenGVLab_InternVL_InternLM2_Model.md)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)


Knowledge Sources
Domains Language Model, Transformer Architecture, Causal LM
Last Updated 2026-02-07 14:00 GMT

Overview

Full PyTorch implementation of the InternLM2 transformer language model, providing the decoder-only causal LM backbone for InternVL multimodal chat, with support for eager and Flash Attention, Grouped Query Attention, RoPE scaling variants, and interactive chat/streaming interfaces.

Description

This module implements the complete InternLM2 decoder-only transformer architecture adapted from LLaMA with several key modifications:

Fused QKV Projection: The attention layer uses a single wqkv linear projection that packs query, key, and value tensors, enabling Grouped Query Attention (GQA) where the number of key-value heads is configurable independently of query heads.

Normalization: Uses InternLM2RMSNorm (equivalent to T5LayerNorm) with optional Apex FusedRMSNorm acceleration when available.

Positional Encoding: Implements Rotary Position Embeddings (RoPE) with three variants: base (InternLM2RotaryEmbedding), linear scaling (InternLM2LinearScalingRotaryEmbedding), and Dynamic NTK scaling (InternLM2DynamicNTKScalingRotaryEmbedding) for extended context length support.

MLP: A gated SiLU feedforward network with three weight matrices (w1, w2, w3) following the SwiGLU pattern.

Attention Backends: Provides both eager attention (InternLM2Attention) with standard softmax computation and Flash Attention 2 (InternLM2FlashAttention2) with automatic padding/unpadding for variable-length sequences.

Model Heads: Three model variants are provided: InternLM2Model (base transformer), InternLM2ForCausalLM (with language model head, chat, and stream_chat methods using the im_start/im_end template), and InternLM2ForSequenceClassification.

Usage

Use this module as the language model backbone within the InternVL multimodal architecture. The InternLM2ForCausalLM class is typically instantiated via from_pretrained and processes the combined visual and textual token embeddings to generate multimodal responses. The chat and stream_chat methods provide convenient interfaces for interactive use.

Code Reference

Source Location

Signature

class InternLM2Model(InternLM2PreTrainedModel):
    def forward(self, input_ids=None, attention_mask=None, position_ids=None,
                past_key_values=None, inputs_embeds=None, use_cache=None,
                output_attentions=None, output_hidden_states=None,
                return_dict=None) -> Union[Tuple, BaseModelOutputWithPast]:
        ...

class InternLM2ForCausalLM(InternLM2PreTrainedModel):
    def forward(self, input_ids=None, attention_mask=None, position_ids=None,
                past_key_values=None, inputs_embeds=None, labels=None,
                use_cache=None, output_attentions=None,
                output_hidden_states=None, return_dict=None
                ) -> Union[Tuple, CausalLMOutputWithPast]:
        ...

    def chat(self, tokenizer, query, history=[], streamer=None,
             max_new_tokens=1024, do_sample=True, temperature=0.8, top_p=0.8,
             meta_instruction='...', **kwargs):
        ...

    def stream_chat(self, tokenizer, query, history=[], max_new_tokens=1024,
                    do_sample=True, temperature=0.8, top_p=0.8, **kwargs):
        ...

class InternLM2ForSequenceClassification(InternLM2PreTrainedModel):
    def forward(self, input_ids=None, ..., labels=None, ...
                ) -> Union[Tuple, SequenceClassifierOutputWithPast]:
        ...

Import

from internvl.model.internlm2.modeling_internlm2 import InternLM2ForCausalLM

I/O Contract

Inputs

Name Type Required Description
input_ids torch.LongTensor No Token indices of shape (batch_size, seq_length); mutually exclusive with inputs_embeds
attention_mask torch.Tensor No Mask of shape (batch_size, seq_length); 1 for non-masked, 0 for masked tokens
position_ids torch.LongTensor No Position indices for RoPE computation
past_key_values Tuple[Tuple[torch.FloatTensor]] No Cached key-value states for autoregressive decoding
inputs_embeds torch.FloatTensor No Pre-computed input embeddings (used by multimodal pipeline)
labels torch.LongTensor No Token labels for loss computation; -100 for ignored positions
use_cache bool No Whether to return cached key-value states

Outputs

Name Type Description
loss torch.FloatTensor Cross-entropy loss (only when labels provided)
logits torch.FloatTensor Token prediction logits of shape (batch_size, seq_length, vocab_size)
past_key_values Tuple Cached key-value states for next decoding step
hidden_states Tuple[torch.FloatTensor] Hidden states from all layers (when output_hidden_states=True)
attentions Tuple[torch.FloatTensor] Attention weights from all layers (when output_attentions=True)

Usage Examples

Basic Usage

from transformers import AutoTokenizer
from internvl.model.internlm2.modeling_internlm2 import InternLM2ForCausalLM

model = InternLM2ForCausalLM.from_pretrained("path/to/internlm2")
tokenizer = AutoTokenizer.from_pretrained("path/to/internlm2")

# Interactive chat
response, history = model.chat(tokenizer, "What is machine learning?")
print(response)

# Streaming chat
for response, history in model.stream_chat(tokenizer, "Tell me about AI"):
    print(response)

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment