Principle:OpenGVLab InternVL Transformer Decoder Architecture

Knowledge Sources	OpenGVLab_InternVL
Domains	Language Model, Transformer Architecture, Attention Mechanism
Last Updated	2026-02-07 14:00 GMT

Overview

The language model backends in InternVL share a decoder-only transformer architecture that combines pre-norm residual connections, RoPE positional encoding, Grouped Query Attention (GQA), and gated SiLU MLPs for autoregressive language generation.

Description

InternVL supports multiple language model backbones (InternLM2, Phi-3, MPT) that share a common decoder-only transformer design pattern. This principle describes the architectural elements shared across these implementations:

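Pre-Norm Residual Blocks: Each decoder layer applies RMSNorm to the input of the attention sublayer and again to the input of the feedforward sublayer. The residual connection adds each sublayer's output to its unnormalized input, so the residual stream itself is never normalized inside a layer. This pre-norm design generally trains more stably than post-norm designs, which normalize after the residual addition.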
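The pre-norm data flow can be sketched in a few lines of plain Python. This is an illustrative sketch for a single hidden vector, not code from any InternVL backend: the names rms_norm and pre_norm_block are hypothetical, and attn and mlp are placeholder callables standing in for the real sublayers.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Divide x by its root mean square, then apply a learned per-dimension weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * inv_rms for w, v in zip(weight, x)]

def pre_norm_block(x, attn, mlp, attn_norm_weight, mlp_norm_weight):
    """One decoder layer: h = x + attn(norm(x)), then out = h + mlp(norm(h)).

    attn and mlp are hypothetical placeholder callables (assumptions for this
    sketch); only the normalization and residual wiring is the point here.
    """
    # Normalize only the branch input; the residual path carries x unchanged.
    h = [a + b for a, b in zip(x, attn(rms_norm(x, attn_norm_weight)))]
    return [a + b for a, b in zip(h, mlp(rms_norm(h, mlp_norm_weight)))]
```

With sublayers that return zeros, the block returns its input unchanged, which shows that the residual path is an identity mapping that bypasses both normalizations.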

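Rotary Position Embeddings (RoPE): Position information is injected by rotating pairs of query and key dimensions by angles proportional to the token position, with one rotation frequency per pair drawn from a geometric series over a base (commonly 10000). Because both queries and keys are rotated, their dot product depends only on the relative offset between positions. Multiple scaling variants extend the effective context length: linear scaling (dividing positions by a factor), Dynamic NTK scaling (enlarging the frequency base when the sequence exceeds the trained length), SU scaling (per-dimension short/long frequency factors), and YaRN scaling (logarithmic scaling factor).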
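As a rough illustration of the rotation and of two of the scaling variants, the sketch below applies RoPE to a single head vector. It assumes the "rotate half" pairing (dimension i paired with i + head_dim/2) and a Dynamic NTK base formula of the form used in common Hugging Face-style implementations; the function names and default values are assumptions, not taken from the InternVL source. SU and YaRN scaling are not shown.

```python
import math

def rope_inverse_frequencies(head_dim, base=10000.0):
    """One rotation frequency per dimension pair: base^(-2i / head_dim)."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def apply_rope(vec, position, base=10000.0, linear_factor=1.0):
    """Rotate a query or key vector for the given position.

    linear_factor > 1 implements linear scaling by dividing the position.
    Pairs dimension i with i + head_dim/2 (an assumed layout for this sketch).
    """
    half = len(vec) // 2
    pos = position / linear_factor
    out = [0.0] * len(vec)
    for i, freq in enumerate(rope_inverse_frequencies(len(vec), base)):
        c, s = math.cos(pos * freq), math.sin(pos * freq)
        x1, x2 = vec[i], vec[i + half]
        out[i] = x1 * c - x2 * s
        out[i + half] = x2 * c + x1 * s
    return out

def dynamic_ntk_base(base, head_dim, seq_len, trained_len, factor):
    """Enlarge the base once seq_len exceeds the trained length (assumed formula)."""
    if seq_len <= trained_len:
        return base
    ratio = factor * seq_len / trained_len - (factor - 1.0)
    return base * ratio ** (head_dim / (head_dim - 2))
```

Rotating a query at position m and a key at position n gives a dot product that depends only on m - n, which is why the same attention pattern holds wherever a pair of tokens sits in the sequence.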

Grouped Query Attention (GQA): In standard multi-head attention, every query head has its own key and value head. GQA instead shares each key-value head across a group of query heads, which shrinks the KV cache during inference in proportion to the group size. The extreme case is Multi-Query Attention (available as an attention type in MPT), where all query heads share a single key-value head.
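The head sharing and its effect on cache size can be illustrated with two small helpers. This is a sketch under the assumption that consecutive query heads form a group; the function names and the example head counts are hypothetical and do not describe any particular InternVL checkpoint.

```python
def kv_head_for_query_head(query_head, num_query_heads, num_kv_heads):
    """Index of the key-value head that a given query head attends with."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be a multiple of num_kv_heads")
    group_size = num_query_heads // num_kv_heads
    # Assumption: consecutive query heads share one key-value head.
    return query_head // group_size

def kv_cache_elements(num_layers, seq_len, num_kv_heads, head_dim):
    """Number of cached values: one key and one value tensor per layer."""
    return 2 * num_layers * seq_len * num_kv_heads * head_dim

# 8 query heads over 2 key-value heads: two groups of four.
mapping = [kv_head_for_query_head(h, 8, 2) for h in range(8)]
```

Setting num_kv_heads equal to num_query_heads recovers standard multi-head attention, and setting it to 1 gives Multi-Query Attention. The cache size scales linearly with num_kv_heads, so going from 32 to 8 key-value heads cuts it by a factor of four.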

Gated MLP: The feedforward network uses the SwiGLU pattern: input is projected to an intermediate dimension through two parallel paths (gate and up-projection), the gate path applies a SiLU activation, the results are element-wise multiplied, and a down-projection reduces back to model dimension.
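The gated feedforward computation described above can be written as down(SiLU(gate(x)) * up(x)). The following is a minimal sketch with plain lists and no bias terms; the weight names w_gate, w_up, and w_down are illustrative assumptions, and real backends name and fuse these projections differently.

```python
import math

def silu(x):
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

def gated_mlp(x, w_gate, w_up, w_down):
    """SwiGLU-style feedforward: down(silu(gate(x)) * up(x)).

    w_gate and w_up map model dim -> intermediate dim;
    w_down maps intermediate dim -> model dim.
    """
    gate = [silu(g) for g in matvec(w_gate, x)]   # activated gate path
    up = matvec(w_up, x)                          # linear up-projection path
    hidden = [g * u for g, u in zip(gate, up)]    # element-wise gating
    return matvec(w_down, hidden)
```

Because the gate path is multiplied into the up path, an intermediate unit whose gate is zero contributes nothing to the output regardless of its up-projection value.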

Attention Backends: Implementations typically provide multiple attention computation backends: eager (standard PyTorch matmul with explicit softmax), Flash Attention 2 (fused kernel with IO-aware memory management), SDPA (PyTorch's native scaled_dot_product_attention), and Triton-based (custom compiled kernels).
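For reference, the computation that the eager backend performs can be sketched for a single head in plain Python. This is an illustrative sketch only: the function name is hypothetical, batching and multiple heads are omitted, and causal masking is expressed by restricting each query to earlier positions rather than by adding a mask tensor.

```python
import math

def eager_causal_attention(queries, keys, values):
    """Single-head causal attention over lists of per-token vectors."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for t, q in enumerate(queries):
        # Causal mask: token t only sees positions 0..t.
        scores = [scale * sum(a * b for a, b in zip(q, keys[s]))
                  for s in range(t + 1)]
        # Numerically stable softmax over the visible positions.
        max_score = max(scores)
        exps = [math.exp(s - max_score) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the visible value vectors.
        outputs.append([sum(w * values[s][d] for s, w in enumerate(weights))
                        for d in range(len(values[0]))])
    return outputs
```

The fused backends are intended to produce the same result up to numerical precision; they differ in how the computation is scheduled, for example by avoiding materializing the full score matrix in memory.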

Usage

Apply this architectural pattern when implementing or understanding the language model components of InternVL. The pattern governs how textual and visual token embeddings are processed to generate multimodal responses. Each backend (InternLM2, Phi-3, MPT) follows this pattern with specific variations in projection organization, normalization placement, and attention implementation.

Theoretical Basis

The decoder-only transformer architecture originates from the GPT family of models and has become the dominant paradigm for large language models. Key theoretical foundations include:

Attention Is All You Need (Vaswani et al., 2017) for the base transformer architecture
RoPE (Su et al., 2021) for rotary position embeddings
GQA (Ainslie et al., 2023) for grouped query attention
SwiGLU (Shazeer, 2020) for gated linear unit feedforward networks
FlashAttention (Dao et al., 2022) for IO-aware exact attention computation
ALiBi (Press et al., 2022) for attention with linear biases (used in MPT)
