Principle:OpenGVLab InternVL Transformer Decoder Architecture
| Knowledge Sources | |
|---|---|
| Domains | Language Model, Transformer Architecture, Attention Mechanism |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
This principle describes the decoder-only transformer architecture pattern used across multiple language model backends in InternVL. The pattern combines pre-norm residual connections, RoPE positional encoding, Grouped Query Attention (GQA), and gated SiLU MLPs for autoregressive language generation.
Description
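InternVL supports multiple language model backbones (InternLM2, Phi-3, MPT) that share a common decoder-only transformer design pattern. Not every backbone uses every element (MPT, for example, uses ALiBi attention biases for position information), so the list below describes the pattern as a whole rather than any single implementation: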
Pre-Norm Residual Blocks: Each decoder layer applies RMSNorm to the input of the attention and feedforward sublayers, and each residual connection adds the sublayer output back to the un-normalized input. This pre-norm design generally improves training stability compared to post-norm approaches, where normalization follows the residual addition.
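The pre-norm ordering can be sketched in a few lines of NumPy. This is a minimal illustration, not code from any InternVL backend: the function names `rms_norm` and `pre_norm_block`, the `eps` default, and the use of plain callables for the sublayers are all assumptions made for the example.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # Divide each vector by its root-mean-square, then apply a learned gain.
    variance = np.mean(x * x, axis=-1, keepdims=True)
    return weight * x / np.sqrt(variance + eps)

def pre_norm_block(x, attn_fn, mlp_fn, w_attn_norm, w_mlp_norm):
    # Hypothetical decoder layer: normalize, run the sublayer,
    # then add the result to the un-normalized input.
    h = x + attn_fn(rms_norm(x, w_attn_norm))
    return h + mlp_fn(rms_norm(h, w_mlp_norm))
```

Because normalization sits inside each branch, the residual path carries the input through the layer unchanged; if both sublayers returned zeros, the block would reduce to the identity.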
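Rotary Position Embeddings (RoPE): Position information is injected by rotating pairs of query and key dimensions through angles proportional to the token position, with one rotation frequency per pair taken from a geometric progression. Because both queries and keys are rotated, their dot product depends only on the relative offset between positions. Multiple scaling variants extend the effective context length: linear scaling (dividing positions by a factor), Dynamic NTK scaling (adjusting the frequency base), SU scaling (per-dimension short/long frequency factors), and YaRN scaling (a scaling factor that grows logarithmically with the context extension ratio).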
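The rotation and the linear scaling variant can be illustrated as follows. This sketch assumes the "rotate half" layout, in which dimension `i` is paired with dimension `i + head_dim / 2`, and a frequency base of 10000; the function names and the `linear_factor` parameter are hypothetical and chosen for the example only. The other scaling variants are not shown.

```python
import numpy as np

def rope_angles(seq_len, head_dim, base=10000.0, linear_factor=1.0):
    # One frequency per pair of dimensions: base ** (-2i / head_dim).
    inv_freq = 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))
    # Linear scaling divides the positions by a constant factor.
    positions = np.arange(seq_len) / linear_factor
    freqs = np.outer(positions, inv_freq)             # (seq_len, head_dim / 2)
    angles = np.concatenate([freqs, freqs], axis=-1)  # (seq_len, head_dim)
    return np.cos(angles), np.sin(angles)

def rotate_half(x):
    # Map (x1, x2) to (-x2, x1), where x1 and x2 are the two halves of x.
    half = x.shape[-1] // 2
    return np.concatenate([-x[..., half:], x[..., :half]], axis=-1)

def apply_rope(x, cos, sin):
    # 2D rotation of every (x[i], x[i + half]) pair by its position angle.
    return x * cos + rotate_half(x) * sin
```

With this construction, the score between a query at position 1 and a key at position 3 equals the score for the same vectors at positions 2 and 4, and a linear factor of 2 makes position 4 receive the angles that position 2 has without scaling.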
Grouped Query Attention (GQA): In standard multi-head attention, each head has its own key and value projections. GQA instead shares each key-value head across a group of query heads, which shrinks the KV cache during inference in proportion to the group size. The extreme case is Multi-Query Attention (used in MPT), where all query heads share a single key-value head.
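The head sharing can be sketched as an eager, unbatched causal attention in NumPy. The names `repeat_kv` and `gqa_attention` and the `(heads, seq, dim)` tensor layout are assumptions for this example; real backends add batching, KV caching, and fused kernels.

```python
import numpy as np

def repeat_kv(x, n_rep):
    # (num_kv_heads, seq, dim) -> (num_kv_heads * n_rep, seq, dim),
    # so consecutive query heads reuse the same key-value head.
    return np.repeat(x, n_rep, axis=0)

def gqa_attention(q, k, v):
    # q: (num_q_heads, seq, dim); k, v: (num_kv_heads, seq, dim)
    num_q_heads, seq_len, head_dim = q.shape
    n_rep = num_q_heads // k.shape[0]
    k, v = repeat_kv(k, n_rep), repeat_kv(v, n_rep)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    # Causal mask: a token may not attend to later positions.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Passing `k` and `v` with a single head turns the same function into Multi-Query Attention, and passing as many key-value heads as query heads recovers standard multi-head attention. Only the un-repeated `k` and `v` would need to be cached, which is where the memory saving comes from.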
Gated MLP: The feedforward network uses the SwiGLU pattern: input is projected to an intermediate dimension through two parallel paths (gate and up-projection), the gate path applies a SiLU activation, the results are element-wise multiplied, and a down-projection reduces back to model dimension.
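A bias-free version of this feedforward path fits in a few lines of NumPy. The weight names `w_gate`, `w_up`, and `w_down` are illustrative assumptions; individual backends may name the projections differently or fuse the gate and up projections into one matrix.

```python
import numpy as np

def silu(x):
    # SiLU (swish): x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def gated_mlp(x, w_gate, w_up, w_down):
    # x: (seq, d_model); w_gate, w_up: (d_model, d_ff); w_down: (d_ff, d_model)
    gate = silu(x @ w_gate)      # activated gate path
    up = x @ w_up                # linear up-projection path
    return (gate * up) @ w_down  # element-wise product, then down-projection
```

The gate path acts as a learned, input-dependent mask on the up-projection: wherever the gate activation is near zero, the corresponding intermediate feature contributes nothing to the output.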
Attention Backends: Implementations typically provide multiple attention computation backends: eager (standard PyTorch matmul with explicit softmax), Flash Attention 2 (fused kernel with IO-aware memory management), SDPA (PyTorch's native scaled_dot_product_attention), and Triton-based (custom compiled kernels).
Usage
Apply this architectural pattern when implementing or understanding the language model components of InternVL. The pattern governs how textual and visual token embeddings are processed to generate multimodal responses. Each backend (InternLM2, Phi-3, MPT) follows this pattern with specific variations in projection organization, normalization placement, and attention implementation.
Theoretical Basis
The decoder-only transformer architecture originates from the GPT family of models and has become the dominant paradigm for large language models. Key theoretical foundations include:
- Attention Is All You Need (Vaswani et al., 2017) for the base transformer architecture
- RoPE (Su et al., 2021) for rotary position embeddings
- GQA (Ainslie et al., 2023) for grouped query attention
- SwiGLU (Shazeer, 2020) for gated linear unit feedforward networks
- FlashAttention (Dao et al., 2022) for IO-aware exact attention computation
- ALiBi (Press et al., 2022) for attention with linear biases (used in MPT)