Principle:OpenGVLab InternVL Vision Transformer Backbone
| Knowledge Sources | |
|---|---|
| Domains | Vision Transformer, Backbone Architecture, Semantic Segmentation |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
Vision Transformer Backbone is the principle of using large-scale vision transformers as feature extractors for dense prediction tasks like semantic segmentation, with multi-level feature extraction and flexible resolution handling.
Description
The InternViT-6B backbone demonstrates several key design principles for adapting vision transformers to dense prediction:
- Multi-level feature extraction: By extracting features at multiple intermediate block indices (e.g., blocks 7, 11, 15, 23 out of 48), the backbone provides a hierarchy of features at different semantic levels, analogous to the multi-scale feature pyramids from convolutional backbones. This is essential for decoder architectures like UPerNet that fuse features at multiple resolutions.
- Position embedding interpolation: Bicubic interpolation of learned position embeddings allows the model to accept input resolutions different from the pretraining resolution. This enables training at higher resolutions (e.g., 448px) using weights pretrained at lower resolutions (e.g., 224px) without retraining the position embeddings from scratch.
- Training stability at scale: At 6 billion parameters, several techniques are employed for stable training: RMSNorm (a LayerNorm variant that omits mean-centering and rescales activations by their root mean square), QK normalization (normalizing query and key vectors before the attention computation to prevent attention logit explosion), LayerScale (learnable per-channel scaling of each residual branch, initialized to small values), and BFloat16 precision (wider dynamic range than Float16, at the cost of fewer mantissa bits).
- Efficient computation: FlashAttention provides memory-efficient attention computation with IO-awareness, and gradient checkpointing (with_cp) trades compute for memory by recomputing activations during the backward pass.
- Optional FPN integration: A Feature Pyramid Network neck using ConvTranspose2d upsampling converts single-resolution transformer features into multi-scale feature maps compatible with standard segmentation decoders.
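The multi-level extraction and FPN neck described in the list above can be sketched with a toy model. This is a minimal illustration, not the InternViT-6B implementation: the class name `TinyViTBackbone`, the small dimensions, the choice of `out_indices`, and the exact neck layout (4x and 2x upsampling, identity, 2x pooling) are assumptions, and position embeddings and the class token are omitted for brevity.

```python
import torch
import torch.nn as nn


class TinyViTBackbone(nn.Module):
    """Toy ViT returning multi-scale maps from selected blocks (hypothetical)."""

    def __init__(self, embed_dim=64, depth=8, num_heads=4, patch_size=16,
                 out_indices=(1, 3, 5, 7)):
        super().__init__()
        # Non-overlapping patch embedding: one token per patch_size x patch_size patch.
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size,
                                     stride=patch_size)
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(embed_dim, num_heads,
                                       dim_feedforward=4 * embed_dim,
                                       batch_first=True, norm_first=True)
            for _ in range(depth)
        ])
        self.out_indices = set(out_indices)
        # FPN-style neck (assumed layout): turn four stride-16 maps into
        # strides 4, 8, 16, and 32 respectively.
        self.neck = nn.ModuleList([
            nn.Sequential(
                nn.ConvTranspose2d(embed_dim, embed_dim, kernel_size=2, stride=2),
                nn.GELU(),
                nn.ConvTranspose2d(embed_dim, embed_dim, kernel_size=2, stride=2),
            ),
            nn.ConvTranspose2d(embed_dim, embed_dim, kernel_size=2, stride=2),
            nn.Identity(),
            nn.MaxPool2d(kernel_size=2, stride=2),
        ])

    def forward(self, x):
        x = self.patch_embed(x)                    # (B, C, H/16, W/16)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)      # (B, h*w, C)
        feats = []
        for i, blk in enumerate(self.blocks):
            tokens = blk(tokens)
            if i in self.out_indices:
                # Tokens back to a 2D map; every level has the same resolution here.
                feats.append(tokens.transpose(1, 2).reshape(b, c, h, w))
        # The neck gives each level its own spatial scale.
        return [op(f) for op, f in zip(self.neck, feats)]
```

For a 64x64 input this toy model yields four maps with spatial sizes 16, 8, 4, and 2, which is the kind of pyramid a decoder such as UPerNet expects. Without the neck, all four levels would share the 4x4 token grid and differ only in depth.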
Usage
Apply this principle when using large vision transformers for dense prediction tasks. The multi-level extraction, position embedding interpolation, and training stability techniques are broadly applicable to any ViT-based segmentation or detection system.
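When applying a pretrained backbone at a new input resolution, the position embedding interpolation step might look like the following sketch. The function name `resize_pos_embed`, its arguments, and the assumption of a single leading class token are illustrative and not taken from the InternVL code base.

```python
import torch
import torch.nn.functional as F


def resize_pos_embed(pos_embed, old_grid, new_grid, num_prefix_tokens=1):
    """Bicubically resize a (1, prefix + old_h * old_w, C) position embedding.

    Prefix tokens (e.g. a class token) have no spatial location, so they are
    copied unchanged. Hypothetical helper for illustration.
    """
    prefix = pos_embed[:, :num_prefix_tokens]
    grid = pos_embed[:, num_prefix_tokens:]
    old_h, old_w = old_grid
    new_h, new_w = new_grid
    channels = grid.shape[-1]
    # (1, N, C) -> (1, C, H, W) so F.interpolate treats the grid as an image.
    grid = grid.reshape(1, old_h, old_w, channels).permute(0, 3, 1, 2)
    # Interpolate in float32, then cast back (useful for bfloat16 weights).
    grid = F.interpolate(grid.float(), size=(new_h, new_w),
                         mode="bicubic", align_corners=False)
    grid = grid.permute(0, 2, 3, 1).reshape(1, new_h * new_w, channels)
    return torch.cat([prefix, grid.to(pos_embed.dtype)], dim=1)


# Example: assuming a patch size of 14, a 224px input gives a 16x16 grid
# and a 448px input gives a 32x32 grid.
pretrained = torch.randn(1, 1 + 16 * 16, 32)
resized = resize_pos_embed(pretrained, (16, 16), (32, 32))
print(resized.shape)
```

The resized embedding serves as an initialization; fine-tuning at the new resolution then adapts it further.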
Theoretical Basis
Vision transformers process images as sequences of patch tokens at a single resolution, so they lack the inherent multi-scale structure of convolutional networks. The multi-level extraction strategy recovers this by treating different transformer depths as analogous to different "stages" in a ConvNet. Position embedding interpolation leverages the smoothness of learned positional encodings -- nearby positions have similar embeddings, so bicubic interpolation produces reasonable initializations for unseen positions. QK normalization addresses the observation that in very large transformers, the magnitude of query-key dot products (the attention logits) can grow very large during training, causing the softmax attention distributions to become overly peaked; normalizing queries and keys bounds the logits regardless of the scale of the incoming activations.
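The effect of QK normalization on attention logits can be illustrated with a small sketch. The classes `RMSNorm` and `QKNormAttention` below are written for this example; applying the norm per head over `head_dim` is an assumption, and the actual InternViT code may normalize over a different dimension.

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Rescale by the root mean square of the last dimension (no mean, no bias)."""

    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        inv_rms = x.float().pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return (x.float() * inv_rms).to(x.dtype) * self.weight


class QKNormAttention(nn.Module):
    """Self-attention with RMSNorm applied to queries and keys (illustrative)."""

    def __init__(self, dim, num_heads):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.q_norm = RMSNorm(self.head_dim)  # assumption: per-head normalization
        self.k_norm = RMSNorm(self.head_dim)
        self.proj = nn.Linear(dim, dim)

    def attention_logits(self, x):
        b, n, _ = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4).unbind(0)  # each (B, heads, N, head_dim)
        # Normalizing q and k fixes their norms, so logits no longer scale
        # with the magnitude of the input activations.
        q, k = self.q_norm(q), self.k_norm(k)
        logits = (q @ k.transpose(-2, -1)) * self.head_dim ** -0.5
        return logits, v

    def forward(self, x):
        b, n, c = x.shape
        logits, v = self.attention_logits(x)
        out = logits.softmax(dim=-1) @ v
        return self.proj(out.transpose(1, 2).reshape(b, n, c))
```

With the norm weights at their initial value of one, each normalized query and key has a norm of at most the square root of `head_dim`, so every logit is bounded in magnitude by the square root of `head_dim` even if the input activations are scaled up by several orders of magnitude. The learnable weights let the model relax this bound during training.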