Implementation:OpenGVLab InternVL InternViT 6B Segmentation Backbone
| Knowledge Sources | |
|---|---|
| Domains | Vision Transformer, Semantic Segmentation, Backbone Architecture |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
This module implements the InternViT-6B vision transformer as an MMSegmentation-compatible backbone for semantic segmentation.
Description
The intern_vit_6b.py file defines the complete InternViT6B architecture, a 6-billion-parameter vision transformer registered as an MMSegmentation backbone via @BACKBONES.register_module(). The architecture features:
Core Components:
- PatchEmbed: Converts 2D images to patch embeddings using a convolutional projection (default patch_size=14)
- Block: Transformer blocks with pre-norm architecture using RMSNorm (with optional Apex FusedRMSNorm), LayerScale for residual scaling, DropPath for stochastic depth, and gradient checkpointing (with_cp)
- Attention: Multi-head self-attention with two code paths (naive attention and FlashAttention), plus optional QK normalization for training stability at scale
- Mlp: Standard two-layer MLP with GELU activation
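The pre-norm pattern used by Block can be illustrated with a minimal pure-Python sketch. This is not the code from intern_vit_6b.py: the function names, the `eps` value, and the identity sublayer are assumptions chosen for illustration, and DropPath is left out. It only shows how RMSNorm rescales a token by its root mean square and how LayerScale weights the residual branch.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Divide x by its root mean square, then apply a per-channel weight.
    The eps value is an assumption, not taken from the source file."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

def pre_norm_residual(x, sublayer, norm_weight, gamma):
    """One pre-norm residual branch: x + gamma * sublayer(rms_norm(x))."""
    branch = sublayer(rms_norm(x, norm_weight))
    return [xi + g * bi for xi, g, bi in zip(x, gamma, branch)]

# Toy 4-dimensional token with an identity sublayer standing in for
# attention or the MLP (hypothetical, for illustration only).
token = [1.0, -2.0, 3.0, -4.0]
ones = [1.0] * 4
gamma = [0.1] * 4  # mirrors the documented init_values=0.1
normed = rms_norm(token, ones)
out = pre_norm_residual(token, lambda v: v, ones, gamma)
```

Because the normalization happens before the sublayer, the residual path carries the unnormalized input through unchanged, and a small `gamma` keeps each block close to the identity at initialization.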
Architecture Configuration (defaults):
- embed_dim=3200, num_heads=25, depth=48, mlp_ratio=4, init_values=0.1
- RMSNorm for block normalization, QK normalization enabled
- LayerScale with configurable force_fp32 mode
- Learnable CLS token and position embeddings
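As a rough sanity check on these defaults, the arithmetic below derives the per-head dimension and an approximate weight count for the transformer blocks. It is a back-of-the-envelope sketch that counts only the large attention and MLP weight matrices and ignores biases, norms, LayerScale, the patch embedding, and position embeddings.

```python
embed_dim, num_heads, depth, mlp_ratio = 3200, 25, 48, 4

# Each head attends over embed_dim / num_heads channels.
assert embed_dim % num_heads == 0
head_dim = embed_dim // num_heads

# Attention: fused qkv projection (3 * d * d) plus output projection (d * d).
attn_params = 4 * embed_dim * embed_dim
# MLP: two linear layers, d -> mlp_ratio * d -> d.
mlp_params = 2 * mlp_ratio * embed_dim * embed_dim

block_params = attn_params + mlp_params
total_block_params = depth * block_params

print(f"head_dim = {head_dim}")
print(f"block weights ~ {total_block_params / 1e9:.2f}B")
```

Under these simplifications the blocks alone account for roughly 5.9 billion weights, which is consistent with the "6B" in the model name.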
Key Features:
- Multi-level feature extraction via configurable out_indices (e.g., [7, 11, 15, 23]) for UPerNet-style decoders
- Optional FPN neck with ConvTranspose2d upsampling layers (up1 through up4) producing multi-scale feature maps
- Bicubic interpolation of position embeddings, so the input resolution (img_size) can differ from pretrain_size
- Interpolation of patch embedding weights in init_weights when patch_size differs from the one used by the checkpoint
- BFloat16 default precision (self.to(torch.bfloat16) at initialization)
- freeze_vit support for frozen backbone evaluation
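To make the position embedding interpolation concrete, the sketch below computes how many tokens the position embedding table has to cover at the pretraining resolution and at a larger input resolution. The helper `pos_embed_grid` is hypothetical, and it assumes square inputs, sizes divisible by the patch size, and one extra entry for the CLS token.

```python
def pos_embed_grid(size, patch_size=14):
    """Return (grid side, token count including CLS) for a square input.
    Hypothetical helper; not part of intern_vit_6b.py."""
    if size % patch_size != 0:
        raise ValueError(f"size {size} is not divisible by patch_size {patch_size}")
    side = size // patch_size
    return side, side * side + 1  # +1 for the CLS token (assumed layout)

pretrain_side, pretrain_tokens = pos_embed_grid(224)
target_side, target_tokens = pos_embed_grid(448)

print(f"pretrain grid: {pretrain_side}x{pretrain_side}, tokens: {pretrain_tokens}")
print(f"target grid:   {target_side}x{target_side}, tokens: {target_tokens}")
```

With pretrain_size=224 and img_size=448, the patch part of the table would be resampled from a 16x16 grid to a 32x32 grid, which is the step bicubic interpolation performs.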
Usage
Use this module as the vision backbone in MMSegmentation configurations for semantic segmentation experiments with InternVL. It is the central backbone that all InternVL segmentation configs build upon, typically paired with UPerNet or similar decoders.
Code Reference
Source Location
- Repository: OpenGVLab_InternVL
- File: segmentation/mmseg_custom/models/backbones/intern_vit_6b.py
- Lines: 1-417
Signature
@BACKBONES.register_module()
class InternViT6B(BaseModule):
    def __init__(self, in_chans=3, patch_size=14, img_size=224,
                 pretrain_size=224, qkv_bias=False, drop_path_rate=0.0,
                 embed_dim=3200, num_heads=25, mlp_ratio=4,
                 init_values=0.1, qk_normalization=True, depth=48,
                 use_flash_attn=True, with_cp=True,
                 layerscale_force_fp32=False,
                 out_indices=[7, 11, 15, 23],
                 freeze_vit=False, with_fpn=False,
                 with_final_norm=False, pretrained=None): ...

    def forward(self, x) -> list[Tensor]: ...

class Attention(nn.Module):
    def __init__(self, dim, num_heads=8, qkv_bias=False,
                 use_flash_attn=False, qk_normalization=False): ...

class Block(nn.Module):
    def __init__(self, dim, num_heads, mlp_ratio=4.,
                 drop_path=0., init_values=None,
                 use_flash_attn=False, with_cp=False): ...
Import
from mmseg_custom.models.backbones.intern_vit_6b import InternViT6B
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| x | Tensor | Yes | Input image tensor of shape (B, C, H, W); the only argument to forward(). The remaining rows are constructor arguments. |
| img_size | int | No | Input image size (default: 224) |
| patch_size | int | No | Patch size for tokenization (default: 14) |
| embed_dim | int | No | Embedding dimension (default: 3200) |
| depth | int | No | Number of transformer blocks (default: 48) |
| out_indices | list[int] | No | Block indices for multi-level feature output (default: [7, 11, 15, 23]) |
| pretrained | str | No | Path to pretrained checkpoint for weight initialization |
| freeze_vit | bool | No | Freeze all backbone parameters (default: False) |
| with_fpn | bool | No | Add FPN neck with ConvTranspose2d upsampling (default: False) |
Outputs
| Name | Type | Description |
|---|---|---|
| features | list[Tensor] | One feature map per entry in out_indices, each of shape (B, embed_dim, H/patch_size, W/patch_size), cast to float32. With with_fpn=True: 4 multi-scale feature maps at different resolutions. |
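The output contract for the non-FPN case can be written down as a small shape calculation. `expected_feature_shapes` is a hypothetical helper, not a function in the repository, and it assumes H and W are multiples of patch_size. It does not model the FPN path, whose per-level scale factors are not stated here.

```python
def expected_feature_shapes(batch, height, width, patch_size=14,
                            embed_dim=3200, out_indices=(7, 11, 15, 23)):
    """Shapes of the feature maps returned when with_fpn=False.
    Hypothetical helper that follows the I/O contract described above."""
    grid_h, grid_w = height // patch_size, width // patch_size
    # Every selected block emits a map at the same patch-grid resolution.
    return [(batch, embed_dim, grid_h, grid_w) for _ in out_indices]

shapes = expected_feature_shapes(batch=2, height=448, width=448)
for idx, shape in zip((7, 11, 15, 23), shapes):
    print(f"block {idx}: {shape}")
```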
Usage Examples
Basic Usage
# In MMSegmentation config:
# model = dict(
#     type='EncoderDecoder',
#     backbone=dict(
#         type='InternViT6B',
#         pretrain_size=224,
#         img_size=448,
#         patch_size=14,
#         embed_dim=3200,
#         depth=48,
#         num_heads=25,
#         mlp_ratio=4,
#         qkv_bias=False,
#         qk_normalization=True,
#         use_flash_attn=True,
#         with_cp=True,
#         out_indices=[7, 15, 23, 47],
#         with_fpn=True,
#         freeze_vit=True,
#         pretrained='path/to/internvit-6b.pth',
#     ),
#     decode_head=dict(type='UPerHead', ...),
# )
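Before launching a run with a config like the one above, a few cheap consistency checks can catch mistakes early. The sketch below is a hypothetical helper, not part of MMSegmentation or InternVL. It assumes that out_indices are zero-based block indices, that img_size should be divisible by patch_size, and that embed_dim must split evenly across heads.

```python
def check_backbone_cfg(cfg):
    """Return a list of human-readable problems found in a backbone config dict.
    Hypothetical helper; the rules encode assumptions stated in the text."""
    problems = []
    if cfg["embed_dim"] % cfg["num_heads"] != 0:
        problems.append("embed_dim is not divisible by num_heads")
    if cfg["img_size"] % cfg["patch_size"] != 0:
        problems.append("img_size is not divisible by patch_size")
    bad = [i for i in cfg["out_indices"] if not 0 <= i < cfg["depth"]]
    if bad:
        problems.append(f"out_indices outside [0, depth): {bad}")
    return problems

backbone_cfg = dict(
    type="InternViT6B",
    pretrain_size=224,
    img_size=448,
    patch_size=14,
    embed_dim=3200,
    depth=48,
    num_heads=25,
    out_indices=[7, 15, 23, 47],
)
problems = check_backbone_cfg(backbone_cfg)
print(problems or "config looks consistent")
```

A check like this is most useful when depth or out_indices is changed independently, for example when experimenting with a truncated backbone.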