Implementation:OpenGVLab InternVL InternViT 6B Segmentation Backbone

Knowledge Sources	OpenGVLab_InternVL
Domains	Vision Transformer, Semantic Segmentation, Backbone Architecture
Last Updated	2026-02-07 14:00 GMT

Overview

This module implements the InternViT-6B vision transformer as an MMSegmentation-compatible backbone for semantic segmentation.

Description

The intern_vit_6b.py file defines the complete InternViT6B architecture, a 6-billion-parameter vision transformer registered as an MMSegmentation backbone via @BACKBONES.register_module(). The architecture features:

Core Components:

PatchEmbed: Converts 2D images to patch embeddings using a convolutional projection (default patch_size=14)
Block: Transformer blocks with pre-norm architecture using RMSNorm (with optional Apex FusedRMSNorm), LayerScale for residual scaling, DropPath for stochastic depth, and gradient checkpointing (with_cp)
Attention: Multi-head self-attention with two code paths (naive attention and FlashAttention) and optional QK normalization for training stability at scale
Mlp: Standard two-layer MLP with GELU activation
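
As a rough illustration of the pre-norm pattern used by Block, the sketch below applies RMSNorm, a stand-in sublayer, and LayerScale to a single token vector in pure Python. The function names, the eps value, and the identity sublayer are assumptions made for illustration; the actual module works on PyTorch tensors, also applies DropPath, and may differ in detail. The 0.1 scale assumes that init_values sets the initial LayerScale value.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Divide x by its root-mean-square, then apply a learned per-channel weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * inv_rms for v, w in zip(x, weight)]

def pre_norm_residual(x, sublayer, norm_weight, layer_scale):
    """Compute x + layer_scale * sublayer(rms_norm(x)) for one token."""
    normed = rms_norm(x, norm_weight)
    out = sublayer(normed)
    return [xi + g * oi for xi, g, oi in zip(x, layer_scale, out)]

# Toy 4-dim token with an identity sublayer (a placeholder for Attention or Mlp).
token = [3.0, 4.0, 0.0, 0.0]
ones = [1.0] * 4      # RMSNorm weight at initialisation (assumed)
gamma = [0.1] * 4     # LayerScale values, assuming init_values=0.1
result = pre_norm_residual(token, lambda v: v, ones, gamma)
```

Because the sublayer output is scaled by a small gamma before being added back, each block starts out close to the identity mapping, which is the usual motivation for LayerScale in very deep transformers.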

Architecture Configuration (defaults):

embed_dim=3200, num_heads=25, depth=48, mlp_ratio=4, init_values=0.1
RMSNorm for block normalization, QK normalization enabled
LayerScale with configurable force_fp32 mode
Learnable CLS token and position embeddings

Key Features:

Multi-level feature extraction via configurable out_indices (e.g., [7, 11, 15, 23]) for UPerNet-style decoders
Optional FPN neck with ConvTranspose2d upsampling layers (up1 through up4) producing multi-scale feature maps
Bicubic interpolation of position embeddings for input resolutions that differ from pretrain_size
Patch embedding interpolation for different patch sizes at init_weights time
BFloat16 default precision (self.to(torch.bfloat16) at initialization)
freeze_vit support for frozen backbone evaluation
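
To make the position-embedding interpolation concrete, the sketch below computes the patch grids involved when a model pretrained at 224 pixels is run at 448 pixels. The helper token_grid is hypothetical, and the sketch assumes square inputs whose side is a multiple of patch_size. It also assumes, as is common for ViTs, that the CLS position embedding is kept as-is and only the patch grid is resized.

```python
def token_grid(image_size, patch_size=14):
    """Return (patches per side, total tokens including one CLS token)."""
    if image_size % patch_size != 0:
        raise ValueError("this sketch assumes image_size is a multiple of patch_size")
    side = image_size // patch_size
    return side, side * side + 1

pretrain_side, pretrain_tokens = token_grid(224)  # grid stored in the checkpoint
target_side, target_tokens = token_grid(448)      # grid needed at run time

# Interpolation is only needed when the two grids differ.
needs_interpolation = pretrain_side != target_side
print(pretrain_side, pretrain_tokens, target_side, target_tokens, needs_interpolation)
# prints: 16 257 32 1025 True
```

In this case the 16x16 grid of patch position embeddings would be resized to 32x32 before being added to the patch tokens.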

Usage

Use this module as the vision backbone in MMSegmentation configurations for semantic segmentation experiments with InternVL. It is the central backbone that all InternVL segmentation configs build upon, typically paired with UPerNet or similar decoders.

Code Reference

Source Location

Repository: OpenGVLab_InternVL
File: segmentation/mmseg_custom/models/backbones/intern_vit_6b.py
Lines: 1-417

Signature

@BACKBONES.register_module()
class InternViT6B(BaseModule):
    def __init__(self, in_chans=3, patch_size=14, img_size=224,
                 pretrain_size=224, qkv_bias=False, drop_path_rate=0.0,
                 embed_dim=3200, num_heads=25, mlp_ratio=4,
                 init_values=0.1, qk_normalization=True, depth=48,
                 use_flash_attn=True, with_cp=True,
                 layerscale_force_fp32=False,
                 out_indices=[7, 11, 15, 23],
                 freeze_vit=False, with_fpn=False,
                 with_final_norm=False, pretrained=None): ...

    def forward(self, x) -> list[Tensor]: ...

class Attention(nn.Module):
    def __init__(self, dim, num_heads=8, qkv_bias=False,
                 use_flash_attn=False, qk_normalization=False): ...

class Block(nn.Module):
    def __init__(self, dim, num_heads, mlp_ratio=4.,
                 drop_path=0., init_values=None,
                 use_flash_attn=False, with_cp=False): ...

Import

from mmseg_custom.models.backbones.intern_vit_6b import InternViT6B

I/O Contract

Inputs

Name	Type	Required	Description
x	Tensor	Yes	Input image tensor of shape (B, C, H, W), passed to forward(); the remaining rows are constructor arguments
img_size	int	No	Input image size (default: 224)
patch_size	int	No	Patch size for tokenization (default: 14)
embed_dim	int	No	Embedding dimension (default: 3200)
depth	int	No	Number of transformer blocks (default: 48)
out_indices	list[int]	No	Block indices for multi-level feature output (default: [7, 11, 15, 23])
pretrained	str	No	Path to pretrained checkpoint for weight initialization
freeze_vit	bool	No	Freeze all backbone parameters (default: False)
with_fpn	bool	No	Add FPN neck with ConvTranspose2d upsampling (default: False)

Outputs

Name	Type	Description
features	list[Tensor]	One feature map per entry in out_indices, each of shape (B, embed_dim, H/patch_size, W/patch_size), cast to float32. With with_fpn=True: four multi-scale feature maps at different spatial resolutions.
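
The output contract for the case with_fpn=False can be expressed as a small shape calculation. The helper backbone_output_shapes is hypothetical and not part of the module; it only restates the shape formula from the table, using integer division for the spatial size.

```python
def backbone_output_shapes(batch, height, width, out_indices,
                           patch_size=14, embed_dim=3200):
    """Expected shape of each returned feature map when with_fpn=False."""
    h, w = height // patch_size, width // patch_size
    # Every selected block yields a map at the same patch-grid resolution.
    return [(batch, embed_dim, h, w) for _ in out_indices]

# A non-square 448x896 input with the default out_indices.
shapes = backbone_output_shapes(2, 448, 896, out_indices=[7, 11, 15, 23])
print(len(shapes), shapes[0])  # prints: 4 (2, 3200, 32, 64)
```

Without the FPN neck all levels share one resolution, so a decoder that expects a feature pyramid would need with_fpn=True or its own rescaling.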

Usage Examples

Basic Usage

# In MMSegmentation config:
# model = dict(
#     type='EncoderDecoder',
#     backbone=dict(
#         type='InternViT6B',
#         pretrain_size=224,
#         img_size=448,
#         patch_size=14,
#         embed_dim=3200,
#         depth=48,
#         num_heads=25,
#         mlp_ratio=4,
#         qkv_bias=False,
#         qk_normalization=True,
#         use_flash_attn=True,
#         with_cp=True,
#         out_indices=[7, 15, 23, 47],
#         with_fpn=True,
#         freeze_vit=True,
#         pretrained='path/to/internvit-6b.pth',
#     ),
#     decode_head=dict(type='UPerHead', ...),
# )
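
Before launching a run, it can help to sanity-check a backbone config like the one above. The helper below is a hypothetical sketch, not part of the repository or of MMSegmentation. It assumes that out_indices are zero-based block indices (so 47 is the last of 48 blocks) and that img_size should be a multiple of patch_size.

```python
def check_backbone_cfg(cfg):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    depth = cfg["depth"]
    bad = [i for i in cfg["out_indices"] if not 0 <= i < depth]
    if bad:
        problems.append(f"out_indices {bad} fall outside [0, {depth})")
    if cfg["embed_dim"] % cfg["num_heads"] != 0:
        problems.append("embed_dim is not divisible by num_heads")
    if cfg["img_size"] % cfg["patch_size"] != 0:
        problems.append("img_size is not a multiple of patch_size")
    return problems

# Subset of the backbone settings from the config shown above.
backbone = dict(
    type="InternViT6B", pretrain_size=224, img_size=448, patch_size=14,
    embed_dim=3200, depth=48, num_heads=25, out_indices=[7, 15, 23, 47],
)
issues = check_backbone_cfg(backbone)
print(issues)  # prints: []
```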
