Implementation:OpenGVLab InternVL InternViT 6B Segmentation Backbone

From Leeroopedia


Knowledge Sources
Domains: Vision Transformer, Semantic Segmentation, Backbone Architecture
Last Updated: 2026-02-07 14:00 GMT

Overview

This module implements InternViT-6B, a 6-billion-parameter vision transformer, as an MMSegmentation-compatible backbone for semantic segmentation.

Description

The intern_vit_6b.py file defines the complete InternViT6B architecture, a 6-billion-parameter vision transformer registered as an MMSegmentation backbone via @BACKBONES.register_module(). The architecture features:

Core Components:

  • PatchEmbed: Converts 2D images to patch embeddings using a convolutional projection (default patch_size=14)
  • Block: Transformer blocks with pre-norm architecture using RMSNorm (with optional Apex FusedRMSNorm), LayerScale for residual scaling, DropPath for stochastic depth, and gradient checkpointing (with_cp)
  • Attention: Multi-head self-attention with two code paths (naive attention and FlashAttention) and optional QK normalization for training stability at scale
  • Mlp: Standard two-layer MLP with GELU activation
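The pre-norm residual update used by each Block can be illustrated on a single token vector. The following is a minimal pure-Python sketch, not the module's code: the helper names `rms_norm` and `layer_scale_residual` and the `eps` default are assumptions made for illustration, and the sub-layer (attention or MLP) is replaced by an identity stand-in.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm on one token: x / sqrt(mean(x^2) + eps), scaled per channel.

    The eps default is an assumption for this sketch.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def layer_scale_residual(x, branch_out, gamma):
    """Residual update x + gamma * branch_out with a per-channel scale gamma."""
    return [xi + g * bi for xi, bi, g in zip(x, branch_out, gamma)]


x = [3.0, 4.0]
# Pre-norm: normalize first, then feed the sub-layer (identity stand-in here).
normed = rms_norm(x, weight=[1.0, 1.0], eps=0.0)
# LayerScale with gamma initialised to init_values=0.1 damps the branch output.
out = layer_scale_residual(x, normed, gamma=[0.1, 0.1])
```

With a small `gamma`, each block starts close to the identity mapping, which is the usual motivation for LayerScale in very deep transformers.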

Architecture Configuration (defaults):

  • embed_dim=3200, num_heads=25, depth=48, mlp_ratio=4, init_values=0.1
  • RMSNorm for block normalization, QK normalization enabled
  • LayerScale with configurable force_fp32 mode
  • Learnable CLS token and position embeddings
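As a sanity check on the defaults above, the per-head dimension and a rough parameter count can be derived by hand. This sketch assumes the standard ViT layer layout (a fused QKV projection, an output projection, and a two-layer MLP) and ignores biases, normalization weights, LayerScale, and embeddings; `estimate_block_params` is a hypothetical helper, not part of the module.

```python
def estimate_block_params(embed_dim, mlp_ratio):
    """Approximate weight count of one transformer block (weights only)."""
    attn = 4 * embed_dim * embed_dim              # QKV (3*d*d) + output projection (d*d)
    mlp = 2 * mlp_ratio * embed_dim * embed_dim   # d -> mlp_ratio*d -> d
    return attn + mlp


embed_dim, num_heads, depth, mlp_ratio = 3200, 25, 48, 4

head_dim = embed_dim // num_heads
total = depth * estimate_block_params(embed_dim, mlp_ratio)

print(head_dim)  # 128
print(total)     # 5898240000, i.e. roughly 5.9 billion
```

The estimate lands close to the "6B" in the model name, which suggests the transformer blocks account for nearly all of the parameters.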

Key Features:

  • Multi-level feature extraction via configurable out_indices (default [7, 11, 15, 23]) for UPerNet-style decoders
  • Optional FPN neck with ConvTranspose2d upsampling layers (up1 through up4) producing multi-scale feature maps
  • Bicubic interpolation of the position embeddings when the input resolution differs from pretrain_size
  • Patch embedding interpolation at init_weights time when the configured patch_size differs from that of the pretrained weights
  • BFloat16 default precision (self.to(torch.bfloat16) at initialization)
  • freeze_vit support for frozen backbone evaluation
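To make the position embedding interpolation concrete, the sketch below works out the source and target token grids for a given pretrain_size, img_size, and patch_size. It is an illustration only: `pos_embed_resize_plan` is a hypothetical helper, the divisibility check is this sketch's own choice, and it assumes the CLS token keeps its own position slot that is not interpolated.

```python
def patch_grid(img_size, patch_size):
    """Number of patches along one side of a square image."""
    if img_size % patch_size != 0:
        # Sketch-level choice: reject sizes that do not tile evenly.
        raise ValueError("img_size must be divisible by patch_size")
    return img_size // patch_size


def pos_embed_resize_plan(pretrain_size, img_size, patch_size):
    """Describe the grid-to-grid resize needed for the patch position embeddings."""
    src = patch_grid(pretrain_size, patch_size)
    dst = patch_grid(img_size, patch_size)
    return {
        "src_grid": (src, src),
        "dst_grid": (dst, dst),
        "src_tokens": src * src + 1,  # +1 for the CLS token (assumed)
        "dst_tokens": dst * dst + 1,
        "needs_interpolation": src != dst,
    }


plan = pos_embed_resize_plan(pretrain_size=224, img_size=448, patch_size=14)
print(plan["src_grid"], plan["dst_grid"])      # (16, 16) (32, 32)
print(plan["src_tokens"], plan["dst_tokens"])  # 257 1025
```

In this example the 16x16 grid of patch position embeddings would be reshaped to 2D, resized bicubically to 32x32, and flattened again before being added to the tokens.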

Usage

Use this module as the vision backbone in MMSegmentation configurations for semantic segmentation experiments with InternVL. It is the central backbone that all InternVL segmentation configs build upon, typically paired with UPerNet or similar decoders.

Code Reference

Source Location

Signature

@BACKBONES.register_module()
class InternViT6B(BaseModule):
    def __init__(self, in_chans=3, patch_size=14, img_size=224,
                 pretrain_size=224, qkv_bias=False, drop_path_rate=0.0,
                 embed_dim=3200, num_heads=25, mlp_ratio=4,
                 init_values=0.1, qk_normalization=True, depth=48,
                 use_flash_attn=True, with_cp=True,
                 layerscale_force_fp32=False,
                 out_indices=[7, 11, 15, 23],
                 freeze_vit=False, with_fpn=False,
                 with_final_norm=False, pretrained=None): ...

    def forward(self, x) -> list[Tensor]: ...

class Attention(nn.Module):
    def __init__(self, dim, num_heads=8, qkv_bias=False,
                 use_flash_attn=False, qk_normalization=False): ...

class Block(nn.Module):
    def __init__(self, dim, num_heads, mlp_ratio=4.,
                 drop_path=0., init_values=None,
                 use_flash_attn=False, with_cp=False): ...

Import

from mmseg_custom.models.backbones.intern_vit_6b import InternViT6B

I/O Contract

Inputs

Name | Type | Required | Description
x | Tensor | Yes | Input image tensor of shape (B, C, H, W)
img_size | int | No | Input image size (default: 224)
patch_size | int | No | Patch size for tokenization (default: 14)
embed_dim | int | No | Embedding dimension (default: 3200)
depth | int | No | Number of transformer blocks (default: 48)
out_indices | list[int] | No | Block indices for multi-level feature output (default: [7, 11, 15, 23])
pretrained | str | No | Path to pretrained checkpoint for weight initialization
freeze_vit | bool | No | Freeze all backbone parameters (default: False)
with_fpn | bool | No | Add FPN neck with ConvTranspose2d upsampling (default: False)

Outputs

Name | Type | Description
features | list[Tensor] | List of feature maps from specified out_indices, each of shape (B, embed_dim, H/patch_size, W/patch_size), cast to float32. With FPN: 4 multi-scale feature maps at different resolutions.
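The output contract can be checked with simple shape arithmetic. The sketch below is hypothetical: `feature_shapes` is not part of the module, and the FPN scale factors (4, 2, 1, 0.5) are an assumption borrowed from common ViT feature-pyramid necks; this page only states that the FPN outputs have different resolutions.

```python
# Assumed per-level scale factors for the FPN neck (not confirmed by this page).
ASSUMED_FPN_SCALES = (4, 2, 1, 0.5)


def feature_shapes(batch, height, width, patch_size=14, embed_dim=3200,
                   num_levels=4, fpn_scales=None):
    """Expected (B, C, H', W') shape of each returned feature map."""
    gh, gw = height // patch_size, width // patch_size
    if fpn_scales is None:
        # Without FPN every selected block yields the same patch-grid resolution.
        return [(batch, embed_dim, gh, gw)] * num_levels
    return [(batch, embed_dim, int(gh * s), int(gw * s)) for s in fpn_scales]


plain = feature_shapes(2, 448, 448)
pyramid = feature_shapes(2, 448, 448, fpn_scales=ASSUMED_FPN_SCALES)

print(plain[0])                    # (2, 3200, 32, 32)
print([s[2] for s in pyramid])     # [128, 64, 32, 16]
```

Under these assumptions a 448x448 input gives a 32x32 patch grid, and the pyramid levels correspond to effective strides of 3.5, 7, 14, and 28 pixels.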

Usage Examples

Basic Usage

# In MMSegmentation config:
# model = dict(
#     type='EncoderDecoder',
#     backbone=dict(
#         type='InternViT6B',
#         pretrain_size=224,
#         img_size=448,
#         patch_size=14,
#         embed_dim=3200,
#         depth=48,
#         num_heads=25,
#         mlp_ratio=4,
#         qkv_bias=False,
#         qk_normalization=True,
#         use_flash_attn=True,
#         with_cp=True,
#         out_indices=[7, 15, 23, 47],
#         with_fpn=True,
#         freeze_vit=True,
#         pretrained='path/to/internvit-6b.pth',
#     ),
#     decode_head=dict(type='UPerHead', ...),
# )
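The configuration above selects blocks [7, 15, 23, 47], which spans the full depth of 48, whereas the constructor default [7, 11, 15, 23] stops at the midpoint. A small check like the one below can catch an out_indices list that does not fit the chosen depth before a large model is built. `check_out_indices` is a hypothetical helper, and it assumes block indices are 0-based and listed in increasing order.

```python
def check_out_indices(out_indices, depth):
    """Validate out_indices as strictly increasing, 0-based block indices."""
    indices = list(out_indices)
    if not indices:
        raise ValueError("out_indices must not be empty")
    if indices != sorted(set(indices)):
        raise ValueError("out_indices must be strictly increasing")
    if indices[0] < 0 or indices[-1] >= depth:
        raise ValueError(f"out_indices must lie in [0, {depth - 1}]")
    return indices


backbone_cfg = dict(type='InternViT6B', depth=48, out_indices=[7, 15, 23, 47])
checked = check_out_indices(backbone_cfg['out_indices'], backbone_cfg['depth'])
print(checked)  # [7, 15, 23, 47]
```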
