Principle:OpenGVLab InternVL Segmentation Decode Head

From Leeroopedia


Knowledge Sources
Domains: Segmentation, Decode Head, Model Architecture
Last Updated: 2026-02-07 14:00 GMT

Overview

This principle describes a configurable decode (classification) head for semantic segmentation that supports linear probing, stays numerically stable under mixed-precision training, and exposes flexible convolution settings.

Description

In semantic segmentation, the decode head is the component that transforms backbone features into per-pixel class predictions. This principle addresses specific challenges when using large vision transformer backbones like InternViT-6B:

  • Linear probing mode (zero-conv): Setting num_convs=0 replaces the convolutional layers with an identity mapping, so the decode head reduces to a per-pixel linear classifier. This isolates the quality of the pretrained backbone features, since no additional learned transformation sits between them and the classifier.
  • Mixed-precision stability: When the backbone runs in BFloat16 for efficiency, the decode head explicitly casts its input features to FP32 before processing them. This prevents numerical instability in the classification layer, which matters because the segmentation loss is sensitive to precision.
  • Configurable architecture: The head supports a variable number of convolution layers, configurable kernel size and dilation rate, and an optional concat_input setting that concatenates the input features with the processed features before classification.
  • Optional normalization: SyncBatchNorm can be added for multi-GPU consistency, controlled by the with_norm parameter.
  • Force-registration: The custom head overrides the default MMSeg FCNHead so that these InternVL-specific modifications are always active.
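To make the zero-conv case concrete, the following is a minimal pure-Python sketch of what a per-pixel linear classifier computes. It is not the InternVL implementation: the function names, the nested-list feature layout, and the toy weights are all assumptions chosen for illustration, and a real head would use a framework convolution instead of Python loops.

```python
from typing import List

# Hypothetical layout for this sketch: features[channel][row][col]
FeatureMap = List[List[List[float]]]


def linear_probe_head(features: FeatureMap,
                      weight: List[List[float]],
                      bias: List[float]) -> FeatureMap:
    """Apply the same linear classifier at every pixel.

    weight has shape [num_classes][num_channels]; the result has shape
    [num_classes][height][width]. No convolution or non-linearity is applied
    before the classifier, which mirrors the num_convs=0 (identity) case.
    """
    num_channels = len(features)
    height, width = len(features[0]), len(features[0][0])
    logits = []
    for k, w_k in enumerate(weight):
        plane = [
            [
                # float(...) stands in for the explicit cast to full precision
                bias[k] + sum(w_k[c] * float(features[c][y][x])
                              for c in range(num_channels))
                for x in range(width)
            ]
            for y in range(height)
        ]
        logits.append(plane)
    return logits


def argmax_per_pixel(logits: FeatureMap) -> List[List[int]]:
    """Pick the highest-scoring class index at each pixel."""
    height, width = len(logits[0]), len(logits[0][0])
    return [
        [max(range(len(logits)), key=lambda k: logits[k][y][x])
         for x in range(width)]
        for y in range(height)
    ]


# Toy input: 2 channels, a 1x2 image, 2 classes, identity weights.
toy_features = [[[1.0, 0.0]], [[0.0, 2.0]]]
toy_logits = linear_probe_head(toy_features, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
print(argmax_per_pixel(toy_logits))  # [[0, 1]]
```

Because the only learned parameters are the classifier weights and biases, any change in segmentation quality under this setup can be attributed to the backbone features.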

Usage

Apply this principle when building segmentation decode heads for InternVL backbone models, especially for linear probing experiments and when using mixed-precision training.
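As a rough illustration of a linear probing setup, the sketch below builds an MMSeg-style decode head config as a plain Python dict. The helper name make_linear_probe_head_cfg is hypothetical, the channel width and class count are placeholder values to be replaced with those of the actual backbone and dataset, and the exact keys accepted by the InternVL head are assumed from the parameters named on this page (num_convs, concat_input, with_norm) plus standard MMSeg decode head fields.

```python
def make_linear_probe_head_cfg(embed_dim: int, num_classes: int) -> dict:
    """Return an MMSeg-style decode head config for linear probing (a sketch)."""
    return dict(
        type="FCNHead",
        in_channels=embed_dim,
        # With num_convs=0 no layer changes the channel count, so the head's
        # working width is kept equal to the input width.
        channels=embed_dim,
        num_convs=0,          # identity instead of conv layers
        concat_input=False,   # assumption: avoid an extra fusion conv so the head stays linear
        with_norm=True,       # optional SyncBatchNorm, per the description above
        num_classes=num_classes,
        loss_decode=dict(type="CrossEntropyLoss", use_sigmoid=False, loss_weight=1.0),
    )


# Placeholder values for illustration only.
head_cfg = make_linear_probe_head_cfg(embed_dim=3200, num_classes=150)
print(head_cfg["num_convs"], head_cfg["in_channels"] == head_cfg["channels"])
```

In a full config this dict would sit under the segmentor's decode_head entry, alongside a frozen backbone; those surrounding settings are outside the scope of this sketch.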

Theoretical Basis

Linear probing is a standard evaluation protocol for pretrained representations: because the decoder has almost no capacity of its own, the result reflects feature quality rather than decoder strength. The FP32 cast addresses a known numerical issue. BFloat16 keeps only 8 bits of significand precision, so logits that differ slightly can round to the same value before the cross-entropy loss is computed.
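The precision argument can be illustrated with a small standard-library sketch that emulates BFloat16 rounding on float32 bit patterns and compares the resulting cross-entropy values. The helper names and the example logits are assumptions made for this illustration; the sketch ignores NaN and infinity handling and only rounds the logits themselves, whereas real mixed-precision training also rounds intermediate results.

```python
import math
import struct


def to_bfloat16(x: float) -> float:
    """Round x to the nearest bfloat16 value (round-to-nearest-even).

    bfloat16 is the upper 16 bits of a float32, so rounding is done on the
    float32 bit pattern. NaN and infinity are not handled in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def cross_entropy(logits, target):
    """Numerically stable cross-entropy for a single pixel."""
    m = max(logits)
    log_sum = math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - (logits[target] - m)


full_logits = [10.01, 10.02]                      # class 1 is slightly preferred
bf16_logits = [to_bfloat16(v) for v in full_logits]

print(bf16_logits)                                # [10.0, 10.0]
print(cross_entropy(full_logits, target=1))       # slightly below log(2)
print(cross_entropy(bf16_logits, target=1))       # log(2): the preference is lost
```

Near 10.0 adjacent BFloat16 values are 0.0625 apart, so both logits collapse to 10.0 and the loss can no longer distinguish the two classes. Casting the features to FP32 before the classification layer avoids this kind of collapse in the logits.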
