Principle:OpenGVLab InternVL Segmentation Decode Head
| Knowledge Sources | |
|---|---|
| Domains | Segmentation, Decode Head, Model Architecture |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
The segmentation decode head principle defines customizable classification heads for semantic segmentation that support linear probing, mixed-precision stability, and flexible convolution configurations.
Description
In semantic segmentation, the decode head is the component that transforms backbone features into per-pixel class predictions. This principle addresses challenges that arise with large vision transformer backbones such as InternViT-6B:
- Linear probing mode (zero-conv): Setting num_convs=0 replaces all convolutional layers with an identity mapping, so the only learned layer left is the final classifier and the decode head becomes a per-pixel linear classifier. This makes it possible to evaluate the quality of pretrained backbone features without any additional learned transformations.
- Mixed-precision stability: When the backbone runs in BFloat16 for efficiency, the decode head explicitly casts its input features to FP32 before processing, which prevents numerical instability in the classification layer. This matters because segmentation loss computation is sensitive to numerical precision.
- Configurable architecture: The head supports variable numbers of convolution layers, kernel sizes, dilation rates, and optional concat_input (concatenating input features with processed features before classification).
- Optional normalization: SyncBatchNorm can optionally be added to keep normalization statistics consistent across GPUs, controlled by the with_norm parameter.
- Force-registration: The custom head is force-registered in place of the default MMSeg FCNHead, so the InternVL-specific modifications are always active.
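The points above can be sketched as a small PyTorch module. This is a minimal illustration under stated assumptions, not the InternVL source: the class name `LinearProbeFCNHead` is hypothetical, plain `BatchNorm2d` stands in for SyncBatchNorm so the sketch runs on CPU, and the position of the normalization layer and the ReLU activations are guesses.

```python
import torch
import torch.nn as nn


class LinearProbeFCNHead(nn.Module):
    """Hypothetical FCN-style decode head; a sketch, not the InternVL implementation."""

    def __init__(self, in_channels, channels, num_classes, num_convs=0,
                 kernel_size=3, dilation=1, concat_input=False, with_norm=False):
        super().__init__()
        if num_convs == 0:
            # Linear probing: features pass through unchanged, so widths must match.
            assert in_channels == channels
            self.convs = nn.Identity()
        else:
            padding = (kernel_size // 2) * dilation  # keeps spatial size for odd kernels
            layers = []
            for i in range(num_convs):
                layers += [
                    nn.Conv2d(in_channels if i == 0 else channels, channels,
                              kernel_size, padding=padding, dilation=dilation),
                    nn.ReLU(inplace=True),
                ]
            self.convs = nn.Sequential(*layers)
        # Assumption: BatchNorm2d replaces SyncBatchNorm for single-process use.
        self.norm = nn.BatchNorm2d(in_channels) if with_norm else nn.Identity()
        self.concat_input = concat_input
        if concat_input:
            # Fuse the raw input with the processed features before classification.
            self.conv_cat = nn.Conv2d(in_channels + channels, channels,
                                      kernel_size, padding=kernel_size // 2)
        # Per-pixel linear classifier: a 1x1 convolution over the channel dimension.
        self.conv_seg = nn.Conv2d(channels, num_classes, kernel_size=1)

    def forward(self, feats):
        # Cast backbone features (possibly BFloat16) to FP32 before any head computation.
        x = feats.to(torch.float32)
        x = self.norm(x)
        out = self.convs(x)
        if self.concat_input:
            out = self.conv_cat(torch.cat([x, out], dim=1))
        return self.conv_seg(out)


# Zero-conv head fed with BFloat16 features, as a backbone in mixed precision might produce.
head = LinearProbeFCNHead(in_channels=32, channels=32, num_classes=5)
feats = torch.randn(2, 32, 8, 8).to(torch.bfloat16)
logits = head(feats)
print(logits.shape, logits.dtype)
```

With num_convs=0 the only trainable parameters are those of `conv_seg`, which is what makes the head a linear probe: any gain in segmentation quality has to come from the backbone features themselves.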
Usage
Apply this principle when building segmentation decode heads for InternVL backbone models, especially for linear probing experiments and when using mixed-precision training.
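For a linear probing experiment, the decode head entry of an MMSeg-style config might look like the following. This is an illustrative fragment only: the parameter names num_convs, concat_input, and with_norm come from the description above, while every numeric value is a placeholder rather than a setting taken from a released InternVL config.

```python
# Illustrative decode_head entry in the MMSeg dict-config style.
# All numeric values are placeholders chosen for the example.
decode_head = dict(
    type='FCNHead',      # expected to resolve to the force-registered custom head
    in_channels=3200,    # assumed backbone feature width
    channels=3200,       # same as in_channels, since no conv layer changes the width
    num_convs=0,         # zero-conv mode: linear probing
    concat_input=False,  # nothing to concatenate when there are no conv layers
    with_norm=True,      # optional SyncBatchNorm
    num_classes=150,     # placeholder class count
)
```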
Theoretical Basis
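Linear probing is a standard evaluation protocol for pretrained representations: because the probe has almost no capacity of its own, the result reflects feature quality rather than the strength of the decoder. The FP32 cast addresses numerical stability issues that arise when BFloat16 features are combined with cross-entropy loss computation. BFloat16 keeps only 8 bits of significand precision, so small differences between values can be rounded away before the loss is computed.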
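The limited resolution of BFloat16 can be seen with the standard library alone. The sketch below is an illustration of the number format, not of the specific instability the head guards against: the helper `to_bfloat16` is hypothetical, it emulates round-to-nearest-even by operating on the FP32 bit pattern, and it is only meant for ordinary finite values.

```python
import struct


def to_bfloat16(x: float) -> float:
    """Round a float to the nearest BFloat16 value (finite inputs only)."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]  # FP32 bit pattern
    # BFloat16 is the top 16 bits of FP32; add a round-to-nearest-even bias first.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000                                   # drop the low 16 bits
    return struct.unpack('<f', struct.pack('<I', bits))[0]


# Two logits that differ slightly at full precision.
logit_a, logit_b = 10.0, 10.03
print(logit_b - logit_a)                            # small but nonzero margin
print(to_bfloat16(logit_b) - to_bfloat16(logit_a))  # 0.0: the margin is rounded away
```

Between 8 and 16, adjacent BFloat16 values are 0.0625 apart, so both logits land on the same representable number. Casting to FP32 before the classifier keeps such margins intact for every computation that follows.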