Principle:OpenGVLab InternVL Layer Wise LR Decay

Knowledge Sources	OpenGVLab_InternVL
Domains	Training, Optimization, Fine-tuning
Last Updated	2026-02-07 14:00 GMT

Overview

The layer-wise learning rate decay principle assigns exponentially smaller learning rates to pretrained layers closer to the input, which limits catastrophic forgetting of early visual features during fine-tuning.

Description

When fine-tuning deep Vision Transformer backbones (such as InternViT-6B with 48 layers), using a uniform learning rate for all layers can cause catastrophic forgetting of pretrained features, especially in the early layers that capture low-level visual patterns. Layer-wise LR decay addresses this by scaling each layer's learning rate by a factor that shrinks exponentially with the layer's distance from the output:

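lr(layer) = base_lr * decay_rate^(num_layers - layer_id - 1)

Here layer_id runs from 0 (embeddings) to num_layers - 1 (non-backbone parameters), so num_layers equals the number of transformer blocks plus 2 (50 for a 48-block backbone), and decay_rate is a constant between 0 and 1.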

This means:

Early layers (embeddings, first transformer blocks) receive the smallest learning rates, preserving their pretrained features.
Later layers receive progressively larger learning rates, allowing them to adapt more aggressively to the downstream task.
The decode head and other task-specific parameters receive the full base learning rate.
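
As a rough illustration of the formula above, the following sketch computes the learning rate for every layer ID of a 48-block backbone. The values of base_lr and decay_rate are hypothetical and chosen only for the example; the helper name layer_lr is likewise an assumption, not part of the InternVL code.

```python
def layer_lr(base_lr, decay_rate, num_layers, layer_id):
    """Learning rate for one layer ID under layer-wise LR decay."""
    return base_lr * decay_rate ** (num_layers - layer_id - 1)


depth = 48               # transformer blocks in the backbone
num_layers = depth + 2   # + embedding layer (ID 0) + non-backbone group (last ID)
base_lr = 1e-4           # hypothetical value, for illustration only
decay_rate = 0.9         # hypothetical value, for illustration only

lrs = [layer_lr(base_lr, decay_rate, num_layers, i) for i in range(num_layers)]

# The embeddings get the smallest rate; the last group gets base_lr unchanged.
print(f"layer 0 (embeddings):   {lrs[0]:.3e}")
print(f"layer 1 (block 0):      {lrs[1]:.3e}")
print(f"layer {num_layers - 2} (block {depth - 1}):    {lrs[-2]:.3e}")
print(f"layer {num_layers - 1} (decode head):  {lrs[-1]:.3e}")
```

With a decay rate this close to 1 the per-layer change is small, but it compounds over 49 steps, so the embeddings end up training far more slowly than the head.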

Layer assignment maps parameters to layer IDs: embedding parameters (cls_token, pos_embed, patch_embed) go to layer 0, transformer block N goes to layer N+1, and non-backbone parameters go to the last layer. Within each layer, parameters are further split into a weight-decay group and a no-decay group; bias terms and other 1D parameters fall into the no-decay group.
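
The assignment and grouping described above can be sketched in plain Python as follows. This is a minimal illustration, not the CustomLayerDecayOptimizerConstructor itself: the parameter naming scheme ("backbone.layers.N. ..."), the function names, and the toy parameter list are all assumptions made for the example. Only the no-decay rule stated above (bias terms and 1D parameters) is implemented.

```python
def get_layer_id(name, num_layers):
    """Map a parameter name to a layer ID (hypothetical naming scheme)."""
    if name in ("backbone.cls_token", "backbone.pos_embed"):
        return 0
    if name.startswith("backbone.patch_embed"):
        return 0
    if name.startswith("backbone.layers."):
        block = int(name.split(".")[2])  # "backbone.layers.<N>. ..."
        return block + 1
    return num_layers - 1  # non-backbone parameters, e.g. the decode head


def build_groups(named_shapes, base_lr, weight_decay, decay_rate, depth):
    """Group parameter names by (layer ID, decay / no_decay)."""
    num_layers = depth + 2
    groups = {}
    for name, shape in named_shapes:
        no_decay = len(shape) == 1 or name.endswith(".bias")
        layer_id = get_layer_id(name, num_layers)
        key = (layer_id, "no_decay" if no_decay else "decay")
        if key not in groups:
            scale = decay_rate ** (num_layers - layer_id - 1)
            groups[key] = {
                "params": [],
                "lr": base_lr * scale,
                "weight_decay": 0.0 if no_decay else weight_decay,
            }
        groups[key]["params"].append(name)
    return groups


# Toy model with two transformer blocks (names and shapes are made up).
toy_params = [
    ("backbone.cls_token", (1, 1, 8)),
    ("backbone.layers.0.attn.qkv.weight", (24, 8)),
    ("backbone.layers.1.mlp.fc1.bias", (32,)),
    ("decode_head.conv.weight", (4, 8, 1, 1)),
]
groups = build_groups(toy_params, base_lr=1.0, weight_decay=0.05,
                      decay_rate=0.5, depth=2)
for key, group in sorted(groups.items()):
    print(key, group["lr"], group["weight_decay"], group["params"])
```

In a real training setup the lists would hold the parameter tensors rather than their names, and each group dictionary could then be passed to an optimizer that accepts per-group lr and weight_decay settings.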

Usage

Apply this principle when fine-tuning pretrained ViT backbones for downstream tasks like semantic segmentation, where preserving low-level visual features is critical.

Theoretical Basis

Layer-wise learning rate decay was popularized for vision transformers by the BEiT paper (Bao et al., 2021) and has become standard practice for fine-tuning large pretrained models of this kind. The exponential decay schedule is motivated by the observation that early layers learn more universal features while later layers are more task-specific, so early layers need less adaptation.

Related Pages

Implementation:OpenGVLab_InternVL_CustomLayerDecayOptimizerConstructor
