Principle:OpenGVLab InternVL Layer Wise LR Decay
| Knowledge Sources | |
|---|---|
| Domains | Training, Optimization, Fine-tuning |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
The layer-wise learning rate decay principle assigns exponentially smaller learning rates to earlier pretrained layers, which limits catastrophic forgetting of low-level visual features during fine-tuning.
Description
When fine-tuning deep Vision Transformer backbones (such as InternViT-6B with 48 layers), using a uniform learning rate for all layers can cause catastrophic forgetting of pretrained features, especially in the early layers that capture low-level visual patterns. Layer-wise LR decay addresses this by scaling the base learning rate of each layer by a factor that shrinks exponentially with the layer's distance from the top of the network:
lr(layer) = base_lr * decay_rate^(num_layers - layer_id - 1)
This means:
- Early layers (embeddings, first transformer blocks) receive the smallest learning rates, preserving their pretrained features.
- Later layers receive progressively larger learning rates, allowing them to adapt more aggressively to the downstream task.
- The decode head and other task-specific parameters receive the full base learning rate.
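As a rough illustration of the formula and the three points above, the following sketch computes the per-layer learning rates for a 48-block backbone. The values of `base_lr` and `decay_rate` are assumptions chosen for the example, not values taken from InternVL, and `layer_lr_scales` is a hypothetical helper name. It also assumes that `num_layers` counts one extra group for the embeddings and one for the non-backbone parameters.

```python
def layer_lr_scales(num_layers: int, decay_rate: float) -> list[float]:
    """Return the LR multiplier for every layer ID in [0, num_layers)."""
    return [decay_rate ** (num_layers - layer_id - 1)
            for layer_id in range(num_layers)]


# Assumed setup: 48 transformer blocks, plus one group for the embeddings
# (layer 0) and one for non-backbone parameters (the last layer).
depth = 48
num_layers = depth + 2
base_lr = 1e-4     # illustrative value only
decay_rate = 0.9   # illustrative value only

scales = layer_lr_scales(num_layers, decay_rate)
lrs = [base_lr * s for s in scales]

print(f"embeddings  (layer 0):  {lrs[0]:.3e}")
print(f"first block (layer 1):  {lrs[1]:.3e}")
print(f"last block  (layer {depth}): {lrs[depth]:.3e}")
print(f"decode head (layer {num_layers - 1}): {lrs[-1]:.3e}")
```

The last entry has exponent zero, so the decode head trains at exactly `base_lr`, while every step toward the input multiplies the rate by `decay_rate` once more.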
Layer assignment maps each parameter to a layer ID: embedding parameters (cls_token, pos_embed, patch_embed) go to layer 0, transformer block i (counting from 0) goes to layer i + 1, and non-backbone parameters go to the last layer (num_layers - 1), where the exponent in the formula is zero. Within each layer, parameters are further split into a weight-decay group and a no-decay group; the no-decay group holds bias terms and other 1D parameters.
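The assignment and grouping described above can be sketched in plain Python as follows. This is a minimal sketch, not the InternVL implementation: the parameter-name prefixes (`backbone.blocks.<i>.`, `decode_head.`), the helper names, and the toy shapes are all hypothetical, and the groups store parameter names rather than tensors so that the example runs without a deep learning framework.

```python
import re

EMBED_PREFIXES = ("backbone.cls_token", "backbone.pos_embed", "backbone.patch_embed")


def get_layer_id(name: str, num_layers: int) -> int:
    """Map a parameter name to a layer ID (naming scheme is hypothetical)."""
    if name.startswith(EMBED_PREFIXES):
        return 0                                  # embeddings -> layer 0
    match = re.match(r"backbone\.blocks\.(\d+)\.", name)
    if match:
        return int(match.group(1)) + 1            # block i -> layer i + 1
    return num_layers - 1                         # everything else -> last layer


def build_param_groups(named_shapes, num_layers, base_lr, decay_rate, weight_decay):
    """Group parameters by (layer ID, no-decay flag) and attach lr / weight decay."""
    groups = {}
    for name, shape in named_shapes:
        layer_id = get_layer_id(name, num_layers)
        # Bias terms and 1D parameters (e.g. norm scales) skip weight decay.
        no_decay = len(shape) == 1 or name.endswith(".bias")
        key = (layer_id, no_decay)
        if key not in groups:
            scale = decay_rate ** (num_layers - layer_id - 1)
            groups[key] = {
                "param_names": [],
                "lr": base_lr * scale,
                "weight_decay": 0.0 if no_decay else weight_decay,
            }
        groups[key]["param_names"].append(name)
    return groups


# Toy model with 2 blocks, so num_layers = 2 + 2 = 4 (all values illustrative).
params = [
    ("backbone.cls_token", (1, 1, 8)),
    ("backbone.patch_embed.proj.weight", (8, 3, 14, 14)),
    ("backbone.blocks.0.attn.qkv.weight", (24, 8)),
    ("backbone.blocks.0.attn.qkv.bias", (24,)),
    ("backbone.blocks.1.norm1.weight", (8,)),
    ("decode_head.conv_seg.weight", (4, 8, 1, 1)),
]
groups = build_param_groups(params, num_layers=4, base_lr=1e-4,
                            decay_rate=0.9, weight_decay=0.05)

for (layer_id, no_decay), group in sorted(groups.items()):
    print(layer_id, no_decay, f"{group['lr']:.3e}", group["weight_decay"],
          group["param_names"])
```

In a PyTorch setup, each group would instead carry the tensors under a `params` key alongside `lr` and `weight_decay`, which is the per-group format that optimizers such as `torch.optim.AdamW` accept.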
Usage
Apply this principle when fine-tuning pretrained ViT backbones for downstream tasks like semantic segmentation, where preserving low-level visual features is critical.
Theoretical Basis
For vision transformers, layer-wise learning rate decay was popularized by the BEiT paper (Bao et al., 2021) and has since become common practice for fine-tuning large pretrained backbones. The exponential decay schedule is motivated by the observation that early layers learn more universal features while later layers are more task-specific, so early layers need less adaptation to a new task.