Principle:OpenGVLab InternVL Layer Wise LR Decay

From Leeroopedia


Knowledge Sources
Domains: Training, Optimization, Fine-tuning
Last Updated: 2026-02-07 14:00 GMT

Overview

The layer-wise learning rate decay principle assigns exponentially smaller learning rates to earlier (shallower) pretrained layers, which helps prevent catastrophic forgetting of early visual features during fine-tuning.

Description

When fine-tuning deep Vision Transformer backbones (such as InternViT-6B with 48 layers), a uniform learning rate across all layers can cause catastrophic forgetting of pretrained features, especially in the early layers that capture low-level visual patterns. Layer-wise LR decay addresses this by scaling each layer's learning rate by a factor that shrinks exponentially the further the layer sits from the output:

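lr(layer_id) = base_lr * decay_rate^(num_layers - layer_id - 1)

Here layer_id runs from 0 to num_layers - 1, so the highest layer ID has exponent 0 and trains at the full base_lr, while each step toward layer 0 multiplies the rate by decay_rate (a value below 1).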

This means:

  • Early layers (embeddings, first transformer blocks) receive the smallest learning rates, preserving their pretrained features.
  • Later layers receive progressively larger learning rates, allowing them to adapt more aggressively to the downstream task.
  • The decode head and other task-specific parameters receive the full base learning rate.
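The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the InternVL implementation: the function name `layer_lr_scales` and the example values (`num_blocks=48`, `decay_rate=0.9`, `base_lr=1e-4`) are assumptions chosen for the demonstration, and `num_layers` is assumed to be the number of transformer blocks plus two (one slot for the embeddings and one for the task-specific parameters).

```python
def layer_lr_scales(num_blocks, decay_rate):
    """Return the LR multiplier for each layer ID.

    Assumption: layer IDs run from 0 (embeddings) through
    num_blocks + 1 (task-specific parameters), so num_layers
    is num_blocks + 2.
    """
    num_layers = num_blocks + 2
    return [
        decay_rate ** (num_layers - layer_id - 1)
        for layer_id in range(num_layers)
    ]


# Hypothetical values for illustration only.
base_lr = 1e-4
scales = layer_lr_scales(num_blocks=48, decay_rate=0.9)

# Print the effective learning rate for a few representative layer IDs.
for layer_id in (0, 1, 24, 48, 49):
    print(f"layer_id={layer_id:2d}  lr={base_lr * scales[layer_id]:.3e}")
```

The multiplier is exactly 1 for the highest layer ID and shrinks by a factor of `decay_rate` for every step toward the embeddings, which is why small changes in `decay_rate` have a large effect on the earliest layers of a deep backbone.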

Layer assignment maps each parameter to a layer ID: embedding parameters (cls_token, pos_embed, patch_embed) go to layer 0, transformer block N goes to layer N+1, and non-backbone parameters go to the last layer ID. Within each layer, parameters are further split into a weight-decay group and a no-decay group, with bias terms and 1D parameters (such as normalization weights) placed in the no-decay group.
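The assignment and grouping rules can be sketched as follows. This is a simplified, dependency-free illustration that works on parameter names and shapes rather than real tensors. The parameter naming scheme (`backbone.layers.N.…`, `decode_head.…`), the helper names `get_layer_id` and `build_param_groups`, and all numeric values are assumptions made for the example; a real model may name its parameters differently.

```python
def get_layer_id(name, num_layers):
    """Map a parameter name to a layer ID (naming scheme is assumed)."""
    if name in ("backbone.cls_token", "backbone.pos_embed"):
        return 0
    if name.startswith("backbone.patch_embed"):
        return 0
    if name.startswith("backbone.layers."):
        block = int(name.split(".")[2])  # "backbone.layers.<N>...."
        return block + 1
    return num_layers - 1  # non-backbone parameters, e.g. the decode head


def build_param_groups(named_shapes, num_blocks, base_lr, decay_rate,
                       weight_decay):
    """Group parameters by (layer ID, decay / no_decay)."""
    num_layers = num_blocks + 2  # embeddings + blocks + task-specific slot
    groups = {}
    for name, shape in named_shapes:
        # Bias terms and 1D parameters are excluded from weight decay.
        no_decay = len(shape) == 1 or name.endswith(".bias")
        layer_id = get_layer_id(name, num_layers)
        key = (layer_id, "no_decay" if no_decay else "decay")
        if key not in groups:
            scale = decay_rate ** (num_layers - layer_id - 1)
            groups[key] = {
                "param_names": [],
                "lr": base_lr * scale,
                "weight_decay": 0.0 if no_decay else weight_decay,
            }
        groups[key]["param_names"].append(name)
    return groups


# Hypothetical two-block model, described by (name, shape) pairs.
example_params = [
    ("backbone.cls_token", (1, 1, 8)),
    ("backbone.patch_embed.proj.weight", (8, 3, 4, 4)),
    ("backbone.layers.0.attn.qkv.weight", (24, 8)),
    ("backbone.layers.0.attn.qkv.bias", (24,)),
    ("backbone.layers.1.norm1.weight", (8,)),
    ("decode_head.conv_seg.weight", (4, 8, 1, 1)),
    ("decode_head.conv_seg.bias", (4,)),
]

groups = build_param_groups(example_params, num_blocks=2, base_lr=1e-4,
                            decay_rate=0.5, weight_decay=0.05)
for key in sorted(groups):
    print(key, groups[key])
```

With real tensors, each group would hold the parameters themselves instead of their names, and the resulting list of dictionaries with per-group `lr` and `weight_decay` values could be handed to an optimizer that supports per-parameter-group options.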

Usage

Apply this principle when fine-tuning pretrained ViT backbones for downstream tasks like semantic segmentation, where preserving low-level visual features is critical.

Theoretical Basis

Layer-wise learning rate decay was popularized for vision transformers by the BEiT paper (Bao et al., 2021) and has become standard practice for fine-tuning large pretrained vision transformers. The exponential decay schedule is motivated by the observation that early layers learn more universal features while later layers are more task-specific, so early layers need less adaptation during fine-tuning.
