Principle:OpenGVLab InternVL Optimizer Construction

From Leeroopedia


Knowledge Sources
Domains: Optimization, Training, Weight Decay
Last Updated: 2026-02-07 14:00 GMT

Overview

Optimizer Construction builds optimizers with per-parameter weight decay and learning rate settings, supporting layer-wise LR decay, backbone freezing, ZeRO memory optimization, and special handling for normalization and bias parameters.

Description

Large vision models require careful per-parameter optimization strategies to achieve stable and efficient training. The Optimizer Construction principle encompasses several key techniques:

  • Weight decay exclusion -- Normalization parameters (identified as 1D tensors), biases, and model-specified skip parameters receive zero weight decay. Decaying these parameters toward zero can harm training stability, so they are left unregularized.
  • Layer-wise LR decay -- Different layers of the model receive different learning rates, typically with lower layers (closer to input) receiving smaller LRs. This is important for fine-tuning pretrained models where earlier features are more general and should be preserved.
  • DCN LR multiplier -- Deformable convolution (DCN) parameters can be assigned their own learning rate, expressed as a multiplier on the base LR, to account for their different optimization dynamics.
  • Backbone freezing -- Specific backbone levels can have their gradients disabled entirely, enabling transfer learning scenarios where only a classifier head or adapter is trained.
  • ZeRO (Zero Redundancy Optimizer) -- For distributed training, the optimizer state can be partitioned across workers to reduce per-GPU memory usage, enabling training of larger models.
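
The grouping logic behind the first four techniques can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the InternVL implementation: `ParamInfo`, `build_param_groups`, the `"dcn"` name keyword, and the example parameter names are all hypothetical, and parameters are represented by name and dimensionality only so the sketch runs without a deep learning framework.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class ParamInfo:
    """Hypothetical stand-in for one named model parameter."""
    name: str      # e.g. "levels.1.norm.weight"
    ndim: int      # number of tensor dimensions
    layer_id: int  # 0 = closest to the input, num_layers = head


def build_param_groups(
    params: Iterable[ParamInfo],
    base_lr: float,
    weight_decay: float,
    num_layers: int,
    layer_decay: float = 1.0,
    skip_names: Iterable[str] = (),
    frozen_prefixes: Tuple[str, ...] = (),
    dcn_keyword: str = "dcn",  # assumed naming convention for DCN parameters
    dcn_lr_mul: float = 1.0,
) -> List[Dict]:
    """Group parameters by (layer, no-decay flag, DCN flag)."""
    skip = set(skip_names)
    groups: Dict[Tuple[int, bool, bool], Dict] = {}
    for p in params:
        # Backbone freezing: frozen parameters never reach the optimizer.
        # (In a real framework their gradients would also be disabled.)
        if p.name.startswith(frozen_prefixes):
            continue
        # Weight decay exclusion: 1D tensors, biases, and the skip list.
        no_decay = p.ndim <= 1 or p.name.endswith(".bias") or p.name in skip
        is_dcn = dcn_keyword in p.name
        # Layer-wise LR decay: layers nearer the input get a smaller scale.
        scale = layer_decay ** (num_layers - p.layer_id)
        if is_dcn:
            scale *= dcn_lr_mul
        key = (p.layer_id, no_decay, is_dcn)
        group = groups.setdefault(key, {
            "param_names": [],
            "lr": base_lr * scale,
            "weight_decay": 0.0 if no_decay else weight_decay,
        })
        group["param_names"].append(p.name)
    return list(groups.values())


# Illustrative parameters for a two-level model with a head (names are made up).
example = [
    ParamInfo("levels.0.conv.weight", 4, 0),
    ParamInfo("levels.1.dcn.offset.weight", 2, 1),
    ParamInfo("levels.1.norm.weight", 1, 1),
    ParamInfo("levels.1.fc.bias", 1, 1),
    ParamInfo("head.weight", 2, 2),
]
example_groups = build_param_groups(
    example, base_lr=1.0, weight_decay=0.05, num_layers=2,
    layer_decay=0.5, frozen_prefixes=("levels.0.",), dcn_lr_mul=0.1,
)
for g in example_groups:
    print(g)
```

In PyTorch, the same grouping could be applied to `model.named_parameters()`, with each group holding the tensors under a `"params"` key; optimizers such as `torch.optim.AdamW` accept a list of such dicts with per-group `lr` and `weight_decay`. For the ZeRO item, `torch.distributed.optim.ZeroRedundancyOptimizer` is one available wrapper that shards optimizer state across ranks; whether it suits a given setup is a separate decision.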

Usage

Apply this principle when setting up training for large vision or multimodal models. The optimizer should be constructed after the model is initialized, with the configuration specifying which parameters receive special treatment.

Theoretical Basis

Weight decay exclusion for normalization parameters is motivated by the observation that batch and layer normalization parameters control the scale and shift of features, so regularizing them toward zero can reduce the model's effective learning capacity.

Layer-wise LR decay follows the intuition from NLP transfer learning (Howard & Ruder, 2018) that lower layers learn more general features, which should change less during fine-tuning.

ZeRO (Rajbhandari et al., 2020) partitions optimizer states and, in its later stages, gradients and parameters across data-parallel workers, aiming for the memory efficiency of model parallelism with the simplicity of data parallelism.
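
Two of these ideas can be written compactly. The first line below is a common geometric form of layer-wise LR decay, assumed here for illustration rather than taken from the InternVL code; the symbols \eta_{\text{base}} (base LR), \gamma (decay factor), L (number of layers) and \ell (layer index, 0 at the input) are notation introduced for this sketch. The second line is the per-device memory accounting for optimizer-state partitioning with mixed-precision Adam as given in the ZeRO paper, where \Psi is the parameter count, N_d the number of data-parallel workers, and memory is in bytes.

```latex
% Layer-wise LR decay: layer \ell = 0 (input) ... L (head)
\eta_\ell = \eta_{\text{base}} \cdot \gamma^{\,L-\ell}, \qquad 0 < \gamma \le 1
% Example: \gamma = 0.9,\ L = 12 \Rightarrow \eta_0 = 0.9^{12}\,\eta_{\text{base}} \approx 0.28\,\eta_{\text{base}}

% ZeRO optimizer-state partitioning (fp16 weights + fp16 gradients + fp32 Adam states)
M_{\text{baseline}} = (2 + 2 + 12)\,\Psi, \qquad
M_{\text{ZeRO-OS}} = (2 + 2)\,\Psi + \frac{12\,\Psi}{N_d}
```

With \gamma = 1 every layer trains at the base LR, and as N_d grows the optimizer-state term shrinks toward zero while the weight and gradient terms stay fixed.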
