
Principle:OpenGVLab InternVL Learning Rate Scheduling

From Leeroopedia


Knowledge Sources
Domains: Learning Rate Scheduling, Training, Optimization
Last Updated: 2026-02-07 14:00 GMT

Overview

Learning Rate Scheduling controls the trajectory of the learning rate during training, using warmup phases followed by decay schedules (cosine, linear, or step) to enable stable convergence.

Description

The learning rate schedule is a critical hyperparameter in deep learning training that determines how the learning rate changes over the course of training. A well-designed schedule helps balance exploration (high LR) with fine convergence (low LR).

The InternVL classification pipeline supports three schedule types:

  • Cosine -- The learning rate follows a cosine curve from the base LR to a minimum LR over the total training steps. This provides a smooth, gradual decay that has been shown to work well for vision transformer training.
  • Linear -- The learning rate linearly decreases from the base LR to lr_min_rate * base_lr, providing a steady, predictable decay.
  • Step -- The learning rate decays by a fixed factor at regular intervals, providing discrete drops in LR.
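The three decay shapes above can be sketched as plain functions of the iteration index. This is a minimal illustration, not the InternVL implementation: the function names and the parameters total_steps, min_lr, decay_steps, and decay_rate are assumptions chosen for this example; only lr_min_rate appears in the description above.

```python
import math


def cosine_lr(step, total_steps, base_lr, min_lr):
    """Cosine decay from base_lr (step 0) to min_lr (step == total_steps)."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def linear_lr(step, total_steps, base_lr, lr_min_rate):
    """Linear decay from base_lr to lr_min_rate * base_lr."""
    progress = min(step, total_steps) / total_steps
    return base_lr * (1.0 - (1.0 - lr_min_rate) * progress)


def step_lr(step, decay_steps, base_lr, decay_rate):
    """Multiply base_lr by decay_rate once every decay_steps iterations."""
    return base_lr * decay_rate ** (step // decay_steps)
```

Cosine and linear both reach their floor exactly at total_steps, while the step schedule has no floor and keeps shrinking for as long as training runs.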

All schedules include a warmup phase in which the LR ramps linearly from a small initial value (warmup_lr) to the base LR over the warmup period. This reduces the risk of instability early in training, when model weights are randomly initialized or the model is first exposed to a new dataset.
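The linear warmup ramp can be sketched as follows. This is a hedged illustration: the function name warmup_lr_at and the parameter warmup_steps are hypothetical, and how the pipeline hands over from warmup to the decay schedule is not modeled here.

```python
def warmup_lr_at(step, warmup_steps, warmup_lr, base_lr):
    """Linear ramp from warmup_lr (step 0) to base_lr (step == warmup_steps)."""
    if step >= warmup_steps:
        # Warmup is over; a decay schedule would take over from this point.
        return base_lr
    return warmup_lr + (base_lr - warmup_lr) * step / warmup_steps
```

With warmup_steps set to 0 the function returns base_lr immediately, so warmup can be disabled without a special case.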

All schedulers operate in per-iteration mode rather than per-epoch mode, providing fine-grained LR control necessary for large-scale model training.
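Per-iteration operation means that durations configured in epochs are converted to iteration counts, and the scheduler is advanced once per batch rather than once per epoch. The sketch below illustrates that bookkeeping; the helper names and the example values (3 epochs, 4 iterations per epoch, half an epoch of warmup) are assumptions for illustration only.

```python
def epochs_to_steps(epochs, iters_per_epoch):
    """Convert a duration in epochs (possibly fractional) to iterations."""
    return int(epochs * iters_per_epoch)


def global_step(epoch, batch_idx, iters_per_epoch):
    """Iteration index passed to the scheduler for a given epoch and batch."""
    return epoch * iters_per_epoch + batch_idx


iters_per_epoch = 4
total_steps = epochs_to_steps(3, iters_per_epoch)      # whole run
warmup_steps = epochs_to_steps(0.5, iters_per_epoch)   # fractional epochs are fine

# One scheduler update per batch: the LR can take a new value at every step.
steps = [
    global_step(epoch, batch_idx, iters_per_epoch)
    for epoch in range(3)
    for batch_idx in range(iters_per_epoch)
]
```

A per-epoch scheduler would update the LR only 3 times in this run, whereas the per-iteration form updates it at each of the 12 steps.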

Usage

Apply learning rate scheduling in classification training pipelines. Select the schedule type based on the training task: cosine is the default and most widely used, step is useful for fine-tuning with specific decay points, and linear provides a simple baseline.

Theoretical Basis

Learning rate scheduling is grounded in optimization theory. A high initial learning rate enables rapid movement through the loss landscape and tends to steer the optimizer away from sharp minima, while decaying the rate allows it to converge within a flatter, more generalizable minimum. The warmup phase is particularly important for transformer architectures, where early training dynamics can be unstable due to the self-attention mechanism's sensitivity to initialization.

Cosine scheduling (Loshchilov & Hutter, 2017) has become the de facto standard for vision transformers because its smooth decay avoids the sudden performance fluctuations associated with step schedules while achieving competitive or superior final accuracy.
