Heuristic: Microsoft DeepSpeedExamples LoRA Learning Rate Scaling
| Knowledge Sources | Details |
|---|---|
| Domains | Optimization, LLMs, RLHF |
| Last Updated | 2026-02-07 13:00 GMT |
Overview
LoRA fine-tuning requires a higher learning rate than full fine-tuning, and should be combined with increased gradient accumulation steps (e.g., 8) and a LoRA dimension of 128 for single-GPU training of 1.3B-6.7B models.
Description
When using Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning, the effective gradient magnitude is reduced because only a small fraction of parameters are updated. The low-rank projection matrices (A and B) compress the gradient information, requiring a proportionally higher learning rate to achieve the same parameter update magnitude as full fine-tuning. The DeepSpeed-Chat codebase explicitly documents this requirement in the training script comments and provides calibrated settings for single-GPU training scenarios.
Usage
Apply this heuristic when switching from full fine-tuning to LoRA-based fine-tuning, or when setting up single-GPU training for models that would otherwise require multi-GPU. This is especially relevant for Step 1 (SFT) and Step 3 (RLHF) of the DeepSpeed-Chat pipeline on memory-constrained hardware.
The Insight (Rule of Thumb)
- Action: Increase learning rate when switching from full fine-tuning to LoRA.
- Value: Use `--lora_dim 128` with `--gradient_accumulation_steps 8` for single-GPU training.
- Trade-off: Higher learning rate may cause instability if set too high; start with 2-3x the full fine-tuning rate and adjust. Gradient accumulation compensates for smaller per-GPU batch size.
- Single-GPU configuration: for OPT-1.3B on a single GPU, use `--lora_dim 128 --gradient_accumulation_steps 8`.
- Scope: LoRA can fine-tune models up to 6.7B on a single A6000 (48GB) and up to 30B on a single node with 8 GPUs.
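The rule of thumb above can be sketched as a small helper. This is an illustrative assumption, not part of the DeepSpeed-Chat API: the function name, the default 3x multiplier (from the 2-3x guidance), and the example learning rate are all hypothetical.

```python
def lora_hparams(full_ft_lr, lr_scale=3.0):
    """Illustrative helper (not a DeepSpeed-Chat function): derive LoRA
    settings from a full fine-tuning learning rate, using the 2-3x
    scaling rule of thumb plus the single-GPU calibration above."""
    return {
        "learning_rate": full_ft_lr * lr_scale,   # LoRA needs a larger LR
        "lora_dim": 128,                          # calibrated for 1.3B-6.7B single-GPU
        "gradient_accumulation_steps": 8,         # restores effective batch size
    }

# Example: a hypothetical full fine-tuning LR of 1e-5 becomes 3e-5 for LoRA.
cfg = lora_hparams(1e-5)
```

If training diverges at the scaled rate, back off toward 2x before touching other hyperparameters.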
Reasoning
In full fine-tuning, the gradient directly updates all model parameters. In LoRA, updates are projected through low-rank matrices: `W = W_0 + BA` where B is (d x r) and A is (r x d), with r << d. The effective rank r limits the expressiveness of each update step, meaning more gradient steps or larger step sizes are needed to achieve equivalent parameter changes. The gradient accumulation of 8 steps compensates for the smaller micro-batch size on single GPU, maintaining a reasonable effective global batch size.
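The low-rank update can be sketched numerically. This is a minimal NumPy illustration of `W = W_0 + BA`, assuming OPT-1.3B dimensions (hidden=2048) and the lora_dim=128 from this heuristic; it is not code from the DeepSpeed-Chat repository.

```python
import numpy as np

d, r = 2048, 128                    # hidden size (OPT-1.3B), LoRA rank
W0 = np.zeros((d, d))               # frozen pretrained weight (stand-in values)
B = np.zeros((d, r))                # up-projection, zero-init so training starts at W0
A = np.random.randn(r, d) * 0.01    # down-projection

W_eff = W0 + B @ A                  # effective weight; rank(B @ A) <= r

# Trainable parameters shrink from d*d to 2*d*r:
full_params = d * d                 # 4,194,304
lora_params = d * r + r * d         # 524,288, i.e. 12.5% of full
```

Because only A and B receive gradients, each optimizer step can move W_eff only within a rank-r subspace, which is why a larger step size (learning rate) is needed to match the per-step movement of full fine-tuning.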
The factor by which LoRA reduces the activation-checkpointing overhead is `k = lora_dim * 2 / hidden_size`. For OPT-1.3B (hidden_size=2048) with lora_dim=128: k = 256 / 2048 = 0.125, so the recomputation factor in the FLOPs estimate is reduced by (1 - k), reflecting that only 12.5% of the gradient computation is attributed to the LoRA parameters.
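The formula can be checked numerically; this is a standalone re-implementation of the expression from `dschat/utils/perf.py`, not the file itself:

```python
def lora_checkpoint_factor(lora_dim, hidden_size):
    # k as computed in dschat/utils/perf.py: the fraction of the
    # activation-recomputation cost attributed to the LoRA branch
    return lora_dim * 2 / hidden_size

k = lora_checkpoint_factor(128, 2048)   # OPT-1.3B settings -> 0.125
```

Larger hidden sizes at a fixed lora_dim shrink k further, so the relative overhead of LoRA decreases as the base model grows.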
Code Evidence:
Explicit comment from `training/step1_supervised_finetuning/training_scripts/opt/single_gpu/run_1.3b.sh:7`:
```sh
# Note that usually LoRA needs to use larger learning rate
```
Configuration from the same script at line 19:
```sh
--gradient_accumulation_steps 8 --lora_dim 128 \
```
LoRA overhead calculation from `dschat/utils/perf.py:67-74`:
```python
if args.lora_dim > 0:
    k = args.lora_dim * 2 / config.hidden_size
    checkpoint_activations_factor -= (1 - k)
```