Heuristic: Microsoft DeepSpeedExamples LoRA Learning Rate Scaling
| Knowledge Sources | Details |
|---|---|
| Domains | Optimization, LLMs, RLHF |
| Last Updated | 2026-02-07 13:00 GMT |
Overview
LoRA fine-tuning requires a higher learning rate than full fine-tuning, and should be combined with increased gradient accumulation steps (e.g., 8) and a LoRA dimension of 128 for single-GPU training of 1.3B-6.7B models.
Description
When using Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning, the effective gradient magnitude is reduced because only a small fraction of parameters are updated. The low-rank projection matrices (A and B) compress the gradient information, requiring a proportionally higher learning rate to achieve the same parameter update magnitude as full fine-tuning. The DeepSpeed-Chat codebase explicitly documents this requirement in the training script comments and provides calibrated settings for single-GPU training scenarios.
Usage
Apply this heuristic when switching from full fine-tuning to LoRA-based fine-tuning, or when setting up single-GPU training for models that would otherwise require multi-GPU. This is especially relevant for Step 1 (SFT) and Step 3 (RLHF) of the DeepSpeed-Chat pipeline on memory-constrained hardware.
The Insight (Rule of Thumb)
- Action: Increase learning rate when switching from full fine-tuning to LoRA.
- Value: Use `--lora_dim 128` with `--gradient_accumulation_steps 8` for single-GPU training.
- Trade-off: Higher learning rate may cause instability if set too high; start with 2-3x the full fine-tuning rate and adjust. Gradient accumulation compensates for smaller per-GPU batch size.
- Single-GPU configuration: for OPT-1.3B on a single GPU, use `--lora_dim 128 --gradient_accumulation_steps 8`.
- Scope: LoRA can fine-tune models up to 6.7B on a single A6000 (48GB) and up to 30B on a single node with 8 GPUs.
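The rule of thumb above can be sketched as a small helper. This is an illustrative assumption, not part of the DeepSpeed-Chat API: the function name, the default 3x multiplier (from the 2-3x guidance), and the example learning rate are all hypothetical.

```python
def lora_hparams(full_ft_lr, lr_scale=3.0):
    """Illustrative helper (not a DeepSpeed-Chat function): derive LoRA
    settings from a full fine-tuning learning rate, using the 2-3x
    scaling rule of thumb plus the single-GPU calibration above."""
    return {
        "learning_rate": full_ft_lr * lr_scale,   # LoRA needs a larger LR
        "lora_dim": 128,                          # calibrated for 1.3B-6.7B single-GPU
        "gradient_accumulation_steps": 8,         # restores effective batch size
    }

# Example: a hypothetical full fine-tuning LR of 1e-5 becomes 3e-5 for LoRA.
cfg = lora_hparams(1e-5)
```

If training diverges at the scaled rate, back off toward 2x before touching other hyperparameters.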
Reasoning
In full fine-tuning, the gradient directly updates all model parameters. In LoRA, updates are projected through low-rank matrices: `W = W_0 + BA` where B is (d x r) and A is (r x d), with r << d. The effective rank r limits the expressiveness of each update step, meaning more gradient steps or larger step sizes are needed to achieve equivalent parameter changes. The gradient accumulation of 8 steps compensates for the smaller micro-batch size on single GPU, maintaining a reasonable effective global batch size.
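The low-rank update can be sketched numerically. This is a minimal NumPy illustration of `W = W_0 + BA`, assuming OPT-1.3B dimensions (hidden=2048) and the lora_dim=128 from this heuristic; it is not code from the DeepSpeed-Chat repository.

```python
import numpy as np

d, r = 2048, 128                    # hidden size (OPT-1.3B), LoRA rank
W0 = np.zeros((d, d))               # frozen pretrained weight (stand-in values)
B = np.zeros((d, r))                # up-projection, zero-init so training starts at W0
A = np.random.randn(r, d) * 0.01    # down-projection

W_eff = W0 + B @ A                  # effective weight; rank(B @ A) <= r

# Trainable parameters shrink from d*d to 2*d*r:
full_params = d * d                 # 4,194,304
lora_params = d * r + r * d         # 524,288, i.e. 12.5% of full
```

Because only A and B receive gradients, each optimizer step can move W_eff only within a rank-r subspace, which is why a larger step size (learning rate) is needed to match the per-step movement of full fine-tuning.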
The factor by which LoRA reduces the activation-checkpointing overhead is `k = lora_dim * 2 / hidden_size`. For OPT-1.3B (hidden_size=2048) with lora_dim=128: k = 256 / 2048 = 0.125, so the recomputation factor in the FLOPs estimate is reduced by (1 - k), reflecting that only 12.5% of the gradient computation is attributed to the LoRA parameters.
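The formula can be checked numerically; this is a standalone re-implementation of the expression from `dschat/utils/perf.py`, not the file itself:

```python
def lora_checkpoint_factor(lora_dim, hidden_size):
    # k as computed in dschat/utils/perf.py: the fraction of the
    # activation-recomputation cost attributed to the LoRA branch
    return lora_dim * 2 / hidden_size

k = lora_checkpoint_factor(128, 2048)   # OPT-1.3B settings -> 0.125
```

Larger hidden sizes at a fixed lora_dim shrink k further, so the relative overhead of LoRA decreases as the base model grows.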
Code Evidence:
Explicit comment from `training/step1_supervised_finetuning/training_scripts/opt/single_gpu/run_1.3b.sh:7`:
```sh
# Note that usually LoRA needs to use larger learning rate
```
Configuration from the same script at line 19:
```sh
--gradient_accumulation_steps 8 --lora_dim 128 \
```
LoRA overhead calculation from `dschat/utils/perf.py:67-74`:
```python
if args.lora_dim > 0:
    k = args.lora_dim * 2 / config.hidden_size
    checkpoint_activations_factor -= (1 - k)
```