
Heuristic:Huggingface Alignment handbook QLoRA Learning Rate Scaling

From Leeroopedia



Knowledge Sources

Domains: Optimization, LLMs
Last Updated: 2026-02-07 00:00 GMT

Overview

QLoRA training requires a learning rate roughly 10x higher than full fine-tuning, because only the small LoRA adapter weights are updated.

Description

When switching from full fine-tuning to QLoRA, the learning rate must be increased significantly. This is because LoRA adapters are low-rank decompositions of the weight updates, and only a small fraction of parameters are trainable. The alignment-handbook recipes consistently use 2e-5 for full SFT and 2e-4 for QLoRA SFT (10x higher), and 5e-7 for full DPO and 5e-6 for QLoRA DPO (10x higher).

Usage

Apply this when switching between full and QLoRA fine-tuning recipes. If you are creating custom QLoRA configs, scale the learning rate up by approximately 10x compared to the full fine-tuning equivalent.
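As a sketch, the scaling rule can be applied mechanically when deriving a QLoRA config from a full fine-tuning one. The `scale_lr_for_qlora` helper below is hypothetical (not part of the alignment-handbook); it simply encodes the ~10x factor observed in the recipes:

```python
# Hypothetical helper: derive a QLoRA learning rate from a full fine-tuning
# config dict, following the ~10x rule observed in the alignment-handbook recipes.

QLORA_LR_SCALE = 10.0  # empirical factor, consistent across the SFT and DPO recipes

def scale_lr_for_qlora(full_config: dict) -> dict:
    """Return a copy of a full fine-tuning config with the learning
    rate scaled up ~10x for QLoRA training."""
    qlora_config = dict(full_config)
    qlora_config["learning_rate"] = full_config["learning_rate"] * QLORA_LR_SCALE
    return qlora_config

# Example: the zephyr-7b-beta SFT recipe uses 2.0e-05 for full fine-tuning.
sft_full = {"learning_rate": 2.0e-05, "num_train_epochs": 1}
sft_qlora = scale_lr_for_qlora(sft_full)
print(sft_qlora["learning_rate"])  # ~2.0e-04
```

Other hyperparameters (scheduler, warmup, epochs) are left untouched; the recipes only shift the learning rate between the full and QLoRA variants.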

The Insight (Rule of Thumb)

  • Action: Multiply the learning rate by ~10 when switching from full fine-tuning to QLoRA.
  • Value:
    • SFT full: `learning_rate: 2.0e-05` -> SFT QLoRA: `learning_rate: 2.0e-04`
    • DPO full: `learning_rate: 5.0e-7` -> DPO QLoRA: `learning_rate: 5.0e-6`
  • Trade-off: Too low a learning rate with QLoRA leads to underfitting; too high leads to instability.

Reasoning

LoRA adapters add a low-rank perturbation (B · A) to the frozen base weights. Since the adapter matrices are small (typically rank 16-128) and initialized so that their product starts at zero (A is random, B is zero-initialized), they need larger gradient steps to produce meaningful updates. The alignment-handbook recipes encode this knowledge consistently:
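A minimal NumPy sketch of the LoRA forward pass (with hypothetical dimensions, and the usual alpha/r scaling omitted) illustrates why the adapter contribution starts at zero and only grows through gradient steps:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 64, 16  # hypothetical layer dims and LoRA rank

W = rng.normal(size=(d, k))              # frozen base weight
A = rng.normal(scale=0.01, size=(r, k))  # LoRA A: small random init
B = np.zeros((d, r))                     # LoRA B: zero init, so B @ A == 0 at start

x = rng.normal(size=(k,))

def forward(x):
    # Base output plus low-rank perturbation (alpha/r scaling factor omitted).
    return W @ x + B @ (A @ x)

# At initialization the adapter contributes nothing; all learning signal
# must come from gradient steps on the small A and B matrices.
assert np.allclose(forward(x), W @ x)
```

Because the trainable parameter count is tiny relative to the base model and the perturbation starts at zero, a larger learning rate is needed for the adapter to move the effective weights a comparable distance.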

SFT learning rates from recipe configs:

# recipes/zephyr-7b-beta/sft/config_full.yaml:37
learning_rate: 2.0e-05

# recipes/zephyr-7b-beta/sft/config_qlora.yaml:52
learning_rate: 2.0e-04

DPO learning rates from recipe configs:

# recipes/zephyr-7b-beta/dpo/config_full.yaml:37
learning_rate: 5.0e-7

# recipes/zephyr-7b-beta/dpo/config_qlora.yaml:44
learning_rate: 5.0e-6

The 10x factor is consistent across both SFT and DPO stages.
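The ratios in the quoted configs can be checked directly:

```python
# Sanity check: QLoRA / full learning-rate ratio in the quoted recipe values.
sft_ratio = 2.0e-04 / 2.0e-05
dpo_ratio = 5.0e-6 / 5.0e-7
print(round(sft_ratio), round(dpo_ratio))  # both 10
```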
