Heuristic:Huggingface Alignment handbook LoRA Rank Selection
| Knowledge Sources | |
|---|---|
| Domains | Optimization, LLMs |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
In the alignment-handbook Zephyr QLoRA recipes, the LoRA rank rises from 16 for SFT to 128 for DPO, giving the preference-optimization stage more adapter capacity to capture nuanced preference distinctions.
Description
The alignment-handbook uses different LoRA ranks for different training stages. The SFT recipe uses rank 16, which this heuristic treats as sufficient for learning instruction-following patterns, while the DPO recipe uses rank 128, on the view that more capacity helps capture subtle distinctions between chosen and rejected responses. Both stages set alpha equal to the rank (alpha=r), so the standard LoRA scaling factor alpha/r is 1 in both cases.
Usage
Apply this when configuring LoRA adapters for different training stages. Use lower rank (16) for SFT and higher rank (128) for DPO/preference optimization.
The Insight (Rule of Thumb)
- Action: Set `lora_r` to 16 for SFT training and 128 for DPO training. Set `lora_alpha` equal to `lora_r`.
- Value:
  - SFT: `lora_r: 16`, `lora_alpha: 16`
  - DPO: `lora_r: 128`, `lora_alpha: 128`
- Trade-off: Higher rank means more trainable parameters and higher memory usage, but better capacity to learn preference-relevant features.
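To make the trade-off concrete, the following sketch counts adapter parameters for the seven targeted projections. The layer shapes are an assumption (Mistral-7B-style dimensions: hidden size 4096, MLP size 14336, 1024-wide key/value projections, 32 blocks), and `lora_param_count` is a hypothetical helper, not part of the handbook. Each adapted linear layer with `n_in` inputs and `n_out` outputs adds `rank * (n_in + n_out)` parameters.

```python
# Assumed (in_features, out_features) for each targeted projection in one
# transformer block, using Mistral-7B-style dimensions. Adjust for other models.
PROJECTION_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32  # assumed number of transformer blocks


def lora_param_count(rank: int) -> int:
    """Trainable LoRA parameters: A is (rank x n_in), B is (n_out x rank)."""
    per_block = sum(
        rank * (n_in + n_out) for n_in, n_out in PROJECTION_SHAPES.values()
    )
    return per_block * NUM_LAYERS


for stage, rank in [("SFT", 16), ("DPO", 128)]:
    print(f"{stage}: r={rank} -> {lora_param_count(rank):,} trainable LoRA parameters")
```

Because the count is linear in the rank, rank 128 trains 8 times as many adapter parameters as rank 16 for the same target modules.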
Reasoning
SFT teaches general instruction-following behavior, which a low-rank perturbation can plausibly capture. The working assumption behind the higher DPO rank is that learning fine-grained distinctions between preferred and dispreferred responses benefits from more adapter capacity. Setting alpha equal to the rank (alpha=r) keeps the standard LoRA scaling factor alpha/r at 1 in both stages, so moving from rank 16 to rank 128 does not also change the scale applied to the adapter update, which simplifies hyperparameter tuning.
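The scaling argument can be checked with a few lines of arithmetic. This is a minimal sketch assuming the standard LoRA scaling of alpha/r (not a rank-stabilized variant); `lora_scale` is a hypothetical helper name.

```python
def lora_scale(lora_alpha: float, lora_r: int) -> float:
    """Standard LoRA scaling applied to the low-rank update: alpha / r."""
    return lora_alpha / lora_r


print(lora_scale(16, 16))    # SFT recipe: 1.0
print(lora_scale(128, 128))  # DPO recipe: 1.0
print(lora_scale(16, 128))   # hypothetical: rank raised, alpha left at 16: 0.125
```

The last line shows why alpha is raised together with the rank: leaving alpha at 16 while moving to rank 128 would shrink the scale by a factor of 8.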
SFT QLoRA config from `recipes/zephyr-7b-beta/sft/config_qlora.yaml:10-11`:
lora_r: 16
lora_alpha: 16
DPO QLoRA config from `recipes/zephyr-7b-beta/dpo/config_qlora.yaml:9-10`:
lora_r: 128
lora_alpha: 128
Both configs apply LoRA to every attention and MLP projection layer:
lora_target_modules:
- q_proj
- k_proj
- v_proj
- o_proj
- gate_proj
- up_proj
- down_proj
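One way to keep the two stages consistent is to derive both settings from a single lookup. The sketch below is illustrative only: `STAGE_RANKS`, `TARGET_MODULES`, and `lora_settings` are hypothetical names, and the returned keys simply mirror the YAML fields shown above.

```python
# Values copied from the recipe configs above; the helper itself is hypothetical.
TARGET_MODULES = [
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
]
STAGE_RANKS = {"sft": 16, "dpo": 128}


def lora_settings(stage: str) -> dict:
    """Return LoRA settings for a training stage, keeping lora_alpha equal to lora_r."""
    rank = STAGE_RANKS[stage.lower()]
    return {
        "lora_r": rank,
        "lora_alpha": rank,  # alpha = r keeps the alpha/r scaling factor at 1
        "lora_target_modules": list(TARGET_MODULES),
    }


for stage in ("sft", "dpo"):
    settings = lora_settings(stage)
    print(stage, settings["lora_r"], settings["lora_alpha"])
```

Tying `lora_alpha` to `lora_r` in one place avoids the case where the rank is changed for a new stage but alpha is left at its old value.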