
Heuristic:Microsoft LoRA Scaling Factor Alpha Over R

From Leeroopedia



Knowledge Sources
Domains Optimization, Deep_Learning
Last Updated 2026-02-10 05:30 GMT

Overview

The LoRA scaling factor `alpha/r` controls the magnitude of the low-rank update relative to the pretrained weights, enabling rank-independent learning rate tuning.

Description

Every LoRA layer computes its update as `W + (BA) * (alpha / r)`, where `alpha` is a fixed hyperparameter and `r` is the rank. The scaling factor `alpha/r` serves a critical purpose: it normalizes the LoRA update magnitude so that changing the rank does not require retuning the learning rate. When `alpha = r`, the scaling is 1.0 (no scaling). When `alpha > r` (e.g., alpha=128, r=4), the update is amplified. When `alpha < r`, the update is dampened. The NLG example uses a large alpha (128) with small rank (4) yielding scale=32, while the NLU example uses alpha=2*r (e.g., alpha=16 for r=8) yielding scale=2.
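The arithmetic above can be checked with a short sketch (the helper name `lora_scaling` is ours, not loralib's):

```python
# Illustrative arithmetic for the alpha/r scaling factor described above.
def lora_scaling(alpha: int, r: int) -> float:
    """Return the multiplier applied to the low-rank update BA."""
    return alpha / r

# NLG (GPT-2) setting from the repository: large alpha, small rank.
print(lora_scaling(128, 4))  # 32.0 -> update amplified
# NLU (RoBERTa-base) setting: alpha = 2 * r.
print(lora_scaling(16, 8))   # 2.0
# alpha == r gives a neutral scale of 1.0.
print(lora_scaling(8, 8))    # 1.0
```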

Usage

Set `lora_alpha` when configuring LoRA layers. A common pattern is `alpha = 2 * r` for NLU tasks. For NLG tasks, the repository uses a fixed `alpha = 128` regardless of rank. If you change the rank, you may need to adjust alpha to maintain the same effective update magnitude, or you can keep alpha fixed and adjust the learning rate.
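A minimal sketch of where `lora_alpha` enters a LoRA-augmented linear layer (a hypothetical numpy class for illustration, not loralib's PyTorch implementation):

```python
import numpy as np

# Hypothetical LoRA linear layer, mirroring the W + (BA) * (alpha / r) formula.
class LoRALinear:
    def __init__(self, in_features, out_features, r, lora_alpha, seed=0):
        rng = np.random.default_rng(seed)
        self.weight = rng.standard_normal((out_features, in_features))
        self.lora_A = rng.standard_normal((r, in_features)) * 0.01
        self.lora_B = np.zeros((out_features, r))  # B starts at zero, as in LoRA
        self.scaling = lora_alpha / r              # the alpha/r factor

    def forward(self, x):
        # Frozen path plus scaled low-rank update.
        return x @ self.weight.T + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(16, 8, r=8, lora_alpha=16)  # NLU-style: alpha = 2 * r
print(layer.scaling)  # 2.0
```

Because `lora_B` is initialized to zero, the layer's output at initialization equals the frozen layer's output, regardless of the scaling chosen.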

The Insight (Rule of Thumb)

  • Action: Set `lora_alpha` relative to `lora_r` to control update magnitude.
  • Values:
    • NLG (GPT-2): `alpha=128`, `r=4` → scaling = 32
    • NLU (RoBERTa-base): `alpha=16`, `r=8` → scaling = 2
    • NLU (DeBERTa XXL): `alpha=32`, `r=16` → scaling = 2
  • Rule: For NLU, use `alpha = 2 * r`. For NLG, use a larger fixed alpha with a small rank.
  • Trade-off: Higher alpha/r amplifies the LoRA update, making the model diverge more from pretrained weights. Lower alpha/r keeps the model closer to the pretrained checkpoint.
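The rule of thumb above can be sketched as plain arithmetic, showing what happens to the effective scale when the rank changes:

```python
# Sketch of the trade-off above: if r changes and alpha stays fixed,
# the effective scale alpha/r changes too.
def scaling(alpha, r):
    return alpha / r

r, alpha = 8, 16           # NLU recipe: alpha = 2 * r -> scale 2
assert scaling(alpha, r) == 2.0

r2 = 16                    # doubling the rank with alpha fixed halves the scale
assert scaling(alpha, r2) == 1.0

alpha2 = 2 * r2            # re-applying alpha = 2 * r restores scale 2
assert scaling(alpha2, r2) == 2.0
```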

Reasoning

The LoRA paper (Section 4.1) explains: "We use alpha to scale the LoRA update so that we can roughly keep the same learning rate schedule when we vary r." Without scaling, doubling the rank would roughly double the norm of the BA update, requiring a halved learning rate. The alpha/r factor compensates for this, making hyperparameter search easier. The different alpha strategies (large fixed alpha for NLG vs. 2*r for NLU) reflect different task characteristics: NLG fine-tuning needs a stronger deviation from the pretrained model for creative generation, while NLU classification benefits from staying closer to pretrained representations.
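A toy construction illustrates the compensation argument. In the fully-correlated case where all r rank-1 components of BA align (here, all-ones matrices), the raw update norm grows exactly linearly with r, and the alpha/r factor cancels that growth. This is an illustration, not a statement about trained LoRA weights:

```python
import numpy as np

# Worst-case illustration: when all r rank-1 components of BA align,
# the raw update norm grows linearly with r; alpha/r cancels it exactly.
def update_norms(r, alpha, d=32):
    B = np.ones((d, r))
    A = np.ones((r, d))
    raw = B @ A                 # every entry equals r
    scaled = raw * (alpha / r)  # every entry equals alpha
    return np.linalg.norm(raw), np.linalg.norm(scaled)

raw4, scaled4 = update_norms(r=4, alpha=128)
raw8, scaled8 = update_norms(r=8, alpha=128)
print(raw8 / raw4)        # 2.0: doubling r doubles the raw norm here
print(scaled8 / scaled4)  # 1.0: alpha/r keeps the scaled norm constant
```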

Code Evidence

Scaling factor computation from `loralib/layers.py:112`:

self.scaling = self.lora_alpha / self.r

Scaling applied during forward pass from `loralib/layers.py:149`:

result += (self.lora_dropout(x) @ self.lora_A.transpose(0, 1) @ self.lora_B.transpose(0, 1)) * self.scaling

Scaling applied during weight merging from `loralib/layers.py:141`:

self.weight.data += T(self.lora_B @ self.lora_A) * self.scaling

GPT2Config defaults from `examples/NLG/src/model.py:308-309`:

lora_attn_dim=0,
lora_attn_alpha=128,
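The three snippets above can be tied together in one self-contained numpy sketch covering scaling at initialization, in the forward pass, and at merge time (illustrative only; loralib operates on PyTorch tensors and includes dropout):

```python
import numpy as np

# Numpy analogue of the loralib snippets above: compute the scaling,
# apply it in the forward pass, then fold it into the weight at merge time.
rng = np.random.default_rng(0)
d, r, lora_alpha = 8, 4, 128

weight = rng.standard_normal((d, d))
lora_A = rng.standard_normal((r, d)) * 0.01
lora_B = rng.standard_normal((d, r)) * 0.01
scaling = lora_alpha / r  # analogue of layers.py:112 -> 32.0

x = rng.standard_normal((1, d))

# Forward pass: base result plus scaled low-rank path (cf. layers.py:149).
result = x @ weight.T + (x @ lora_A.T @ lora_B.T) * scaling

# Merge: fold the scaled update into the frozen weight (cf. layers.py:141).
merged_weight = weight + (lora_B @ lora_A) * scaling

# After merging, a plain linear layer reproduces the LoRA forward pass.
assert np.allclose(result, x @ merged_weight.T)
```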
