
Heuristic:Microsoft LoRA LoRA Rank Selection

From Leeroopedia



Knowledge Sources
Domains: Optimization, LLMs
Last Updated: 2026-02-10 05:30 GMT

Overview

Guidance on choosing the LoRA rank (r) parameter: use r=4 for NLG tasks (GPT-2), r=8 for NLU base models (RoBERTa-base), and r=16 for large NLU models (DeBERTa XXL), with alpha typically set to 2*r or a fixed value.

Description

The LoRA rank r controls the dimensionality of the low-rank update matrices: the update is ΔW = BA, where A is an r x d matrix and B is d x r. A lower rank means fewer trainable parameters but less expressive power. The Microsoft LoRA repository demonstrates different rank settings across tasks: r=4 for GPT-2 NLG tasks (with alpha=128), r=8 for RoBERTa-base NLU tasks (with alpha=16), and r=16 for DeBERTa V2 XXLarge NLU tasks (with alpha=32). The total trainable parameter count scales linearly with r.
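As a concrete illustration (not code from the repository), a minimal NumPy sketch of a rank-r update for a single d x d weight, using the paper's initialization (A random, B zero) and the common alpha / r scaling:

```python
import numpy as np

def lora_delta(d: int, r: int, alpha: int, seed: int = 0):
    """Build a rank-r update Delta W = (alpha / r) * B @ A for a d x d weight."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((r, d)) * 0.01  # r x d, small random init
    B = np.zeros((d, r))                    # d x r, zero init as in the paper
    return (alpha / r) * (B @ A), A, B

delta, A, B = lora_delta(d=768, r=8, alpha=16)
n_trainable = A.size + B.size  # 2 * r * d = 12,288 for this one weight
```

Because B starts at zero, the initial update is exactly zero, so training begins from the pretrained weights.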

Usage

Choose the LoRA rank when configuring LoRA adaptation for a new model or task. Start with r=4 for smaller models or generation tasks, r=8 for base-size classification models, and r=16 for very large models (1B+ parameters). The paper shows that even very small ranks (r=1-4) can achieve competitive results.
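The selection rule above can be encoded as a small helper. This is a hypothetical sketch (the function name and signature are illustrative, not from the repo), with the defaults taken from the Microsoft LoRA examples:

```python
def suggest_lora_config(task: str, n_params: int) -> dict:
    """Illustrative rank/alpha defaults following the Microsoft LoRA examples.

    task: "nlg" (generation) or "nlu" (classification).
    n_params: total parameter count of the base model.
    """
    if task == "nlg":
        return {"r": 4, "alpha": 128}   # GPT-2 Medium/Large NLG setting
    if n_params >= 1_000_000_000:
        return {"r": 16, "alpha": 32}   # DeBERTa V2 XXLarge setting
    return {"r": 8, "alpha": 16}        # RoBERTa-base setting

suggest_lora_config("nlu", 125_000_000)  # {'r': 8, 'alpha': 16}
```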

The Insight (Rule of Thumb)

  • Action: Set `lora_dim` / `lora_r` based on model size and task type.
  • Values:
    • GPT-2 Medium/Large NLG: `lora_dim=4`, `lora_alpha=128` (~0.35M params)
    • RoBERTa-base NLU: `lora_r=8`, `lora_alpha=16` (~0.8M params)
    • DeBERTa V2 XXLarge NLU: `lora_r=16`, `lora_alpha=32` (~4.7M params)
  • Trade-off: Higher rank = more parameters = more expressive but higher memory and compute cost. The paper shows diminishing returns above r=4-8 for most tasks.
  • Parameter count: For each adapted layer, trainable params = 2 * r * d (where d is the hidden dimension).
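The parameter-count formula can be sanity-checked against the DeBERTa V2 XXLarge figure, assuming LoRA is applied to the query and value projections (as in the paper) across the model's 48 layers with hidden size 1536:

```python
d, r = 1536, 16           # DeBERTa V2 XXLarge hidden size and LoRA rank
layers, matrices = 48, 2  # 48 transformer layers; adapt W_q and W_v in each

per_matrix = 2 * r * d                  # A (r x d) plus B (d x r)
total = per_matrix * matrices * layers
print(total)  # 4718592, i.e. ~4.7M trainable parameters, matching the README
```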

Reasoning

The LoRA paper (Section 7.2) shows that the weight update matrix has a very low "intrinsic rank" — even r=1 achieves surprisingly good performance. Increasing r beyond 4-8 provides diminishing returns for most tasks. Larger models (DeBERTa XXL with 1.5B params) benefit from slightly higher rank because they have more attention heads and a larger hidden dimension, providing more subspace to adapt. The ratio of trainable to total parameters remains tiny: 0.8M/125M = 0.6% for RoBERTa-base, 4.7M/1.5B = 0.3% for DeBERTa XXL.
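The trainable-to-total ratios quoted above work out as follows:

```python
ratios = {
    "RoBERTa-base": 0.8e6 / 125e6,  # 0.8M of 125M parameters
    "DeBERTa XXL": 4.7e6 / 1.5e9,   # 4.7M of 1.5B parameters
}
for name, frac in ratios.items():
    print(f"{name}: {frac:.1%}")  # ~0.6% and ~0.3% respectively
```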

Code Evidence

GPT-2 fine-tuning argument defaults from `examples/NLG/src/gpt2_ft.py:73-75` (note that `lora_dim` defaults to 0, which disables LoRA; the NLG examples set it to 4 at launch):

parser.add_argument('--lora_dim', type=int, default=0, help='lora attn dimension')
parser.add_argument('--lora_alpha', type=int, default=128, help='lora attn alpha')

RoBERTa-base config from `examples/NLU/roberta_base_mnli.sh:23-24`:

--lora_r 8 \
--lora_alpha 16 \

DeBERTa V2 XXLarge config from `examples/NLU/deberta_v2_xxlarge_mnli.sh:27-28`:

--lora_r 16 \
--lora_alpha 32 \

README benchmark table (excerpt):

Model         Method  Trainable params  Avg. GLUE
RoBERTa base  LoRA    0.8M              87.24
DeBERTa XXL   LoRA    4.7M              91.32
GPT-2 M       LoRA    0.35M             n/a (NLG)
