Heuristic: Microsoft LoRA - Selective LoRA on Q and V Only
| Knowledge Sources | |
|---|---|
| Domains | Optimization, LLMs |
| Last Updated | 2026-02-10 05:30 GMT |
Overview
Apply LoRA only to the Query (Q) and Value (V) projections in attention, skipping the Key (K) projection, using the `enable_lora=[True, False, True]` pattern with MergedLinear.
Description
In Transformer attention, the Q, K, V projections are often implemented as a single fused linear layer. LoRA's `MergedLinear` allows selectively enabling adaptation on a subset of these projections using the `enable_lora` boolean list. The pattern `[True, False, True]` adapts Q and V while leaving K frozen. This is the recommended default in the Microsoft LoRA repository and is based on the paper's finding that adapting Q+V provides the best balance of performance and parameter efficiency.
Usage
Use `enable_lora=[True, False, True]` when replacing a fused QKV attention projection with `lora.MergedLinear`. This is the standard configuration for GPT-2 NLG fine-tuning. For NLU models (RoBERTa, DeBERTa), LoRA is applied to separate Q and V linear layers directly.
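As a quick illustration of how the flag list maps positionally onto the fused projection order (a hypothetical helper, not part of loralib; the Q, K, V ordering is taken from the GPT-2 example below):

```python
def adapted_projections(enable_lora, names=("q", "k", "v")):
    # Pair each projection name with its flag; keep only enabled ones.
    return [n for n, on in zip(names, enable_lora) if on]

print(adapted_projections([True, False, True]))  # ['q', 'v']
```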
The Insight (Rule of Thumb)
- Action: Use `lora.MergedLinear` with `enable_lora=[True, False, True]` for fused QKV projections.
- Value: Adapts Q and V, skips K.
- Trade-off: uses 2/3 of the LoRA parameters needed to adapt all three projections (Q, K, V). Performance is comparable to or better than adapting all three, per the paper.
- Constraint: `out_features` must be divisible by `len(enable_lora)` (enforced by assertion: "The length of enable_lora must divide out_features").
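The divisibility constraint can be sketched in plain Python (a standalone mirror of the library's assertion, not loralib code itself):

```python
def check_enable_lora(out_features, enable_lora):
    # Each flag governs an equal-width slice of the fused output,
    # so out_features must split evenly across the flags.
    assert out_features % len(enable_lora) == 0, \
        'The length of enable_lora must divide out_features'
    return out_features // len(enable_lora)  # width of each slice

# GPT-2 small: nx = 768, fused QKV output = 3 * 768 = 2304
print(check_enable_lora(2304, [True, False, True]))  # 768
```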
Reasoning
The LoRA paper (Table 5) compares different combinations of adapted weight matrices. Adapting Q+V achieves the best results for a given parameter budget. The intuition is that Q and V carry the most task-specific information in attention: Q determines what to attend to, V determines what information to extract. K acts as a bridge and benefits less from adaptation. Skipping K reduces trainable parameters by ~33% with minimal performance loss.
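A back-of-envelope count makes the ~33% figure concrete, assuming the standard LoRA parameterization of r * (d_in + d_out) trainable weights per adapted square projection (the hidden size and rank below are illustrative GPT-2 small values, not quoted from the repo):

```python
def lora_params(d, r, n_adapted):
    # A is (r x d), B is (d x r) per adapted d-by-d projection.
    return n_adapted * r * (d + d)

d, r = 768, 4              # hidden size, example rank
qv  = lora_params(d, r, 2)  # adapt Q and V only
qkv = lora_params(d, r, 3)  # adapt Q, K and V
print(qv, qkv)              # 12288 18432
print(1 - qv / qkv)         # ~0.33 fewer trainable parameters
```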
Code Evidence
MergedLinear with selective LoRA from `examples/NLG/src/model.py:94-102`:
```python
self.c_attn = lora.MergedLinear(
    nx, n_state * 3,
    r=config.lora_attn_dim,
    lora_alpha=config.lora_attn_alpha,
    lora_dropout=config.lora_dropout,
    enable_lora=[True, False, True],  # Q=True, K=False, V=True
    fan_in_fan_out=True,
    merge_weights=False
)
```
enable_lora divisibility assertion from `loralib/layers.py:172-173`:
```python
assert out_features % len(enable_lora) == 0, \
    'The length of enable_lora must divide out_features'
```
Boolean index mask construction from `loralib/layers.py:187-191`:
```python
self.lora_ind = self.weight.new_zeros(
    (out_features, ), dtype=torch.bool
).view(len(enable_lora), -1)
self.lora_ind[enable_lora, :] = True
self.lora_ind = self.lora_ind.view(-1)
```
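The same mask logic can be sketched in pure Python (illustrative only; loralib builds it with tensor reshaping and boolean indexing as in the evidence above):

```python
def build_lora_ind(out_features, enable_lora):
    # Split the fused output into len(enable_lora) equal slices and
    # mark every position in an enabled slice as True.
    group = out_features // len(enable_lora)
    mask = []
    for on in enable_lora:
        mask.extend([on] * group)
    return mask

mask = build_lora_ind(12, [True, False, True])
print(mask)  # first 4 (Q) and last 4 (V) True, middle 4 (K) False
```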
README note on Q+V adaptation:
> We focus on a simple yet effective setup, namely adapting only the q and v projection in a Transformer. LoRA can be applied to any subsets of pre-trained weights.