
Heuristic:Microsoft LoRA Selective LoRA QV Only

From Leeroopedia




Knowledge Sources
Domains Optimization, LLMs
Last Updated 2026-02-10 05:30 GMT

Overview

Apply LoRA only to the Query (Q) and Value (V) projections in attention, skipping the Key (K) projection, using the `enable_lora=[True, False, True]` pattern with MergedLinear.

Description

In Transformer attention, the Q, K, V projections are often implemented as a single fused linear layer. LoRA's `MergedLinear` allows selectively enabling adaptation on a subset of these projections using the `enable_lora` boolean list. The pattern `[True, False, True]` adapts Q and V while leaving K frozen. This is the recommended default in the Microsoft LoRA repository and is based on the paper's finding that adapting Q+V provides the best balance of performance and parameter efficiency.

Usage

Use `enable_lora=[True, False, True]` when replacing a fused QKV attention projection with `lora.MergedLinear`. This is the standard configuration for GPT-2 NLG fine-tuning. For NLU models (RoBERTa, DeBERTa), whose attention computes Q, K, and V with separate linear layers, LoRA is applied to the Q and V layers directly and no merged layer is needed.

The Insight (Rule of Thumb)

  • Action: Use `lora.MergedLinear` with `enable_lora=[True, False, True]` for fused QKV projections.
  • Value: Adapts Q and V, skips K.
  • Trade-off: Uses 2/3 of the LoRA parameters needed to adapt all three projections (Q, K, V). Per the paper, performance is comparable or better than adapting all three.
  • Constraint: `out_features` must be divisible by `len(enable_lora)` (enforced by assertion: "The length of enable_lora must divide out_features").
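
The 2/3 figure follows directly from LoRA's parameter count: each adapted square projection of width `d_model` adds an A matrix (`r x d_model`) and a B matrix (`d_model x r`). A quick sketch, using GPT-2 small's width and a common rank purely for illustration:

```python
def lora_trainable_params(d_model, r, n_adapted):
    # Each adapted d_model x d_model projection adds an A matrix
    # (r x d_model) and a B matrix (d_model x r): 2 * r * d_model weights.
    return n_adapted * 2 * r * d_model

d_model, r = 768, 4          # GPT-2 small width; r=4 as in the NLG example
qv = lora_trainable_params(d_model, r, n_adapted=2)   # Q + V only
qkv = lora_trainable_params(d_model, r, n_adapted=3)  # Q + K + V
print(qv, qkv, qv / qkv)  # 12288 18432 0.6666666666666666
```

The ratio is 2/3 regardless of `d_model` or `r`, since skipping K removes exactly one of three identically sized adapter pairs.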

Reasoning

The LoRA paper (Table 5) compares different combinations of adapted weight matrices. For a given parameter budget, adapting Q+V achieves the best results. The intuition is that Q and V carry the most task-specific information in attention: Q determines what to attend to, while V determines what information is extracted. K acts as a bridge between the two and benefits less from adaptation. Skipping K reduces trainable LoRA parameters by ~33% with minimal performance loss.

Code Evidence

MergedLinear with selective LoRA from `examples/NLG/src/model.py:94-102`:

self.c_attn = lora.MergedLinear(
    nx, n_state * 3,
    r=config.lora_attn_dim,
    lora_alpha=config.lora_attn_alpha,
    lora_dropout=config.lora_dropout,
    enable_lora=[True, False, True],  # Q=True, K=False, V=True
    fan_in_fan_out=True,
    merge_weights=False
)
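
As a sketch of what this configuration allocates: `MergedLinear` creates one A block per enabled projection and a B matrix covering only the enabled slices of the fused output. The helper below is ours, and the shapes are inferred from loralib's source; `768` and `2304` stand in for GPT-2 small's `nx` and `n_state * 3`.

```python
def merged_linear_lora_shapes(in_features, out_features, r, enable_lora):
    # One (r x in_features) A block per enabled projection, and a B matrix
    # spanning only the enabled slices of the fused output dimension.
    n_enabled = sum(enable_lora)
    slice_size = out_features // len(enable_lora)
    lora_A_shape = (r * n_enabled, in_features)
    lora_B_shape = (slice_size * n_enabled, r)
    return lora_A_shape, lora_B_shape

# GPT-2 small: in = 768, fused QKV out = 3 * 768 = 2304, r = 4
a, b = merged_linear_lora_shapes(768, 2304, 4, [True, False, True])
print(a, b)  # (8, 768) (1536, 4)
```

With K disabled, both adapter matrices cover only two of the three fused slices, which is where the 2/3 parameter count comes from.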

enable_lora divisibility assertion from `loralib/layers.py:172-173`:

assert out_features % len(enable_lora) == 0, \
    'The length of enable_lora must divide out_features'

Boolean index mask construction from `loralib/layers.py:187-191`:

self.lora_ind = self.weight.new_zeros(
    (out_features, ), dtype=torch.bool
).view(len(enable_lora), -1)
self.lora_ind[enable_lora, :] = True
self.lora_ind = self.lora_ind.view(-1)
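
The resulting mask can be reproduced in plain Python. This sketch mirrors the view/assign/flatten steps above without torch (the helper name is ours, and the toy width of 6 is for illustration):

```python
def build_lora_ind(out_features, enable_lora):
    # Split the fused output dimension into len(enable_lora) equal slices
    # and mark every position inside an enabled slice as True.
    assert out_features % len(enable_lora) == 0, \
        'The length of enable_lora must divide out_features'
    slice_size = out_features // len(enable_lora)
    mask = []
    for enabled in enable_lora:
        mask.extend([enabled] * slice_size)
    return mask

# Toy fused width of 6 -> slices of 2 for Q, K, V
print(build_lora_ind(6, [True, False, True]))
# [True, True, False, False, True, True]
```

The flattened mask is later used to scatter the narrower LoRA update (covering only Q and V) into the correct positions of the full fused output, leaving the K slice untouched.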

README note on Q+V adaptation:

We focus on a simple yet effective setup, namely adapting only the q and v projection
in a Transformer. LoRA can be applied to any subsets of pre-trained weights.
