Principle:Microsoft LoRA NLU LoRA Injection
Overview
NLU LoRA Injection describes how the microsoft/LoRA repository configures the injection of low-rank adaptation layers into pretrained NLU models (RoBERTa and DeBERTa V2) for GLUE task fine-tuning. The injection is controlled through configuration flags passed via command-line arguments and propagated through the HuggingFace AutoConfig system into the model architecture.
LoRA (Low-Rank Adaptation; Hu et al., 2021; arXiv:2106.09685) augments selected weight matrices with low-rank update matrices. For a pretrained weight matrix W of dimension d x d, LoRA introduces two small matrices A (dimension r x d) and B (dimension d x r) such that the effective weight becomes W + (alpha/r) * B @ A, where r is the rank (with r much smaller than d) and alpha is a scaling factor.
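The decomposition can be written directly in a few lines of PyTorch. This is a minimal sketch of the math only, not the loralib implementation, and the sizes are illustrative placeholders rather than values from the repository:

```python
import torch

d, r, alpha = 768, 8, 16            # illustrative sizes, not taken from the repo
W = torch.randn(d, d)               # frozen pretrained weight
A = torch.randn(r, d) * 0.02        # lora_A: small random initialization
B = torch.zeros(d, r)               # lora_B: zero initialization, so B @ A starts at 0

W_eff = W + (alpha / r) * (B @ A)   # effective weight seen by the forward pass
```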
Injection Mechanism
The modified HuggingFace Transformers fork uses a config-driven conditional replacement strategy:
- The `ModelArguments` dataclass in `run_glue.py` defines LoRA-specific CLI flags: `--apply_lora`, `--lora_r`, `--lora_alpha`, and `--lora_path`
- These values are passed to `AutoConfig.from_pretrained()` as extra keyword arguments
- The model config object carries these values into the model constructor
- Inside each self-attention layer's `__init__`, the code checks `config.apply_lora` and conditionally creates either a `lora.Linear` or a standard `nn.Linear`
This design means that the same model code path handles both LoRA and non-LoRA training, controlled purely by configuration.
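A minimal sketch of such a dataclass is shown below. The field names mirror the flags listed above; the types, defaults, and the use of `HfArgumentParser` follow the standard `run_glue.py` pattern and are assumptions for illustration, not code copied from the repository:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ModelArguments:
    # LoRA-specific fields; when parsed with HfArgumentParser they surface as
    # the --apply_lora, --lora_r, --lora_alpha, and --lora_path CLI flags.
    apply_lora: bool = field(default=False)
    lora_r: int = field(default=8)
    lora_alpha: int = field(default=16)
    lora_path: Optional[str] = field(default=None)
```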
Target Layers
In both RoBERTa and DeBERTa V2, LoRA is injected into exactly two projections per attention layer:
- Query projection (`self.query` in RoBERTa, `self.query_proj` in DeBERTa V2)
- Value projection (`self.value` in RoBERTa, `self.value_proj` in DeBERTa V2)
The key projection is not modified. This follows the finding from the LoRA paper that adapting query and value projections yields the best accuracy-efficiency tradeoff for NLU tasks.
RoBERTa Injection
In `RobertaSelfAttention.__init__` (`modeling_roberta.py`):
```python
import loralib as lora

# Query projection: LoRA-augmented when apply_lora is set, plain nn.Linear otherwise
if config.apply_lora:
    self.query = lora.Linear(config.hidden_size, self.all_head_size,
                             config.lora_r, lora_alpha=config.lora_alpha)
else:
    self.query = nn.Linear(config.hidden_size, self.all_head_size)

# Value projection: same conditional replacement
if config.apply_lora:
    self.value = lora.Linear(config.hidden_size, self.all_head_size,
                             config.lora_r, lora_alpha=config.lora_alpha)
else:
    self.value = nn.Linear(config.hidden_size, self.all_head_size)
```
DeBERTa V2 Injection
In `DisentangledSelfAttention.__init__` (`modeling_deberta_v2.py`):
```python
import loralib as lora

# Query projection: LoRA layer kept unmerged when apply_lora is set
if config.apply_lora:
    self.query_proj = lora.Linear(config.hidden_size, self.all_head_size,
                                  r=config.lora_r, lora_alpha=config.lora_alpha,
                                  merge_weights=False)
else:
    self.query_proj = nn.Linear(config.hidden_size, self.all_head_size, bias=True)

# Value projection: same conditional replacement
if config.apply_lora:
    self.value_proj = lora.Linear(config.hidden_size, self.all_head_size,
                                  r=config.lora_r, lora_alpha=config.lora_alpha,
                                  merge_weights=False)
else:
    self.value_proj = nn.Linear(config.hidden_size, self.all_head_size, bias=True)
```
The `merge_weights=False` flag in DeBERTa V2 ensures that the low-rank matrices `lora_A` and `lora_B` are kept as separate parameters during training, which is important for later weight extraction and checkpoint management.
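The effect can be seen by inspecting a standalone `lora.Linear` layer. The sizes below are illustrative placeholders (hidden size 1024, r=16, alpha=32), not values read from the repository:

```python
import loralib as lora

# With merge_weights=False the low-rank factors remain separate named
# parameters rather than being folded into the frozen weight at eval time.
layer = lora.Linear(1024, 1024, r=16, lora_alpha=32, merge_weights=False)

for name, param in layer.named_parameters():
    print(name, tuple(param.shape))
# weight (1024, 1024)
# bias   (1024,)
# lora_A (16, 1024)
# lora_B (1024, 16)
```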
LoRA Weight Initialization
The DeBERTa V2 model includes custom initialization for LoRA parameters in its `_init_weights` method:
```python
if hasattr(module.query_proj, 'lora_A'):
    module.query_proj.lora_A.data.normal_(mean=0.0, std=self.config.initializer_range)
if hasattr(module.value_proj, 'lora_A'):
    module.value_proj.lora_A.data.normal_(mean=0.0, std=self.config.initializer_range)
```
The `lora_A` matrix is initialized with a normal distribution, while `lora_B` is typically initialized to zero (the loralib default), ensuring that the LoRA contribution starts at zero and the model begins training from the pretrained weights.
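This property is easy to verify with a small sketch: as long as `lora_B` is all zeros (the loralib default), the output of a `lora.Linear` layer matches a plain linear projection using the same weight and bias. The sizes below are illustrative only:

```python
import torch
import torch.nn.functional as F
import loralib as lora

layer = lora.Linear(64, 64, r=8, lora_alpha=16)   # lora_B starts as zeros
x = torch.randn(2, 64)

# B @ A is zero at initialization, so the LoRA branch contributes nothing
# and the layer reproduces the underlying pretrained projection exactly.
baseline = F.linear(x, layer.weight, layer.bias)
assert torch.allclose(layer(x), baseline)
```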
Typical Configurations
| Model | Rank (r) | Alpha | Trainable LoRA Params | Total Params |
|---|---|---|---|---|
| RoBERTa-base | 8 | 16 | ~0.3M | 125M |
| RoBERTa-large | 8 | 16 | ~0.8M | 355M |
| DeBERTa V2 XXL | 16 | 32 | ~4.7M | 1.5B |
The alpha/r ratio scales the low-rank update B @ A and acts roughly like a learning-rate multiplier for the LoRA parameters. A ratio of 2 (alpha=16 with r=8, or alpha=32 with r=16) is the standard configuration used across all NLU experiments in this repository.
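In loralib this ratio is computed once at construction time and stored on the layer. A quick check, assuming the layer exposes the computed `scaling` attribute as current loralib versions do:

```python
import loralib as lora

layer = lora.Linear(768, 768, r=8, lora_alpha=16)
print(layer.scaling)   # 2.0 -- alpha / r, applied to the B @ A update
```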
Configuration Flow
The complete flow from CLI to model architecture:
- CLI flags are parsed into the `ModelArguments` dataclass
- `AutoConfig.from_pretrained()` receives `apply_lora`, `lora_r`, and `lora_alpha`
- Config object is passed to `AutoModelForSequenceClassification.from_pretrained()`
- Model constructor passes config to each attention layer
- Each attention layer conditionally creates `lora.Linear` or `nn.Linear` (sketched below)
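The first three steps correspond roughly to the following sketch. The placeholder values stand in for parsed `ModelArguments` fields, and the conditional injection only takes effect when this runs against the repository's modified Transformers fork; stock Transformers would store the extra config attributes but ignore them:

```python
from transformers import AutoConfig, AutoModelForSequenceClassification

# Placeholder values standing in for the parsed ModelArguments fields
model_name = "roberta-base"
apply_lora, lora_r, lora_alpha = True, 8, 16

# Extra keyword arguments become attributes on the returned config object,
# which the modified attention layers later read as config.apply_lora, etc.
config = AutoConfig.from_pretrained(
    model_name,
    num_labels=2,
    apply_lora=apply_lora,
    lora_r=lora_r,
    lora_alpha=lora_alpha,
)
model = AutoModelForSequenceClassification.from_pretrained(model_name, config=config)
```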
Metadata
| Field | Value |
|---|---|
| Source | Repo (microsoft/LoRA) |
| Domains | Configuration, NLU, LoRA |
| Related | Implementation:Microsoft_LoRA_Run_GLUE_LoRA_Config |