Principle:Microsoft LoRA NLU LoRA Injection
Overview
NLU LoRA Injection describes how the microsoft/LoRA repository configures the injection of low-rank adaptation layers into pretrained NLU models (RoBERTa and DeBERTa V2) for GLUE task fine-tuning. The injection is controlled through configuration flags passed via command-line arguments and propagated through the HuggingFace AutoConfig system into the model architecture.
LoRA (Low-Rank Adaptation; Hu et al., 2021; arXiv:2106.09685) augments selected weight matrices with low-rank update matrices. For a pretrained weight matrix W of dimension d x d, LoRA introduces two small matrices A (dimension r x d) and B (dimension d x r) such that the effective weight becomes W + (alpha/r) * B @ A, where r is the rank (with r much smaller than d) and alpha is a scaling factor.
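The decomposition can be written directly in a few lines of PyTorch. This is a minimal sketch of the math only, not the loralib implementation, and the sizes are illustrative placeholders rather than values from the repository:

```python
import torch

d, r, alpha = 768, 8, 16            # illustrative sizes, not taken from the repo
W = torch.randn(d, d)               # frozen pretrained weight
A = torch.randn(r, d) * 0.02        # lora_A: small random initialization
B = torch.zeros(d, r)               # lora_B: zero initialization, so B @ A starts at 0

W_eff = W + (alpha / r) * (B @ A)   # effective weight seen by the forward pass
```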
Injection Mechanism
The modified HuggingFace Transformers fork uses a config-driven conditional replacement strategy:
- The `ModelArguments` dataclass in `run_glue.py` defines LoRA-specific CLI flags: `--apply_lora`, `--lora_r`, `--lora_alpha`, and `--lora_path`
- These values are passed to `AutoConfig.from_pretrained()` as extra keyword arguments
- The model config object carries these values into the model constructor
- Inside each self-attention layer's `__init__`, the code checks `config.apply_lora` and conditionally creates either a `lora.Linear` or a standard `nn.Linear`
This design means that the same model code path handles both LoRA and non-LoRA training, controlled purely by configuration.
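A minimal sketch of such a dataclass is shown below. The field names mirror the flags listed above; the types, defaults, and the use of `HfArgumentParser` follow the standard `run_glue.py` pattern and are assumptions for illustration, not code copied from the repository:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ModelArguments:
    # LoRA-specific fields; when parsed with HfArgumentParser they surface as
    # the --apply_lora, --lora_r, --lora_alpha, and --lora_path CLI flags.
    apply_lora: bool = field(default=False)
    lora_r: int = field(default=8)
    lora_alpha: int = field(default=16)
    lora_path: Optional[str] = field(default=None)
```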
Target Layers
In both RoBERTa and DeBERTa V2, LoRA is injected into exactly two projections per attention layer:
- Query projection (`self.query` in RoBERTa, `self.query_proj` in DeBERTa V2)
- Value projection (`self.value` in RoBERTa, `self.value_proj` in DeBERTa V2)
The key projection is not modified. This follows the finding from the LoRA paper that adapting query and value projections yields the best accuracy-efficiency tradeoff for NLU tasks.
RoBERTa Injection
In `RobertaSelfAttention.__init__` (`modeling_roberta.py`):
```python
import loralib as lora

# Query projection: LoRA-augmented when apply_lora is set, plain nn.Linear otherwise
if config.apply_lora:
    self.query = lora.Linear(config.hidden_size, self.all_head_size,
                             config.lora_r, lora_alpha=config.lora_alpha)
else:
    self.query = nn.Linear(config.hidden_size, self.all_head_size)

# Value projection: same conditional replacement
if config.apply_lora:
    self.value = lora.Linear(config.hidden_size, self.all_head_size,
                             config.lora_r, lora_alpha=config.lora_alpha)
else:
    self.value = nn.Linear(config.hidden_size, self.all_head_size)
```
DeBERTa V2 Injection
In `DisentangledSelfAttention.__init__` (`modeling_deberta_v2.py`):
```python
import loralib as lora

# Query projection: LoRA layer kept unmerged when apply_lora is set
if config.apply_lora:
    self.query_proj = lora.Linear(config.hidden_size, self.all_head_size,
                                  r=config.lora_r, lora_alpha=config.lora_alpha,
                                  merge_weights=False)
else:
    self.query_proj = nn.Linear(config.hidden_size, self.all_head_size, bias=True)

# Value projection: same conditional replacement
if config.apply_lora:
    self.value_proj = lora.Linear(config.hidden_size, self.all_head_size,
                                  r=config.lora_r, lora_alpha=config.lora_alpha,
                                  merge_weights=False)
else:
    self.value_proj = nn.Linear(config.hidden_size, self.all_head_size, bias=True)
```
The `merge_weights=False` flag in DeBERTa V2 ensures that the low-rank matrices `lora_A` and `lora_B` are kept as separate parameters during training, which is important for later weight extraction and checkpoint management.
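The effect can be seen by inspecting a standalone `lora.Linear` layer. The sizes below are illustrative placeholders (hidden size 1024, r=16, alpha=32), not values read from the repository:

```python
import loralib as lora

# With merge_weights=False the low-rank factors remain separate named
# parameters rather than being folded into the frozen weight at eval time.
layer = lora.Linear(1024, 1024, r=16, lora_alpha=32, merge_weights=False)

for name, param in layer.named_parameters():
    print(name, tuple(param.shape))
# weight (1024, 1024)
# bias   (1024,)
# lora_A (16, 1024)
# lora_B (1024, 16)
```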
LoRA Weight Initialization
The DeBERTa V2 model includes custom initialization for LoRA parameters in its `_init_weights` method:
```python
if hasattr(module.query_proj, 'lora_A'):
    module.query_proj.lora_A.data.normal_(mean=0.0, std=self.config.initializer_range)
if hasattr(module.value_proj, 'lora_A'):
    module.value_proj.lora_A.data.normal_(mean=0.0, std=self.config.initializer_range)
```
The `lora_A` matrix is initialized with a normal distribution, while `lora_B` is typically initialized to zero (the loralib default), ensuring that the LoRA contribution starts at zero and the model begins training from the pretrained weights.
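This property is easy to verify with a small sketch: as long as `lora_B` is all zeros (the loralib default), the output of a `lora.Linear` layer matches a plain linear projection using the same weight and bias. The sizes below are illustrative only:

```python
import torch
import torch.nn.functional as F
import loralib as lora

layer = lora.Linear(64, 64, r=8, lora_alpha=16)   # lora_B starts as zeros
x = torch.randn(2, 64)

# B @ A is zero at initialization, so the LoRA branch contributes nothing
# and the layer reproduces the underlying pretrained projection exactly.
baseline = F.linear(x, layer.weight, layer.bias)
assert torch.allclose(layer(x), baseline)
```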
Typical Configurations
| Model | Rank (r) | Alpha | Trainable LoRA Params | Total Params |
|---|---|---|---|---|
| RoBERTa-base | 8 | 16 | ~0.3M | 125M |
| RoBERTa-large | 8 | 16 | ~0.8M | 355M |
| DeBERTa V2 XXL | 16 | 32 | ~4.7M | 1.5B |
The alpha/r ratio scales the low-rank update B @ A and acts roughly like a learning-rate multiplier for the LoRA parameters. A ratio of 2 (alpha=16 with r=8, or alpha=32 with r=16) is the standard configuration used across all NLU experiments in this repository.
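In loralib this ratio is computed once at construction time and stored on the layer. A quick check, assuming the layer exposes the computed `scaling` attribute as current loralib versions do:

```python
import loralib as lora

layer = lora.Linear(768, 768, r=8, lora_alpha=16)
print(layer.scaling)   # 2.0 -- alpha / r, applied to the B @ A update
```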
Configuration Flow
The complete flow from CLI to model architecture:
- CLI flags are parsed into the `ModelArguments` dataclass
- `AutoConfig.from_pretrained()` receives `apply_lora`, `lora_r`, and `lora_alpha`
- Config object is passed to `AutoModelForSequenceClassification.from_pretrained()`
- Model constructor passes config to each attention layer
- Each attention layer conditionally creates `lora.Linear` or `nn.Linear` (sketched below)
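The first three steps correspond roughly to the following sketch. The placeholder values stand in for parsed `ModelArguments` fields, and the conditional injection only takes effect when this runs against the repository's modified Transformers fork; stock Transformers would store the extra config attributes but ignore them:

```python
from transformers import AutoConfig, AutoModelForSequenceClassification

# Placeholder values standing in for the parsed ModelArguments fields
model_name = "roberta-base"
apply_lora, lora_r, lora_alpha = True, 8, 16

# Extra keyword arguments become attributes on the returned config object,
# which the modified attention layers later read as config.apply_lora, etc.
config = AutoConfig.from_pretrained(
    model_name,
    num_labels=2,
    apply_lora=apply_lora,
    lora_r=lora_r,
    lora_alpha=lora_alpha,
)
model = AutoModelForSequenceClassification.from_pretrained(model_name, config=config)
```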
Metadata
| Field | Value |
|---|---|
| Source | Repo (microsoft/LoRA) |
| Domains | Configuration, NLU, LoRA |
| Related | Implementation:Microsoft_LoRA_Run_GLUE_LoRA_Config |