Principle: Microsoft LoRA Low-Rank Layer Replacement
| Knowledge Sources | |
|---|---|
| Domains | Model_Architecture, Parameter_Efficient_Fine_Tuning |
| Last Updated | 2026-02-10 05:00 GMT |
Overview
Principle of replacing standard neural network layers with low-rank augmented versions that inject trainable rank-decomposition matrices alongside frozen pretrained weights.
Description
LoRA's core architectural insight is that standard layers (Linear, Embedding, Conv2d) can be replaced with drop-in equivalents that add a low-rank trainable path in parallel with the frozen pretrained weight matrix. Each LoRA-augmented layer inherits from both the original PyTorch module (e.g., nn.Linear) and a LoRALayer base class, forming a dual-inheritance pattern. The pretrained weight W remains frozen while two small matrices B and A are trained. The forward pass computes:

h = W x + (alpha / r) * B A x

where r is the rank and alpha is a scaling hyperparameter. Because the original layer's interface is preserved, LoRA layers can be swapped in without changing any surrounding code.
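The parallel-path forward pass can be sketched in a few lines of NumPy. This is a minimal illustration with toy dimensions (all values are assumptions for demonstration), showing that the two-path formulation is equivalent to using a single effective weight W + (alpha/r) * B A:

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, r, alpha = 6, 4, 2, 4   # toy sizes; scaling alpha / r = 2.0

W = rng.standard_normal((d_out, d_in))   # frozen pretrained weight
A = rng.standard_normal((r, d_in))       # trainable down-projection
B = rng.standard_normal((d_out, r))      # trainable up-projection
x = rng.standard_normal(d_in)

# Two parallel paths: frozen base plus scaled low-rank update
h = W @ x + (alpha / r) * (B @ (A @ x))

# Equivalent single-weight view: merge B A into W
h_merged = (W + (alpha / r) * (B @ A)) @ x
assert np.allclose(h, h_merged)
```

The merged view is why trained LoRA weights can be folded into W for inference with zero added latency.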
Usage
Use this principle whenever modifying a pretrained model for LoRA fine-tuning. Replace target layers (typically attention projections) with their LoRA equivalents. The choice of which layers to replace and the rank r are the primary hyperparameters.
Theoretical Basis
Low-Rank Decomposition
The weight update learned during fine-tuning is constrained to a low-rank product, so the effective weight becomes:

W' = W + (alpha / r) * B A

where:
- W is the original pretrained weight matrix (frozen), with dimensions (d_out x d_in)
- B is a trainable matrix of shape (d_out x r)
- A is a trainable matrix of shape (r x d_in)
- r is the LoRA rank, typically 1-64, much smaller than d_out and d_in
The total number of trainable parameters per layer is r * (d_out + d_in), compared to d_out * d_in for full fine-tuning.
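The savings are easy to quantify. A short sketch with an assumed hidden size of 768 (e.g., a GPT-2-small-sized projection) and rank 8:

```python
d_out = d_in = 768   # assumed hidden size, for illustration only
r = 8

full_ft = d_out * d_in          # full fine-tuning of one weight matrix
lora = r * (d_out + d_in)       # LoRA trainable parameters for that layer

print(full_ft)          # 589824
print(lora)             # 12288
print(full_ft / lora)   # 48.0 -> ~48x fewer trainable parameters
```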
Scaling Factor
The LoRA output is scaled by the factor alpha / r, where alpha (lora_alpha) is a constant hyperparameter. This scaling ensures that changing the rank r does not require retuning the learning rate. When alpha equals r, the scaling factor is 1 and the LoRA contribution is unscaled.
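A trivial helper makes the scaling behavior concrete (the function name is hypothetical, not part of loralib):

```python
def lora_scaling(lora_alpha: int, r: int) -> float:
    # The low-rank path's output is multiplied by alpha / r
    return lora_alpha / r

print(lora_scaling(16, 8))   # 2.0
print(lora_scaling(8, 8))    # 1.0 -> contribution is unscaled when alpha == r
```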
Initialization Strategy
To ensure that LoRA starts as an identity transformation (i.e., the model behaves identically to the pretrained model at initialization):
- A is initialized with Kaiming uniform initialization (the standard PyTorch default)
- B is initialized with zeros
Since B starts as zeros, the product BA is zero at initialization, meaning the LoRA-augmented layer produces identical output to the original pretrained layer before any training occurs.
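The identity-at-initialization property can be verified directly. A NumPy sketch under simplifying assumptions: the uniform bound below is a simplification of PyTorch's `kaiming_uniform_` default (which uses `a=sqrt(5)`), and the alpha/r scaling is omitted since B A x is zero regardless:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 4, 2

W = rng.standard_normal((d_out, d_in))   # frozen pretrained weight

# A: Kaiming-uniform-style init (simplified bound); B: zeros
bound = 1.0 / np.sqrt(d_in)
A = rng.uniform(-bound, bound, size=(r, d_in))
B = np.zeros((d_out, r))

x = rng.standard_normal(d_in)
h_base = W @ x                 # original pretrained layer
h_lora = W @ x + B @ (A @ x)   # LoRA-augmented layer at initialization

assert np.allclose(h_base, h_lora)   # identical output before any training
```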
Dual-Inheritance Pattern
Each LoRA layer class inherits from both the corresponding PyTorch module and the LoRALayer base class:
- Linear inherits from nn.Linear and LoRALayer
- Embedding inherits from nn.Embedding and LoRALayer
- MergedLinear inherits from nn.Linear and LoRALayer
- Conv2d (via ConvLoRA) inherits from the corresponding nn.Conv* and LoRALayer
The LoRALayer base class manages common attributes: r, lora_alpha, scaling, lora_dropout, and the merged state flag.
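A minimal sketch of the dual-inheritance pattern in PyTorch follows. This is an illustrative simplification, not the full loralib implementation: it omits dropout, the merge/unmerge logic, and the fan_in_fan_out option, and it keeps the bias trainable:

```python
import torch
import torch.nn as nn


class LoRALayer:
    """Base class holding the attributes shared by all LoRA layers."""
    def __init__(self, r: int, lora_alpha: int):
        self.r = r
        self.lora_alpha = lora_alpha
        self.scaling = lora_alpha / r
        self.merged = False


class Linear(nn.Linear, LoRALayer):
    """LoRA-augmented linear layer: inherits from nn.Linear and LoRALayer."""
    def __init__(self, in_features: int, out_features: int,
                 r: int = 8, lora_alpha: int = 8):
        nn.Linear.__init__(self, in_features, out_features)
        LoRALayer.__init__(self, r=r, lora_alpha=lora_alpha)
        self.lora_A = nn.Parameter(torch.zeros(r, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        nn.init.kaiming_uniform_(self.lora_A, a=5 ** 0.5)  # B stays zero
        self.weight.requires_grad = False                  # freeze pretrained W

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        base = nn.Linear.forward(self, x)                  # frozen path
        return base + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling


layer = Linear(4, 6, r=2, lora_alpha=2)
out = layer(torch.randn(3, 4))
assert out.shape == (3, 6)
```

Because `lora_B` starts at zero, `layer` reproduces the plain `nn.Linear` output exactly at initialization.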
Supported Layer Types
| Layer Type | Use Case | Key Feature |
|---|---|---|
| Linear | Standard linear projections (Q, K, V individually) | Basic LoRA with optional fan_in_fan_out transpose |
| Embedding | Token embedding layers | LoRA matrices applied after embedding lookup |
| MergedLinear | Combined QKV projections (e.g., GPT-2 c_attn) | Selective enable_lora list to apply LoRA to subset of merged outputs |
| Conv2d | Convolutional layers | LoRA via ConvLoRA base class, also supports Conv1d and Conv3d |
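The MergedLinear idea of selectively applying LoRA to a subset of fused outputs can be illustrated conceptually in NumPy. This is a simplified sketch, not loralib's implementation (loralib also keeps separate A blocks per enabled output and uses grouped convolution internally); here a single shared A is used and disabled outputs simply get zero B blocks:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                # per-output dimension (toy size)
enable_lora = [True, False, True]    # LoRA on Q and V only, as in GPT-2 c_attn
n = len(enable_lora)
r = 2

W = rng.standard_normal((n * d, d))  # fused QKV weight (frozen)
A = rng.standard_normal((r, d))      # shared down-projection (simplification)
# One B block per fused output; disabled outputs get a zero block
B = np.concatenate([
    rng.standard_normal((d, r)) if on else np.zeros((d, r))
    for on in enable_lora
])

x = rng.standard_normal(d)
h = W @ x + B @ (A @ x)              # alpha / r scaling omitted for brevity

# The K rows (middle block) are untouched by the low-rank path
assert np.allclose(h[d:2*d], (W @ x)[d:2*d])
```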