Principle: Microsoft LoRA Low-Rank Layer Replacement
| Knowledge Sources | |
|---|---|
| Domains | Model_Architecture, Parameter_Efficient_Fine_Tuning |
| Last Updated | 2026-02-10 05:00 GMT |
Overview
Principle of replacing standard neural network layers with low-rank augmented versions that inject trainable rank-decomposition matrices alongside frozen pretrained weights.
Description
LoRA's core architectural insight is that standard layers (Linear, Embedding, Conv2d) can be replaced with drop-in equivalents that add a low-rank trainable path in parallel with the frozen pretrained weight matrix. Each LoRA-augmented layer inherits from both the original PyTorch module (e.g., nn.Linear) and a LoRALayer base class, forming a dual-inheritance pattern. The pretrained weight W remains frozen while two small matrices B and A are trained. The forward pass computes:

h = W x + (alpha / r) * B A x

where r is the rank and alpha is a scaling hyperparameter. Because the original layer's interface is preserved, LoRA layers can be swapped in without changing any surrounding code.
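The parallel-path forward pass can be sketched in a few lines of NumPy. This is a minimal illustration with toy dimensions (all values are assumptions for demonstration), showing that the two-path formulation is equivalent to using a single effective weight W + (alpha/r) * B A:

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, r, alpha = 6, 4, 2, 4   # toy sizes; scaling alpha / r = 2.0

W = rng.standard_normal((d_out, d_in))   # frozen pretrained weight
A = rng.standard_normal((r, d_in))       # trainable down-projection
B = rng.standard_normal((d_out, r))      # trainable up-projection
x = rng.standard_normal(d_in)

# Two parallel paths: frozen base plus scaled low-rank update
h = W @ x + (alpha / r) * (B @ (A @ x))

# Equivalent single-weight view: merge B A into W
h_merged = (W + (alpha / r) * (B @ A)) @ x
assert np.allclose(h, h_merged)
```

The merged view is why trained LoRA weights can be folded into W for inference with zero added latency.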
Usage
Use this principle whenever modifying a pretrained model for LoRA fine-tuning. Replace target layers (typically attention projections) with their LoRA equivalents. The choice of which layers to replace and the rank r are the primary hyperparameters.
Theoretical Basis
Low-Rank Decomposition
The weight update learned during fine-tuning is constrained to a low-rank product, so the effective weight becomes:

W' = W + (alpha / r) * B A

where:
- W is the original pretrained weight matrix (frozen), with dimensions (d_out x d_in)
- B is a trainable matrix of shape (d_out x r)
- A is a trainable matrix of shape (r x d_in)
- r is the LoRA rank, typically 1-64, much smaller than d_out and d_in
The total number of trainable parameters per layer is r * (d_out + d_in), compared to d_out * d_in for full fine-tuning.
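The savings are easy to quantify. A short sketch with an assumed hidden size of 768 (e.g., a GPT-2-small-sized projection) and rank 8:

```python
d_out = d_in = 768   # assumed hidden size, for illustration only
r = 8

full_ft = d_out * d_in          # full fine-tuning of one weight matrix
lora = r * (d_out + d_in)       # LoRA trainable parameters for that layer

print(full_ft)          # 589824
print(lora)             # 12288
print(full_ft / lora)   # 48.0 -> ~48x fewer trainable parameters
```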
Scaling Factor
The LoRA output is scaled by the factor alpha / r, where alpha (lora_alpha) is a constant hyperparameter. This scaling ensures that changing the rank r does not require retuning the learning rate. When alpha equals r, the scaling factor is 1 and the LoRA contribution is unscaled.
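A trivial helper makes the scaling behavior concrete (the function name is hypothetical, not part of loralib):

```python
def lora_scaling(lora_alpha: int, r: int) -> float:
    # The low-rank path's output is multiplied by alpha / r
    return lora_alpha / r

print(lora_scaling(16, 8))   # 2.0
print(lora_scaling(8, 8))    # 1.0 -> contribution is unscaled when alpha == r
```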
Initialization Strategy
To ensure that LoRA starts as an identity transformation (i.e., the model behaves identically to the pretrained model at initialization):
- A is initialized with Kaiming uniform initialization (the standard PyTorch default)
- B is initialized with zeros
Since B starts as zeros, the product BA is zero at initialization, meaning the LoRA-augmented layer produces identical output to the original pretrained layer before any training occurs.
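The identity-at-initialization property can be verified directly. A NumPy sketch under simplifying assumptions: the uniform bound below is a simplification of PyTorch's `kaiming_uniform_` default (which uses `a=sqrt(5)`), and the alpha/r scaling is omitted since B A x is zero regardless:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 6, 4, 2

W = rng.standard_normal((d_out, d_in))   # frozen pretrained weight

# A: Kaiming-uniform-style init (simplified bound); B: zeros
bound = 1.0 / np.sqrt(d_in)
A = rng.uniform(-bound, bound, size=(r, d_in))
B = np.zeros((d_out, r))

x = rng.standard_normal(d_in)
h_base = W @ x                 # original pretrained layer
h_lora = W @ x + B @ (A @ x)   # LoRA-augmented layer at initialization

assert np.allclose(h_base, h_lora)   # identical output before any training
```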
Dual-Inheritance Pattern
Each LoRA layer class inherits from both the corresponding PyTorch module and the LoRALayer base class:
- Linear inherits from nn.Linear and LoRALayer
- Embedding inherits from nn.Embedding and LoRALayer
- MergedLinear inherits from nn.Linear and LoRALayer
- Conv2d (via ConvLoRA) inherits from the corresponding nn.Conv* and LoRALayer
The LoRALayer base class manages common attributes: r, lora_alpha, scaling, lora_dropout, and the merged state flag.
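A minimal sketch of the dual-inheritance pattern in PyTorch follows. This is an illustrative simplification, not the full loralib implementation: it omits dropout, the merge/unmerge logic, and the fan_in_fan_out option, and it keeps the bias trainable:

```python
import torch
import torch.nn as nn


class LoRALayer:
    """Base class holding the attributes shared by all LoRA layers."""
    def __init__(self, r: int, lora_alpha: int):
        self.r = r
        self.lora_alpha = lora_alpha
        self.scaling = lora_alpha / r
        self.merged = False


class Linear(nn.Linear, LoRALayer):
    """LoRA-augmented linear layer: inherits from nn.Linear and LoRALayer."""
    def __init__(self, in_features: int, out_features: int,
                 r: int = 8, lora_alpha: int = 8):
        nn.Linear.__init__(self, in_features, out_features)
        LoRALayer.__init__(self, r=r, lora_alpha=lora_alpha)
        self.lora_A = nn.Parameter(torch.zeros(r, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        nn.init.kaiming_uniform_(self.lora_A, a=5 ** 0.5)  # B stays zero
        self.weight.requires_grad = False                  # freeze pretrained W

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        base = nn.Linear.forward(self, x)                  # frozen path
        return base + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling


layer = Linear(4, 6, r=2, lora_alpha=2)
out = layer(torch.randn(3, 4))
assert out.shape == (3, 6)
```

Because `lora_B` starts at zero, `layer` reproduces the plain `nn.Linear` output exactly at initialization.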
Supported Layer Types
| Layer Type | Use Case | Key Feature |
|---|---|---|
| Linear | Standard linear projections (Q, K, V individually) | Basic LoRA with optional fan_in_fan_out transpose |
| Embedding | Token embedding layers | LoRA matrices applied after embedding lookup |
| MergedLinear | Combined QKV projections (e.g., GPT-2 c_attn) | Selective enable_lora list to apply LoRA to subset of merged outputs |
| Conv2d | Convolutional layers | LoRA via ConvLoRA base class, also supports Conv1d and Conv3d |
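The MergedLinear idea of selectively applying LoRA to a subset of fused outputs can be illustrated conceptually in NumPy. This is a simplified sketch, not loralib's implementation (loralib also keeps separate A blocks per enabled output and uses grouped convolution internally); here a single shared A is used and disabled outputs simply get zero B blocks:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                # per-output dimension (toy size)
enable_lora = [True, False, True]    # LoRA on Q and V only, as in GPT-2 c_attn
n = len(enable_lora)
r = 2

W = rng.standard_normal((n * d, d))  # fused QKV weight (frozen)
A = rng.standard_normal((r, d))      # shared down-projection (simplification)
# One B block per fused output; disabled outputs get a zero block
B = np.concatenate([
    rng.standard_normal((d, r)) if on else np.zeros((d, r))
    for on in enable_lora
])

x = rng.standard_normal(d)
h = W @ x + B @ (A @ x)              # alpha / r scaling omitted for brevity

# The K rows (middle block) are untouched by the low-rank path
assert np.allclose(h[d:2*d], (W @ x)[d:2*d])
```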