
Principle:Microsoft LoRA Low Rank Layer Replacement

From Leeroopedia


Knowledge Sources
Domains Model_Architecture, Parameter_Efficient_Fine_Tuning
Last Updated 2026-02-10 05:00 GMT

Overview

Principle of replacing standard neural network layers with low-rank augmented versions that inject trainable rank-decomposition matrices alongside frozen pretrained weights.

Description

LoRA's core architectural insight is that standard layers (Linear, Embedding, Conv2d) can be replaced with drop-in equivalents that add a low-rank trainable path in parallel with the frozen pretrained weight matrix. Each LoRA-augmented layer inherits from both the original PyTorch module (e.g., nn.Linear) and a LoRALayer base class, forming a dual-inheritance pattern. The pretrained weight W remains frozen while two small matrices B and A are trained. The forward pass computes:

h = Wx + (alpha / r) * BAx

where r is the rank and alpha is a scaling hyperparameter. This design means LoRA layers are drop-in replacements that preserve the original layer's interface.
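A minimal sketch of this forward pass in plain PyTorch tensors (the sizes, rank, and alpha below are illustrative, not taken from the repository):

  import torch

  d_in, d_out, r, alpha = 768, 768, 8, 16      # illustrative sizes, not from the source
  scaling = alpha / r

  W = torch.randn(d_out, d_in)                 # frozen pretrained weight
  A = torch.randn(r, d_in)                     # trainable; Kaiming-initialized in loralib
  B = torch.randn(d_out, r)                    # trainable; zero-initialized in loralib (nonzero here to show the math)
  x = torch.randn(d_in)

  h = W @ x + scaling * (B @ (A @ x))          # h = Wx + (alpha/r) * BAx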

Usage

Use this principle whenever modifying a pretrained model for LoRA fine-tuning. Replace target layers (typically attention projections) with their LoRA equivalents. The choice of which layers to replace and the rank r are the primary hyperparameters.
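As a sketch of this replacement step, assuming a model whose attention blocks expose q_proj and v_proj attributes (hypothetical names here), and using the lora.Linear layer and mark_only_lora_as_trainable helper from loralib:

  import loralib as lora

  def add_lora_to_attention(block, r=8, lora_alpha=16):
      # Swap the query and value projections (assumed attribute names) for LoRA layers.
      for name in ("q_proj", "v_proj"):
          old = getattr(block.attn, name)
          new = lora.Linear(old.in_features, old.out_features,
                            r=r, lora_alpha=lora_alpha, bias=old.bias is not None)
          new.weight.data.copy_(old.weight.data)      # carry over the pretrained weight
          if old.bias is not None:
              new.bias.data.copy_(old.bias.data)
          setattr(block.attn, name, new)

  # After replacing the target layers, freeze everything except the LoRA matrices:
  # lora.mark_only_lora_as_trainable(model)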

Theoretical Basis

Low-Rank Decomposition

The weight update during fine-tuning is constrained to a low-rank product, so the effective weight becomes:

W' = W + BA

where:

  • W is the original pretrained weight matrix (frozen), with dimensions (d_out x d_in)
  • B is a trainable matrix of shape (d_out x r)
  • A is a trainable matrix of shape (r x d_in)
  • r is the LoRA rank, typically 1-64, much smaller than d_out and d_in

The total number of trainable parameters per layer is r * (d_out + d_in), compared to d_out * d_in for full fine-tuning.
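For a concrete sense of the savings, a quick calculation for an illustrative 768 x 768 projection at r = 8 (sizes chosen for the example, not from the source):

  d_out, d_in, r = 768, 768, 8
  full = d_out * d_in            # 589,824 parameters under full fine-tuning
  lora = r * (d_out + d_in)      # 12,288 trainable LoRA parameters
  print(full // lora)            # 48x fewer trainable parameters for this layer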

Scaling Factor

The LoRA output is scaled by the factor alpha / r, where alpha (lora_alpha) is a constant hyperparameter. This scaling ensures that changing the rank r does not require retuning the learning rate. When alpha equals r, the scaling factor is 1 and the LoRA contribution is unscaled.
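For example (values illustrative), with lora_alpha fixed at 16:

  lora_alpha = 16
  for r in (4, 8, 16):
      print(r, lora_alpha / r)   # r=4 -> 4.0, r=8 -> 2.0, r=16 -> 1.0 (unscaled)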

Initialization Strategy

To ensure that the LoRA path starts as a zero mapping (i.e., the model behaves identically to the pretrained model at initialization):

  • A is initialized with Kaiming uniform initialization (the standard PyTorch default)
  • B is initialized with zeros

Since B starts as zeros, the product BA is zero at initialization, meaning the LoRA-augmented layer produces identical output to the original pretrained layer before any training occurs.
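A short sketch of this initialization and the resulting no-op contribution before training (shapes illustrative; the a=sqrt(5) argument mirrors PyTorch's default nn.Linear initialization):

  import math
  import torch
  import torch.nn as nn

  d_in, d_out, r = 64, 64, 4
  lora_A = nn.Parameter(torch.empty(r, d_in))
  lora_B = nn.Parameter(torch.zeros(d_out, r))        # B starts at zero
  nn.init.kaiming_uniform_(lora_A, a=math.sqrt(5))    # Kaiming uniform, as for nn.Linear

  x = torch.randn(d_in)
  delta = lora_B @ (lora_A @ x)                       # the LoRA path's output, BAx
  assert torch.equal(delta, torch.zeros(d_out))       # zero contribution at initialization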

Dual-Inheritance Pattern

Each LoRA layer class inherits from both the corresponding PyTorch module and the LoRALayer base class:

  • Linear inherits from nn.Linear and LoRALayer
  • Embedding inherits from nn.Embedding and LoRALayer
  • MergedLinear inherits from nn.Linear and LoRALayer
  • Conv2d (via ConvLoRA) inherits from the corresponding nn.Conv* and LoRALayer

The LoRALayer base class manages common attributes: r, lora_alpha, scaling, lora_dropout, and the merged state flag.
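A condensed sketch of this pattern, simplified from the shape of the loralib classes (weight merging and the fan_in_fan_out transpose are omitted; dropout is applied only to the LoRA path):

  import math
  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class LoRALayer:
      """Shared attributes for every LoRA-augmented layer."""
      def __init__(self, r, lora_alpha, lora_dropout=0.0):
          self.r = r
          self.lora_alpha = lora_alpha
          self.lora_dropout = nn.Dropout(p=lora_dropout) if lora_dropout > 0.0 else (lambda x: x)
          self.merged = False                            # whether BA has been folded into W

  class Linear(nn.Linear, LoRALayer):
      """nn.Linear with a parallel low-rank trainable path."""
      def __init__(self, in_features, out_features, r=0, lora_alpha=1, lora_dropout=0.0, **kwargs):
          nn.Linear.__init__(self, in_features, out_features, **kwargs)
          LoRALayer.__init__(self, r=r, lora_alpha=lora_alpha, lora_dropout=lora_dropout)
          if r > 0:
              self.lora_A = nn.Parameter(torch.zeros(r, in_features))
              self.lora_B = nn.Parameter(torch.zeros(out_features, r))
              self.scaling = lora_alpha / r
              nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
              self.weight.requires_grad = False          # freeze the pretrained weight

      def forward(self, x):
          result = F.linear(x, self.weight, self.bias)   # frozen pretrained path: Wx + b
          if self.r > 0 and not self.merged:
              lora_out = self.lora_dropout(x) @ self.lora_A.T @ self.lora_B.T
              result = result + lora_out * self.scaling  # add (alpha/r) * BAx
          return result

loralib can additionally merge BA into the frozen weight for inference; the merged flag tracks that state, but the merge itself is omitted from this sketch.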

Supported Layer Types

Layer Type | Use Case | Key Feature
Linear | Standard linear projections (Q, K, V individually) | Basic LoRA with optional fan_in_fan_out transpose
Embedding | Token embedding layers | LoRA matrices applied after embedding lookup
MergedLinear | Combined QKV projections (e.g., GPT-2 c_attn) | Selective enable_lora list to apply LoRA to a subset of merged outputs
Conv2d | Convolutional layers | LoRA via ConvLoRA base class; also supports Conv1d and Conv3d
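As a hedged usage sketch for the merged-QKV row above, following the enable_lora convention of one flag per merged output group (sizes illustrative):

  import loralib as lora

  d_model = 768
  # One matrix producing Q, K, and V; apply LoRA only to the Q and V slices.
  qkv_proj = lora.MergedLinear(d_model, 3 * d_model, r=8, enable_lora=[True, False, True])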
