Heuristic: LoRA Init Strategy (Microsoft LoRA)
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Deep_Learning |
| Last Updated | 2026-02-10 05:30 GMT |
Overview
Initialization strategy for LoRA matrices: use Kaiming uniform for A and zeros for B, so the LoRA path contributes nothing at the start of training and the adapted model initially matches the pretrained one exactly.
Description
LoRA adds trainable low-rank matrices A and B alongside frozen pretrained weights, so the effective weight is W + BA. The initialization strategy is critical: B is initialized to zeros and A with Kaiming uniform. The product BA is therefore the zero matrix at initialization, and the LoRA-adapted model starts exactly as the pretrained model; training then learns the low-rank update from this stable starting point. Note the in-code comment: "this is different than what is described in the paper but should not affect performance" — the paper describes initializing A with a Gaussian and B with zeros, whereas the code uses Kaiming uniform for A instead. (The comment's A/B labels are themselves swapped relative to the code directly beneath it, which adds to the confusion.)
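The zero-contribution property can be checked with a framework-agnostic sketch (NumPy here; the sizes and the simplified Kaiming bound are illustrative, not loralib code):

```python
import math
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 8, 2                  # illustrative sizes; r is the LoRA rank

W = rng.standard_normal((d_out, d_in))    # stands in for the frozen pretrained weight

# A: Kaiming-uniform-style init; for a=sqrt(5), PyTorch's bound works out to 1/sqrt(fan_in)
bound = 1.0 / math.sqrt(d_in)
A = rng.uniform(-bound, bound, size=(r, d_in))
B = np.zeros((d_out, r))                  # B: zeros, so the update BA is the zero matrix

x = rng.standard_normal(d_in)
base_out = W @ x
lora_out = (W + B @ A) @ x                # adapted layer == pretrained layer at init
assert np.allclose(base_out, lora_out)
```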
Usage
Apply this initialization whenever implementing LoRA layers. It ensures training stability by starting from the pretrained model's behavior. If you initialize both A and B randomly, the initial LoRA contribution will be non-zero noise which can destabilize early training.
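The destabilization risk is easy to see numerically. A minimal NumPy sketch (dimensions and scaling chosen for illustration) compares the adapter's output perturbation under zero-initialized B versus fully random factors:

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 64, 4
x = rng.standard_normal(d)

# Recommended init: A random, B zero -> adapter output is exactly zero at step 0
A = rng.standard_normal((r, d)) / np.sqrt(d)
B_zero = np.zeros((d, r))

# Both factors random -> adapter immediately injects non-zero noise into activations
B_rand = rng.standard_normal((d, r)) / np.sqrt(r)

drift_zero = float(np.linalg.norm(B_zero @ A @ x))
drift_rand = float(np.linalg.norm(B_rand @ A @ x))
assert drift_zero == 0.0 and drift_rand > 0.0
```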
The Insight (Rule of Thumb)
- Action: Zero exactly one of the two factors so that BA starts at zero. For Linear (and Conv) layers, initialize `lora_B` to zeros and `lora_A` with Kaiming uniform; for Embedding layers, initialize `lora_A` to zeros and `lora_B` with a normal distribution.
- Value: `nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))` and `nn.init.zeros_(self.lora_B)`.
- Trade-off: Essentially none; starting from BA = 0 means the model begins identical to the pretrained checkpoint, which random initialization of both matrices cannot guarantee.
- Exception: For Embedding layers the convention is reversed: `lora_A` is initialized to zeros and `lora_B` with a normal distribution, because the embedding lookup path (where `lora_A` acts as the low-rank table) differs from the linear projection path.
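The two conventions above can be summarized in one hypothetical helper (`init_lora` is not part of loralib; the simplified bound stands in for `nn.init.kaiming_uniform_` with `a=math.sqrt(5)`):

```python
import math
import numpy as np

def init_lora(kind, r, fan_in, fan_out, rng):
    """Hypothetical helper sketching both loralib conventions.

    Exactly one factor is zeroed, so B @ A == 0 either way."""
    if kind == "linear":                  # also the Conv convention in loralib
        bound = 1.0 / math.sqrt(fan_in)   # kaiming_uniform_ with a=sqrt(5) reduces to this
        A = rng.uniform(-bound, bound, size=(r, fan_in))
        B = np.zeros((fan_out, r))
    else:                                 # "embedding": reversed convention
        A = np.zeros((r, fan_in))
        B = rng.standard_normal((fan_out, r))
    return A, B

rng = np.random.default_rng(0)
for kind in ("linear", "embedding"):
    A, B = init_lora(kind, 2, 16, 16, rng)
    assert not np.any(B @ A)              # zero update at initialization in both cases
```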
Reasoning
The product BA = 0 at initialization means the LoRA-adapted model produces identical outputs to the frozen pretrained model before any training, providing a stable starting point for fine-tuning. Kaiming uniform for A (the down-projection) follows PyTorch's default initialization for `nn.Linear`, giving an appropriate scale for gradient flow. The paper (Section 4.1) describes initializing A with a random Gaussian and B with zero; the codebase uses Kaiming uniform for A instead, with a comment noting the deviation "should not affect performance."
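A short gradient check shows why only one factor may be zeroed. For y = BAx, dL/dB = g(Ax)^T and dL/dA = B^T g x^T, where g is the upstream gradient dL/dy. With both factors at zero, both gradients vanish and the adapter can never leave the origin; with A random, B receives a non-zero gradient from the first step. A NumPy sketch (shapes illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
d, r = 16, 2
x = rng.standard_normal(d)
g = rng.standard_normal(d)           # stand-in for the upstream gradient dL/dy

def lora_grads(A, B):
    # For y = B @ A @ x:  dL/dA = B^T g x^T,  dL/dB = g (A x)^T
    return np.outer(B.T @ g, x), np.outer(g, A @ x)

# Standard init: A random, B zero -> B still receives a non-zero gradient
A = rng.standard_normal((r, d))
dA, dB = lora_grads(A, np.zeros((d, r)))
assert not np.any(dA) and np.any(dB)

# Pathological init: both zero -> both gradients vanish; the adapter cannot learn
dA0, dB0 = lora_grads(np.zeros((r, d)), np.zeros((d, r)))
assert not np.any(dA0) and not np.any(dB0)
```

Note that dL/dA is also zero at the very first step under the standard init; A starts moving as soon as B becomes non-zero after the first update.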
Code Evidence
Linear layer initialization from `loralib/layers.py:119-125`:
```python
def reset_parameters(self):
    nn.Linear.reset_parameters(self)
    if hasattr(self, 'lora_A'):
        # initialize B the same way as the default for nn.Linear and A to zero
        # this is different than what is described in the paper but should not affect performance
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        nn.init.zeros_(self.lora_B)
```
Embedding layer initialization from `loralib/layers.py:55-60`:
```python
def reset_parameters(self):
    nn.Embedding.reset_parameters(self)
    if hasattr(self, 'lora_A'):
        # initialize A the same way as the default for nn.Linear and B to zero
        nn.init.zeros_(self.lora_A)
        nn.init.normal_(self.lora_B)
```
ConvLoRA initialization from `loralib/layers.py:268-273`:
```python
def reset_parameters(self):
    self.conv.reset_parameters()
    if hasattr(self, 'lora_A'):
        # initialize A the same way as the default for nn.Linear and B to zero
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        nn.init.zeros_(self.lora_B)
```