Heuristic: LoRA Init Strategy (Microsoft LoRA)
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Deep_Learning |
| Last Updated | 2026-02-10 05:30 GMT |
Overview
Initialization strategy for LoRA matrices: use Kaiming uniform for A and zeros for B, so the LoRA path contributes nothing at the start of training and the adapted model initially matches the pretrained one exactly.
Description
LoRA adds trainable low-rank matrices A and B alongside frozen pretrained weights, so the effective weight is W + BA. The initialization strategy is critical: B is initialized to zeros and A with Kaiming uniform. The product BA is therefore the zero matrix at initialization, and the LoRA-adapted model starts exactly as the pretrained model; training then learns the low-rank update from this stable starting point. Note the in-code comment: "this is different than what is described in the paper but should not affect performance" — the paper describes initializing A with a Gaussian and B with zeros, whereas the code uses Kaiming uniform for A instead. (The comment's A/B labels are themselves swapped relative to the code directly beneath it, which adds to the confusion.)
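The zero-contribution property can be checked with a framework-agnostic sketch (NumPy here; the sizes and the simplified Kaiming bound are illustrative, not loralib code):

```python
import math
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 8, 8, 2                  # illustrative sizes; r is the LoRA rank

W = rng.standard_normal((d_out, d_in))    # stands in for the frozen pretrained weight

# A: Kaiming-uniform-style init; for a=sqrt(5), PyTorch's bound works out to 1/sqrt(fan_in)
bound = 1.0 / math.sqrt(d_in)
A = rng.uniform(-bound, bound, size=(r, d_in))
B = np.zeros((d_out, r))                  # B: zeros, so the update BA is the zero matrix

x = rng.standard_normal(d_in)
base_out = W @ x
lora_out = (W + B @ A) @ x                # adapted layer == pretrained layer at init
assert np.allclose(base_out, lora_out)
```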
Usage
Apply this initialization whenever implementing LoRA layers. It ensures training stability by starting from the pretrained model's behavior. If you initialize both A and B randomly, the initial LoRA contribution will be non-zero noise which can destabilize early training.
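The destabilization risk is easy to see numerically. A minimal NumPy sketch (dimensions and scaling chosen for illustration) compares the adapter's output perturbation under zero-initialized B versus fully random factors:

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 64, 4
x = rng.standard_normal(d)

# Recommended init: A random, B zero -> adapter output is exactly zero at step 0
A = rng.standard_normal((r, d)) / np.sqrt(d)
B_zero = np.zeros((d, r))

# Both factors random -> adapter immediately injects non-zero noise into activations
B_rand = rng.standard_normal((d, r)) / np.sqrt(r)

drift_zero = float(np.linalg.norm(B_zero @ A @ x))
drift_rand = float(np.linalg.norm(B_rand @ A @ x))
assert drift_zero == 0.0 and drift_rand > 0.0
```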
The Insight (Rule of Thumb)
- Action: Zero exactly one of the two factors so that BA starts at zero. For Linear (and Conv) layers, initialize `lora_B` to zeros and `lora_A` with Kaiming uniform; for Embedding layers, initialize `lora_A` to zeros and `lora_B` with a normal distribution.
- Value: `nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))` and `nn.init.zeros_(self.lora_B)`.
- Trade-off: Essentially none; starting from BA = 0 means the model begins identical to the pretrained checkpoint, which random initialization of both matrices cannot guarantee.
- Exception: For Embedding layers the convention is reversed: `lora_A` is initialized to zeros and `lora_B` with a normal distribution, because the embedding lookup path (where `lora_A` acts as the low-rank table) differs from the linear projection path.
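The two conventions above can be summarized in one hypothetical helper (`init_lora` is not part of loralib; the simplified bound stands in for `nn.init.kaiming_uniform_` with `a=math.sqrt(5)`):

```python
import math
import numpy as np

def init_lora(kind, r, fan_in, fan_out, rng):
    """Hypothetical helper sketching both loralib conventions.

    Exactly one factor is zeroed, so B @ A == 0 either way."""
    if kind == "linear":                  # also the Conv convention in loralib
        bound = 1.0 / math.sqrt(fan_in)   # kaiming_uniform_ with a=sqrt(5) reduces to this
        A = rng.uniform(-bound, bound, size=(r, fan_in))
        B = np.zeros((fan_out, r))
    else:                                 # "embedding": reversed convention
        A = np.zeros((r, fan_in))
        B = rng.standard_normal((fan_out, r))
    return A, B

rng = np.random.default_rng(0)
for kind in ("linear", "embedding"):
    A, B = init_lora(kind, 2, 16, 16, rng)
    assert not np.any(B @ A)              # zero update at initialization in both cases
```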
Reasoning
The product BA = 0 at initialization means the LoRA-adapted model produces identical outputs to the frozen pretrained model before any training, providing a stable starting point for fine-tuning. Kaiming uniform for A (the down-projection) follows PyTorch's default initialization for `nn.Linear`, giving an appropriate scale for gradient flow. The paper (Section 4.1) describes initializing A with a random Gaussian and B with zero; the codebase uses Kaiming uniform for A instead, with a comment noting the deviation "should not affect performance."
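A short gradient check shows why only one factor may be zeroed. For y = BAx, dL/dB = g(Ax)^T and dL/dA = B^T g x^T, where g is the upstream gradient dL/dy. With both factors at zero, both gradients vanish and the adapter can never leave the origin; with A random, B receives a non-zero gradient from the first step. A NumPy sketch (shapes illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
d, r = 16, 2
x = rng.standard_normal(d)
g = rng.standard_normal(d)           # stand-in for the upstream gradient dL/dy

def lora_grads(A, B):
    # For y = B @ A @ x:  dL/dA = B^T g x^T,  dL/dB = g (A x)^T
    return np.outer(B.T @ g, x), np.outer(g, A @ x)

# Standard init: A random, B zero -> B still receives a non-zero gradient
A = rng.standard_normal((r, d))
dA, dB = lora_grads(A, np.zeros((d, r)))
assert not np.any(dA) and np.any(dB)

# Pathological init: both zero -> both gradients vanish; the adapter cannot learn
dA0, dB0 = lora_grads(np.zeros((r, d)), np.zeros((d, r)))
assert not np.any(dA0) and not np.any(dB0)
```

Note that dL/dA is also zero at the very first step under the standard init; A starts moving as soon as B becomes non-zero after the first update.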
Code Evidence
Linear layer initialization from `loralib/layers.py:119-125`:
```python
def reset_parameters(self):
    nn.Linear.reset_parameters(self)
    if hasattr(self, 'lora_A'):
        # initialize B the same way as the default for nn.Linear and A to zero
        # this is different than what is described in the paper but should not affect performance
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        nn.init.zeros_(self.lora_B)
```
Embedding layer initialization from `loralib/layers.py:55-60`:
```python
def reset_parameters(self):
    nn.Embedding.reset_parameters(self)
    if hasattr(self, 'lora_A'):
        # initialize A the same way as the default for nn.Linear and B to zero
        nn.init.zeros_(self.lora_A)
        nn.init.normal_(self.lora_B)
```
ConvLoRA initialization from `loralib/layers.py:268-273`:
```python
def reset_parameters(self):
    self.conv.reset_parameters()
    if hasattr(self, 'lora_A'):
        # initialize A the same way as the default for nn.Linear and B to zero
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        nn.init.zeros_(self.lora_B)
```