
Heuristic: LLMBook-zh (llmbook-zh.github.io) LoRA Initialization Strategy

From Leeroopedia



Knowledge Sources
Domains LLMs, Optimization, Parameter_Efficient_Finetuning
Last Updated 2026-02-08 04:30 GMT

Overview

Initialize LoRA matrices with A drawn from a Gaussian (mean 0, std 0.02) and B set to zeros, using rank r=16, alpha=16, and dropout=0.05 as default hyperparameters.

Description

LoRA (Low-Rank Adaptation) adds two low-rank matrices A and B to frozen pretrained weights. The initialization strategy is critical: matrix B is initialized to zero so the LoRA contribution starts at zero (preserving the original model behavior), while matrix A is initialized from a Gaussian distribution with std=0.02. The default hyperparameters use rank r=16 (balancing parameter count vs. expressiveness), alpha=16 (scaling factor equal to rank), and dropout=0.05 (light regularization).

Usage

Use this heuristic when configuring LoRA for fine-tuning. The zero-initialization of B ensures training starts from the pretrained model's behavior, preventing catastrophic divergence. The rank=16 default is a safe starting point for most 7B-13B models.
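The memory side of the rank choice is easy to quantify: a LoRA adapter on a d_out × d_in weight adds r·(d_in + d_out) trainable parameters. A minimal sketch (the 4096 × 4096 layer size is an illustrative assumption, typical of attention projections in 7B-class models):

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    # A is (r, d_in) and B is (d_out, r): r*d_in + d_out*r parameters total
    return r * (d_in + d_out)

d_in = d_out = 4096          # assumed size of one attention projection
full = d_in * d_out          # parameters in the frozen base weight

for r in (4, 8, 16, 32):
    added = lora_param_count(d_in, d_out, r)
    print(f"r={r:2d}: {added:,} LoRA params "
          f"({100 * added / full:.2f}% of the frozen layer)")
```

At the default r=16 this is 131,072 added parameters per such layer, under 1% of the frozen weight's 16.8M, which is why r=16 is a comfortable default before reaching for higher ranks.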

The Insight (Rule of Thumb)

  • Action: Initialize LoRA matrix A with `normal_(std=0.02)` and matrix B with `zero_()`.
  • Value: Default hyperparameters: `r=16`, `alpha=16`, `dropout=0.05`.
  • Trade-off: Lower rank (r=4 or r=8) saves memory but may reduce adaptation capacity. Higher rank (r=32 or r=64) increases parameters but gives more expressive power.
  • Key Principle: B=0 ensures `W' = W + BA = W + 0 = W` at initialization, preserving the pretrained model.
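The key principle above can be checked numerically. A minimal NumPy sketch (the toy shapes and input are assumptions) mirroring the initialization: with B all zeros, `delta_W = B @ A` vanishes exactly, so the adapted layer reproduces the frozen weight:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 64, 16               # toy sizes; r=16 matches the default rank

W = rng.normal(size=(d_out, d_in))        # frozen pretrained weight
A = rng.normal(0.0, 0.02, size=(r, d_in)) # A ~ Normal(0, std=0.02)
B = np.zeros((d_out, r))                  # B = 0

delta_W = B @ A                           # LoRA update: exactly zero at init
x = rng.normal(size=(d_in,))

y_base = W @ x
y_lora = (W + delta_W) @ x
print(np.abs(y_base - y_lora).max())      # 0.0: pretrained behavior preserved
```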

Reasoning

The zero-initialization of B is the core insight from the original LoRA paper. Since `delta_W = B * A`, starting with B=0 means the initial adaptation is zero, so the model begins training from its pretrained state. The Gaussian initialization of A (std=0.02) follows the common Transformer weight initialization convention. Setting alpha equal to rank (alpha/r = 1.0) provides a neutral scaling factor. The 5% dropout is light enough to avoid underfitting while providing regularization.
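The alpha/r relationship can be made concrete with a small sketch (toy shapes are assumptions): the LoRA branch is multiplied by alpha/r, so with alpha = r = 16 the factor is exactly 1.0, and raising the rank at fixed alpha proportionally shrinks the scale applied to the low-rank path:

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out, alpha = 32, 32, 16

def lora_forward(x, W, A, B, alpha, r):
    # Standard LoRA forward: frozen path plus the scaled low-rank path
    return W @ x + (alpha / r) * (B @ (A @ x))

W = rng.normal(size=(d_out, d_in))
x = rng.normal(size=(d_in,))

for r in (8, 16, 32):
    A = rng.normal(0.0, 0.02, size=(r, d_in))
    B = rng.normal(size=(d_out, r))   # nonzero B here, to make the scale visible
    print(f"r={r}: scaling factor alpha/r = {alpha / r}")
```

With the defaults (alpha = r = 16) the printed factor is 1.0, the neutral scaling the text describes.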

Code Evidence:

LoRA weight initialization from `code/7.3 LoRA基础.py:17-20`:

# Initialize A's weights from a normal distribution with std 0.02
self.A.weight.data.normal_(std=0.02)
# Initialize B's weights to zero
self.B.weight.data.zero_()

LoRA architecture definition from `code/7.3 LoRA基础.py:9-15`:

# Initialize A, which maps the input down to the rank-r space
self.A = nn.Linear(in_features, self.r, bias=False)
# Initialize B, which maps the rank-r space back to the original output space
self.B = nn.Linear(self.r, out_features, bias=False)
# Initialize a dropout layer
self.dropout = nn.Dropout(p=config.lora_dropout)
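The excerpt above defines the layers but does not show the forward pass. A plausible reconstruction of how such a module combines them (the class name and forward body are my sketch, not the book's code; the module computes only the LoRA delta, to be added to the frozen layer's output):

```python
import torch
import torch.nn as nn

class LoRALinearSketch(nn.Module):
    # Hypothetical reconstruction; the original file's forward is not excerpted.
    def __init__(self, in_features, out_features, r=16, alpha=16, dropout=0.05):
        super().__init__()
        self.r, self.alpha = r, alpha
        self.A = nn.Linear(in_features, r, bias=False)
        self.B = nn.Linear(r, out_features, bias=False)
        self.dropout = nn.Dropout(p=dropout)
        self.A.weight.data.normal_(std=0.02)  # A ~ Normal(0, std=0.02)
        self.B.weight.data.zero_()            # B = 0: zero contribution at init

    def forward(self, x):
        # delta(x) = (alpha / r) * B(A(dropout(x)))
        return (self.alpha / self.r) * self.B(self.A(self.dropout(x)))
```

Because B starts at zero, this module's output is exactly zero until training updates B, matching the preservation argument above.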

Default hyperparameters from `code/7.4 LoRA实践.py:22-26`:

lora_r: Optional[int] = HfArg(default=16, help='Lora attention dimension (the "rank")')
lora_alpha: Optional[int] = HfArg(default=16, help="The alpha parameter for Lora scaling.")
lora_dropout: Optional[float] = HfArg(default=0.05, help="The dropout probability for Lora layers.")
