Principle:Microsoft_DeepSpeedExamples_LoRA_Fusion_And_Export
Metadata
| Field | Value |
|---|---|
| Page Type | Principle |
| Title | LoRA_Fusion_And_Export |
| Sources | Paper: LoRA (https://arxiv.org/abs/2106.09685) |
| Domains | Model_Serialization, Fine_Tuning |
| Repository | Microsoft/DeepSpeedExamples |
| Application | DeepSpeed-VisualChat |
| Status | Active |
Overview
A technique for merging LoRA adapter weights back into the base model and saving the fused model for deployment.
Description
During training with LoRA (Low-Rank Adaptation), the original model weights are frozen and small low-rank matrices are learned alongside them. At inference time, these adapter weights can be fused (merged) back into the original weight matrices, eliminating the overhead of the LoRA decomposition and producing a standard model that runs at full speed without any LoRA-specific code.
The LoRA Decomposition
For each targeted linear layer, LoRA replaces the forward pass:
Original: y = W * x + b
With LoRA: y = W * x + b + (scaling * (x @ right_weight @ left_weight))
Where:
Where:
- `W` is the original frozen weight matrix of shape `[out_features, in_features]`
- `right_weight` has shape `[in_features, r]` (the transpose of the paper's `A`)
- `left_weight` has shape `[r, out_features]` (the transpose of the paper's `B`)
- `r` is the LoRA rank (typically 8-64, much smaller than both dimensions)
- `scaling = lora_scaling / lora_dim`
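The two forward passes can be sketched in a few lines of NumPy. The dimensions below are illustrative, not taken from DeepSpeed-VisualChat; the variable names mirror the formulas above.

```python
import numpy as np

rng = np.random.default_rng(0)
in_features, out_features, r = 8, 4, 2
lora_scaling = 1.0
scaling = lora_scaling / r  # lora_dim == r

W = rng.standard_normal((out_features, in_features))  # frozen base weight
b = rng.standard_normal(out_features)                 # bias
right_weight = rng.standard_normal((in_features, r))  # Kaiming-uniform in the real code
left_weight = np.zeros((r, out_features))             # zero-init: no initial contribution

x = rng.standard_normal(in_features)

y_base = W @ x + b
y_lora = W @ x + b + scaling * (x @ right_weight @ left_weight)

# With left_weight all zeros, the LoRA term vanishes and both paths agree.
assert y_lora.shape == (out_features,)
assert np.allclose(y_base, y_lora)
```

Because `left_weight` starts at zero, the adapted model initially reproduces the base model exactly; training then moves the low-rank term away from zero.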
Fusion Operation
Fusion merges the LoRA contribution into the base weight:
W_fused = W + scaling * (left_weight^T @ right_weight^T)
After fusion, the forward pass becomes:
y = W_fused * x + b
This is mathematically equivalent to the LoRA forward pass but requires no additional computation or memory for the adapter weights.
Unfusion (Reversibility)
The fusion is reversible by subtracting the same contribution:
W = W_fused - scaling * (left_weight^T @ right_weight^T)
This allows the training loop to fuse for saving, then unfuse to continue training:
for each epoch:
    train(model)
    model = fuse_lora(model)    # merge adapters for saving
    save(model)                 # save fused weights
    model = unfuse_lora(model)  # restore for continued training
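The fuse/save/unfuse cycle above can be verified numerically. This is a NumPy sketch with illustrative sizes, not DeepSpeed's implementation; `fuse` and `unfuse` are hypothetical helpers that apply the two formulas above.

```python
import numpy as np

rng = np.random.default_rng(1)
out_f, in_f, r = 4, 8, 2
scaling = 1.0 / r

W = rng.standard_normal((out_f, in_f))
right = rng.standard_normal((in_f, r))  # transpose of the paper's A
left = rng.standard_normal((r, out_f))  # transpose of the paper's B

def fuse(W, left, right, scaling):
    # W_fused = W + scaling * left^T @ right^T
    return W + scaling * (left.T @ right.T)

def unfuse(W_fused, left, right, scaling):
    # W = W_fused - scaling * left^T @ right^T
    return W_fused - scaling * (left.T @ right.T)

x = rng.standard_normal(in_f)
y_lora = W @ x + scaling * (x @ right @ left)   # decomposed forward pass

W_fused = fuse(W, left, right, scaling)
y_fused = W_fused @ x                           # plain linear forward pass

assert np.allclose(y_lora, y_fused)                            # fusion preserves outputs
assert np.allclose(unfuse(W_fused, left, right, scaling), W)   # subtraction restores W
```

The equivalence holds because `(left^T @ right^T) @ x == (x @ right) @ left` by associativity of matrix multiplication; fusion changes where the product is evaluated, not what it computes.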
Theoretical Basis
LoRA Mathematical Foundation
LoRA (Hu et al., 2021) decomposes the weight update as a low-rank product:
Delta_W = scaling * B * A
where:
B in R^(d x r) # lora_left_weight (transposed in code)
A in R^(r x k) # lora_right_weight (transposed in code)
r << min(d, k) # rank constraint
Full weight during training (adapters kept as separate matrices):
W_effective = W_0 + scaling * B * A
Fused weight for deployment (sum materialized into a single matrix):
W_fused = W_0 + scaling * B * A
The two expressions are identical; fusion does not change the function the model computes, it only precomputes the sum so the adapters no longer need to be stored or applied at inference time.
The key insight is that by keeping r small (e.g., 16), the number of trainable parameters is r * (d + k) instead of d * k, providing orders of magnitude reduction.
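The parameter savings are easy to check with concrete numbers. The 4096x4096 projection below is an illustrative size, not one taken from DeepSpeed-VisualChat:

```python
# Trainable parameters for one square projection layer with LoRA rank 16
d, k, r = 4096, 4096, 16

full_finetune = d * k       # training the dense weight directly
lora = r * (d + k)          # training only B (d x r) and A (r x k)

assert full_finetune == 16_777_216
assert lora == 131_072
assert full_finetune // lora == 128  # ~two orders of magnitude fewer parameters
```

The ratio `d * k / (r * (d + k))` grows with the layer size, so the savings are largest on the biggest projection matrices.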
Initialization Strategy
In DeepSpeed-VisualChat's implementation:
- `right_weight` (the paper's `A`, transposed) is initialized with Kaiming uniform (He initialization)
- `left_weight` (the paper's `B`, transposed) is initialized to zeros
- This ensures the initial LoRA contribution is zero: Delta_W = scaling * B * A = scaling * 0 * A = 0
- The model starts from its pre-trained state and gradually learns the adaptation
Scaling Factor
The scaling factor normalizes the LoRA contribution:
scaling = lora_scaling / lora_dim
With default lora_scaling = 1, this becomes 1/r, which prevents the LoRA contribution from growing proportionally with rank.
ZeRO-3 Parameter Gathering
In ZeRO Stage 3, model parameters are partitioned across GPUs. Before fusion or saving, all relevant parameters must be gathered to a single rank:
with deepspeed.zero.GatheredParameters(
        [weight, bias, lora_left_weight, lora_right_weight],
        modifier_rank=0,
        enabled=zero_stage_3):
    module.fuse_lora_weight()
For model saving in ZeRO-3, each parameter is individually gathered:
for k, v in model.named_parameters():
    if hasattr(v, 'ds_id'):
        with deepspeed.zero.GatheredParameters([v], enabled=True):
            v_p = v.data.clone().detach().cpu()
        if global_rank == 0 and "lora" not in k:
            output_state_dict[k] = v_p
Note that LoRA parameters are excluded from the saved state dict ("lora" not in k) because they have already been fused into the base weights.
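The exclusion itself is a plain substring filter over parameter names and can be sketched without DeepSpeed. The key names below are hypothetical examples, not DeepSpeed-VisualChat's actual parameter names:

```python
# Gathered parameters keyed by name (values stand in for CPU tensors).
gathered = {
    "transformer.h.0.attn.q_proj.weight": "fused base weight",
    "transformer.h.0.attn.q_proj.bias": "bias",
    "transformer.h.0.attn.q_proj.lora_right_weight": "adapter, already merged",
    "transformer.h.0.attn.q_proj.lora_left_weight": "adapter, already merged",
}

# Keep only non-LoRA keys, mirroring the `"lora" not in k` check above.
output_state_dict = {k: v for k, v in gathered.items() if "lora" not in k}

assert "transformer.h.0.attn.q_proj.weight" in output_state_dict
assert all("lora" not in k for k in output_state_dict)
assert len(output_state_dict) == 2
```

The result is a checkpoint that any standard loader can consume, with no trace of the LoRA decomposition.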
Key Considerations
- Fusion before saving -- Always fuse LoRA weights before saving the model. The saved model should contain only the fused base weights, not separate LoRA matrices.
- Unfusion for continued training -- After saving, unfuse LoRA weights to restore the decomposed form for continued gradient computation. Training LoRA requires the separate matrices.
- Memory during fusion -- The fusion operation requires the full weight matrix to be in memory (not partitioned). For ZeRO-3, this means temporary gathering which increases peak memory.
- Rank 0 saving -- Only rank 0 (global rank == 0) actually writes the state dict to disk. Other ranks participate in gathering but do not save.
- LoRA exclusion from checkpoints -- When saving fused models, LoRA parameter keys are excluded from the state dict since their contribution is already merged into the base weights.
- Dropout during training -- LoRA layers include an optional dropout (`lora_dropout`) applied to the input before the low-rank transformation. It is active only during training and disabled during evaluation.
Related Pages
- Implementation:Microsoft_DeepSpeedExamples_Fuse_LoRA -- The concrete LoRA fusion implementation
- Principle:Microsoft_DeepSpeedExamples_Multimodal_Distributed_Training -- The training loop that performs per-epoch fusion
- Principle:Microsoft_DeepSpeedExamples_Multimodal_Model_Composition -- The model architecture containing LoRA layers
- Heuristic:Microsoft_DeepSpeedExamples_LoRA_Learning_Rate_Scaling