Principle:Microsoft_DeepSpeedExamples_LoRA_Fusion_And_Export
Metadata
| Field | Value |
|---|---|
| Page Type | Principle |
| Title | LoRA_Fusion_And_Export |
| Sources | Paper: LoRA (https://arxiv.org/abs/2106.09685) |
| Domains | Model_Serialization, Fine_Tuning |
| Repository | Microsoft/DeepSpeedExamples |
| Application | DeepSpeed-VisualChat |
| Status | Active |
Overview
A technique for merging LoRA adapter weights back into the base model and saving the fused model for deployment.
Description
During training with LoRA (Low-Rank Adaptation), the original model weights are frozen and small low-rank matrices are learned alongside them. At inference time, these adapter weights can be fused (merged) back into the original weight matrices, eliminating the overhead of the LoRA decomposition and producing a standard model that runs at full speed without any LoRA-specific code.
The LoRA Decomposition
For each targeted linear layer, LoRA replaces the forward pass:
Original: y = W * x + b
With LoRA: y = W * x + b + (scaling * (x @ right_weight @ left_weight))
Where:
Where:
- `W` is the original frozen weight matrix of shape `[out_features, in_features]`
- `right_weight` has shape `[in_features, r]` (the transpose of the paper's `A`)
- `left_weight` has shape `[r, out_features]` (the transpose of the paper's `B`)
- `r` is the LoRA rank (typically 8-64, much smaller than both dimensions)
- `scaling = lora_scaling / lora_dim`
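The two forward passes can be sketched in a few lines of NumPy. The dimensions below are illustrative, not taken from DeepSpeed-VisualChat; the variable names mirror the formulas above.

```python
import numpy as np

rng = np.random.default_rng(0)
in_features, out_features, r = 8, 4, 2
lora_scaling = 1.0
scaling = lora_scaling / r  # lora_dim == r

W = rng.standard_normal((out_features, in_features))  # frozen base weight
b = rng.standard_normal(out_features)                 # bias
right_weight = rng.standard_normal((in_features, r))  # Kaiming-uniform in the real code
left_weight = np.zeros((r, out_features))             # zero-init: no initial contribution

x = rng.standard_normal(in_features)

y_base = W @ x + b
y_lora = W @ x + b + scaling * (x @ right_weight @ left_weight)

# With left_weight all zeros, the LoRA term vanishes and both paths agree.
assert y_lora.shape == (out_features,)
assert np.allclose(y_base, y_lora)
```

Because `left_weight` starts at zero, the adapted model initially reproduces the base model exactly; training then moves the low-rank term away from zero.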
Fusion Operation
Fusion merges the LoRA contribution into the base weight:
W_fused = W + scaling * (left_weight^T @ right_weight^T)
After fusion, the forward pass becomes:
y = W_fused * x + b
This is mathematically equivalent to the LoRA forward pass but requires no additional computation or memory for the adapter weights.
Unfusion (Reversibility)
The fusion is reversible by subtracting the same contribution:
W = W_fused - scaling * (left_weight^T @ right_weight^T)
This allows the training loop to fuse for saving, then unfuse to continue training:
for each epoch:
    train(model)
    model = fuse_lora(model)    # merge adapters for saving
    save(model)                 # save fused weights
    model = unfuse_lora(model)  # restore for continued training
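The fuse/save/unfuse cycle above can be verified numerically. This is a NumPy sketch with illustrative sizes, not DeepSpeed's implementation; `fuse` and `unfuse` are hypothetical helpers that apply the two formulas above.

```python
import numpy as np

rng = np.random.default_rng(1)
out_f, in_f, r = 4, 8, 2
scaling = 1.0 / r

W = rng.standard_normal((out_f, in_f))
right = rng.standard_normal((in_f, r))  # transpose of the paper's A
left = rng.standard_normal((r, out_f))  # transpose of the paper's B

def fuse(W, left, right, scaling):
    # W_fused = W + scaling * left^T @ right^T
    return W + scaling * (left.T @ right.T)

def unfuse(W_fused, left, right, scaling):
    # W = W_fused - scaling * left^T @ right^T
    return W_fused - scaling * (left.T @ right.T)

x = rng.standard_normal(in_f)
y_lora = W @ x + scaling * (x @ right @ left)   # decomposed forward pass

W_fused = fuse(W, left, right, scaling)
y_fused = W_fused @ x                           # plain linear forward pass

assert np.allclose(y_lora, y_fused)                            # fusion preserves outputs
assert np.allclose(unfuse(W_fused, left, right, scaling), W)   # subtraction restores W
```

The equivalence holds because `(left^T @ right^T) @ x == (x @ right) @ left` by associativity of matrix multiplication; fusion changes where the product is evaluated, not what it computes.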
Theoretical Basis
LoRA Mathematical Foundation
LoRA (Hu et al., 2021) decomposes the weight update as a low-rank product:
Delta_W = scaling * B * A
where:
B in R^(d x r) # lora_left_weight (transposed in code)
A in R^(r x k) # lora_right_weight (transposed in code)
r << min(d, k) # rank constraint
Full weight during training (adapters kept as separate matrices):
W_effective = W_0 + scaling * B * A
Fused weight for deployment (sum materialized into a single matrix):
W_fused = W_0 + scaling * B * A
The two expressions are identical; fusion does not change the function the model computes, it only precomputes the sum so the adapters no longer need to be stored or applied at inference time.
The key insight is that by keeping r small (e.g., 16), the number of trainable parameters is r * (d + k) instead of d * k, providing orders of magnitude reduction.
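The parameter savings are easy to check with concrete numbers. The 4096x4096 projection below is an illustrative size, not one taken from DeepSpeed-VisualChat:

```python
# Trainable parameters for one square projection layer with LoRA rank 16
d, k, r = 4096, 4096, 16

full_finetune = d * k       # training the dense weight directly
lora = r * (d + k)          # training only B (d x r) and A (r x k)

assert full_finetune == 16_777_216
assert lora == 131_072
assert full_finetune // lora == 128  # ~two orders of magnitude fewer parameters
```

The ratio `d * k / (r * (d + k))` grows with the layer size, so the savings are largest on the biggest projection matrices.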
Initialization Strategy
In DeepSpeed-VisualChat's implementation:
- `right_weight` (the paper's `A`, transposed) is initialized with Kaiming uniform (He initialization)
- `left_weight` (the paper's `B`, transposed) is initialized to zeros
- This ensures the initial LoRA contribution is zero: Delta_W = scaling * B * A = scaling * 0 * A = 0
- The model starts from its pre-trained state and gradually learns the adaptation
Scaling Factor
The scaling factor normalizes the LoRA contribution:
scaling = lora_scaling / lora_dim
With default lora_scaling = 1, this becomes 1/r, which prevents the LoRA contribution from growing proportionally with rank.
ZeRO-3 Parameter Gathering
In ZeRO Stage 3, model parameters are partitioned across GPUs. Before fusion or saving, all relevant parameters must be gathered to a single rank:
with deepspeed.zero.GatheredParameters(
        [weight, bias, lora_left_weight, lora_right_weight],
        modifier_rank=0,
        enabled=zero_stage_3):
    module.fuse_lora_weight()
For model saving in ZeRO-3, each parameter is individually gathered:
for k, v in model.named_parameters():
    if hasattr(v, 'ds_id'):
        with deepspeed.zero.GatheredParameters([v], enabled=True):
            v_p = v.data.clone().detach().cpu()
        if global_rank == 0 and "lora" not in k:
            output_state_dict[k] = v_p
Note that LoRA parameters are excluded from the saved state dict ("lora" not in k) because they have already been fused into the base weights.
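The exclusion itself is a plain substring filter over parameter names and can be sketched without DeepSpeed. The key names below are hypothetical examples, not DeepSpeed-VisualChat's actual parameter names:

```python
# Gathered parameters keyed by name (values stand in for CPU tensors).
gathered = {
    "transformer.h.0.attn.q_proj.weight": "fused base weight",
    "transformer.h.0.attn.q_proj.bias": "bias",
    "transformer.h.0.attn.q_proj.lora_right_weight": "adapter, already merged",
    "transformer.h.0.attn.q_proj.lora_left_weight": "adapter, already merged",
}

# Keep only non-LoRA keys, mirroring the `"lora" not in k` check above.
output_state_dict = {k: v for k, v in gathered.items() if "lora" not in k}

assert "transformer.h.0.attn.q_proj.weight" in output_state_dict
assert all("lora" not in k for k in output_state_dict)
assert len(output_state_dict) == 2
```

The result is a checkpoint that any standard loader can consume, with no trace of the LoRA decomposition.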
Key Considerations
- Fusion before saving -- Always fuse LoRA weights before saving the model. The saved model should contain only the fused base weights, not separate LoRA matrices.
- Unfusion for continued training -- After saving, unfuse LoRA weights to restore the decomposed form for continued gradient computation. Training LoRA requires the separate matrices.
- Memory during fusion -- The fusion operation requires the full weight matrix to be in memory (not partitioned). For ZeRO-3, this means temporary gathering which increases peak memory.
- Rank 0 saving -- Only rank 0 (global rank == 0) actually writes the state dict to disk. Other ranks participate in gathering but do not save.
- LoRA exclusion from checkpoints -- When saving fused models, LoRA parameter keys are excluded from the state dict since their contribution is already merged into the base weights.
- Dropout during training -- LoRA layers include an optional dropout (`lora_dropout`) applied to the input before the low-rank transformation. It is active only during training and disabled during evaluation.
Related Pages
- Implementation:Microsoft_DeepSpeedExamples_Fuse_LoRA -- The concrete LoRA fusion implementation
- Principle:Microsoft_DeepSpeedExamples_Multimodal_Distributed_Training -- The training loop that performs per-epoch fusion
- Principle:Microsoft_DeepSpeedExamples_Multimodal_Model_Composition -- The model architecture containing LoRA layers
- Heuristic:Microsoft_DeepSpeedExamples_LoRA_Learning_Rate_Scaling