
Principle:Microsoft DeepSpeedExamples LoRA Fusion And Export

From Leeroopedia


  1. Principle: LoRA_Fusion_And_Export

Metadata

Field        Value
Page Type    Principle
Title        LoRA_Fusion_And_Export
Sources      Paper: LoRA (https://arxiv.org/abs/2106.09685)
Domains      Model_Serialization, Fine_Tuning
Repository   Microsoft/DeepSpeedExamples
Application  DeepSpeed-VisualChat
Status       Active

Overview

A technique for merging LoRA adapter weights back into the base model and saving the fused model for deployment.

Description

During training with LoRA (Low-Rank Adaptation), the original model weights are frozen and small low-rank matrices are learned alongside them. At inference time, these adapter weights can be fused (merged) back into the original weight matrices, eliminating the overhead of the LoRA decomposition and producing a standard model that runs at full speed without any LoRA-specific code.

The LoRA Decomposition

For each targeted linear layer, LoRA replaces the forward pass:

Original:     y = W * x + b
With LoRA:    y = W * x + b + (scaling * (x @ right_weight @ left_weight))

Where:

  • W is the original frozen weight matrix of shape [out_features, in_features]
  • right_weight (B) has shape [in_features, r]
  • left_weight (A) has shape [r, out_features]
  • r is the LoRA rank (typically 8-64, much smaller than both dimensions)
  • scaling = lora_scaling / lora_dim
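The forward pass above can be sketched numerically. This is an illustrative NumPy stand-in for the PyTorch layer, with the function name chosen here and shapes matching the bullet list:

```python
import numpy as np

def lora_forward(x, W, b, right_weight, left_weight, scaling):
    """One LoRA-augmented linear layer (illustrative sketch).

    x            : [in_features] input
    W            : [out_features, in_features] frozen base weight
    right_weight : [in_features, r]   (B)
    left_weight  : [r, out_features]  (A)
    """
    base = W @ x + b                                    # original frozen path
    delta = scaling * (x @ right_weight @ left_weight)  # low-rank adapter path
    return base + delta
```

Note that the adapter path adds only r * (in_features + out_features) trainable weights per layer.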

Fusion Operation

Fusion merges the LoRA contribution into the base weight:

W_fused = W + scaling * (left_weight^T @ right_weight^T)

After fusion, the forward pass becomes:

y = W_fused * x + b

This is mathematically equivalent to the LoRA forward pass but requires no additional computation or memory for the adapter weights.
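The equivalence can be checked numerically. A minimal NumPy sketch with arbitrary sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
in_f, out_f, r = 8, 5, 2
scaling = 1.0 / r                            # default lora_scaling = 1

W = rng.normal(size=(out_f, in_f))           # frozen base weight
B = rng.normal(size=(in_f, r))               # right_weight
A = rng.normal(size=(r, out_f))              # left_weight
x = rng.normal(size=in_f)

y_lora  = W @ x + scaling * (x @ B @ A)      # decomposed forward pass
W_fused = W + scaling * (A.T @ B.T)          # fusion operation
y_fused = W_fused @ x                        # single matmul after fusion

assert np.allclose(y_lora, y_fused)          # identical up to float rounding
```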

Unfusion (Reversibility)

The fusion is reversible by subtracting the same contribution:

W = W_fused - scaling * (left_weight^T @ right_weight^T)

This allows the training loop to fuse for saving, then unfuse to continue training:

for each epoch:
    train(model)
    model = fuse_lora(model)       # merge adapters for saving
    save(model)                     # save fused weights
    model = unfuse_lora(model)     # restore for continued training
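The fuse/unfuse pair for a single weight matrix can be sketched as an in-place add and subtract (NumPy sketch; DeepSpeed applies the same pattern per LoRA layer, but these function names are illustrative):

```python
import numpy as np

def fuse_lora(W, right_weight, left_weight, scaling):
    """Merge the LoRA contribution into the base weight in place."""
    W += scaling * (left_weight.T @ right_weight.T)
    return W

def unfuse_lora(W, right_weight, left_weight, scaling):
    """Subtract the same contribution, restoring the decomposed form."""
    W -= scaling * (left_weight.T @ right_weight.T)
    return W
```

In floating point the round trip is only approximately exact (add then subtract reintroduces rounding), which is normally harmless for continued training.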

Theoretical Basis

LoRA Mathematical Foundation

LoRA (Hu et al., 2021) decomposes the weight update as a low-rank product:

Delta_W = scaling * B * A

where:
    B in R^(d x r)     # lora_right_weight (transposed in code)
    A in R^(r x k)     # lora_left_weight (transposed in code)
    r << min(d, k)     # rank constraint

Full weight during training (kept as two separate factors):
    W_effective = W_0 + scaling * B * A

Fused weight for deployment (materialized as one matrix):
    W_fused = W_0 + scaling * B * A

The two expressions are identical; fusion simply evaluates the sum ahead of time so inference needs only the single matrix W_fused.

The key insight is that by keeping r small (e.g., 16), the number of trainable parameters is r * (d + k) instead of d * k, providing orders of magnitude reduction.
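For concreteness, here is the arithmetic for a hypothetical 4096 x 4096 projection with r = 16:

```python
# Trainable-parameter count for one projection layer (sizes are illustrative)
d, k, r = 4096, 4096, 16

full_ft = d * k          # full fine-tuning: 16,777,216 trainable params
lora_ft = r * (d + k)    # LoRA factors:        131,072 trainable params

print(full_ft // lora_ft)  # -> 128, a 128x reduction for this layer
```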

Initialization Strategy

In DeepSpeed-VisualChat's implementation:

  • right_weight (B) is initialized with Kaiming uniform (He initialization)
  • left_weight (A) is initialized to zeros
  • This ensures the initial LoRA contribution is zero: Delta_W = scaling * B * 0 = 0
  • The model starts from its pre-trained state and gradually learns the adaptation
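The zero-contribution property at initialization is easy to verify (NumPy sketch; the uniform bound below merely stands in for Kaiming initialization):

```python
import numpy as np

rng = np.random.default_rng(0)
in_f, out_f, r = 8, 4, 2

bound = np.sqrt(6.0 / in_f)                      # Kaiming-style bound (illustrative)
B = rng.uniform(-bound, bound, size=(in_f, r))   # right_weight: random init
A = np.zeros((r, out_f))                         # left_weight: zeros

x = rng.normal(size=in_f)
delta = (1.0 / r) * (x @ B @ A)                  # LoRA contribution at step 0

assert np.allclose(delta, 0.0)                   # output equals the pre-trained model's
```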

Scaling Factor

The scaling factor normalizes the LoRA contribution:

scaling = lora_scaling / lora_dim

With default lora_scaling = 1, this becomes 1/r, which prevents the LoRA contribution from growing proportionally with rank.

ZeRO-3 Parameter Gathering

In ZeRO Stage 3, model parameters are partitioned across GPUs. Before fusion or saving, all relevant parameters must be gathered to a single rank:

with deepspeed.zero.GatheredParameters(
    [weight, bias, lora_left_weight, lora_right_weight],
    modifier_rank=0,
    enabled=zero_stage_3):
    module.fuse_lora_weight()

For model saving in ZeRO-3, each parameter is individually gathered:

for k, v in model.named_parameters():
    if hasattr(v, 'ds_id'):
        # ZeRO-3 partitioned parameter: gather the full tensor before copying
        with deepspeed.zero.GatheredParameters([v], enabled=True):
            v_p = v.data.clone().detach().cpu()
    else:
        v_p = v.data.clone().detach().cpu()
    if global_rank == 0 and "lora" not in k:
        output_state_dict[k] = v_p

Note that LoRA parameters are excluded from the saved state dict ("lora" not in k) because they have already been fused into the base weights.

Key Considerations

  • Fusion before saving -- Always fuse LoRA weights before saving the model. The saved model should contain only the fused base weights, not separate LoRA matrices.
  • Unfusion for continued training -- After saving, unfuse LoRA weights to restore the decomposed form for continued gradient computation. Training LoRA requires the separate matrices.
  • Memory during fusion -- The fusion operation requires the full weight matrix to be in memory (not partitioned). For ZeRO-3, this means temporary gathering which increases peak memory.
  • Rank 0 saving -- Only rank 0 (global rank == 0) actually writes the state dict to disk. Other ranks participate in gathering but do not save.
  • LoRA exclusion from checkpoints -- When saving fused models, LoRA parameter keys are excluded from the state dict since their contribution is already merged into the base weights.
  • Dropout during training -- LoRA layers include optional dropout (lora_dropout) applied to the input before the low-rank transformation. This is only active during training and disabled during evaluation.
