Workflow:Microsoft LoRA LoRA Integration

From Leeroopedia


Knowledge Sources
Domains: LLMs, Parameter_Efficient_Fine_Tuning, PyTorch
Last Updated: 2026-02-10 05:30 GMT

Overview

End-to-end process for integrating Low-Rank Adaptation (LoRA) into any existing PyTorch model using the loralib package.

Description

This workflow describes the standard operating procedure for adapting a pretrained PyTorch model using LoRA. The technique reduces trainable parameters by orders of magnitude (e.g., from 125M to 0.8M for RoBERTa-base) by learning pairs of low-rank decomposition matrices while keeping original weights frozen. The process covers layer replacement, parameter freezing, training, checkpoint saving with minimal storage, and weight merging for zero-latency inference.
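
The parameter saving can be checked with simple arithmetic. The sketch below compares one dense weight matrix against its LoRA pair; the hidden size of 768 and rank of 8 are illustrative assumptions, and the helper names are hypothetical. It counts a single matrix only, so it does not reproduce the model-level totals quoted above.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Weights in a dense d_out x d_in matrix (bias ignored)."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Weights in the LoRA pair: A is r x d_in, B is d_out x r."""
    return r * d_in + d_out * r


d_in = d_out = 768  # illustrative hidden size (assumption)
r = 8               # illustrative rank (assumption)

dense = full_params(d_in, d_out)
adapter = lora_params(d_in, d_out, r)
print(f"dense: {dense:,}  lora: {adapter:,}  ratio: {dense / adapter:.0f}x")
```

For a square matrix the ratio is d / (2r), so halving the rank doubles the saving.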

Usage

Execute this workflow when you need to adapt a pretrained PyTorch model (any architecture built from the supported layer types) to a downstream task while minimizing the number of trainable parameters and checkpoint storage. This is particularly useful when GPU memory is limited, when you need to serve multiple task-specific adapters from the same base model, or when you want to reduce the risk of catastrophic forgetting of pretrained knowledge.

Execution Steps

Step 1: Install loralib

Install the loralib Python package, which provides drop-in LoRA-enhanced replacements for standard PyTorch layers. The package supports nn.Linear, nn.Embedding, and nn.Conv2d layers, plus a special MergedLinear layer for combined projection matrices.

Key considerations:

  • Requires Python >= 3.6 and PyTorch
  • Can be installed from PyPI or directly from the GitHub repository
  • The package is lightweight with no dependencies beyond PyTorch
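
After installing (for example with `pip install loralib`), a quick environment check can confirm the requirements listed above. This is a minimal sketch: `check_environment` is a hypothetical helper, not part of loralib, and it only tests whether the packages can be found, not whether their versions are compatible.

```python
import importlib.util
import sys


def check_environment(packages=("torch", "loralib"), min_python=(3, 6)):
    """Return a dict saying whether each requirement appears to be met."""
    report = {"python": sys.version_info >= min_python}
    for name in packages:
        # find_spec returns None when a top-level package is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for requirement, ok in check_environment().items():
        print(f"{requirement}: {'ok' if ok else 'missing'}")
```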

Step 2: Replace Target Layers

Identify which layers in the model should be adapted with LoRA and replace them with loralib equivalents. Typically, the query and value projection matrices in transformer attention layers are targeted, but any Linear, Embedding, or Conv layer can be adapted.

Key considerations:

  • Each LoRA layer requires specifying a rank r which controls the dimensionality of the low-rank decomposition
  • For combined QKV projections, use MergedLinear with an enable_lora mask to selectively apply LoRA to specific sub-projections
  • The fan_in_fan_out flag must be set to True for layers that store weights in transposed format (e.g., HuggingFace Conv1D)
  • Higher rank r means more trainable parameters but potentially better task performance
  • The scaling factor lora_alpha / r controls the magnitude of the LoRA update

Step 3: Freeze Non-LoRA Parameters

Before training begins, freeze all pretrained parameters so that only the injected LoRA matrices (lora_A and lora_B) are updated during backpropagation. Optionally, bias vectors can also be made trainable alongside LoRA parameters.

Key considerations:

  • Three bias modes are available: none (only LoRA params), all (all biases in model), lora_only (biases in LoRA-adapted layers only)
  • After freezing, the number of trainable parameters is typically less than 1% of the original model
  • The freezing utility identifies LoRA parameters by looking for "lora_" in parameter names

Step 4: Train the Model

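Run the standard PyTorch training loop. Because the pretrained weights are frozen, no weight gradients or optimizer state are kept for them, which lowers memory usage and can speed up training compared to full fine-tuning. Activations are still backpropagated through the frozen layers, so the backward pass is not eliminated.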

Key considerations:

  • Use standard optimizers (Adam, AdamW); only the LoRA parameters, plus any biases left trainable, receive gradient updates
  • Learning rates may need to be higher than in full fine-tuning since fewer parameters are being updated
  • Label smoothing and gradient clipping can be applied as usual
  • Distributed training works normally since LoRA is transparent to the training infrastructure

Step 5: Save LoRA Checkpoint

Save only the LoRA parameters to create a minimal checkpoint. This produces files that are orders of magnitude smaller than full model checkpoints (e.g., 1.5 MB vs 1.5 GB for GPT-2 Medium).

Key considerations:

  • The lora_state_dict utility extracts parameters with "lora_" in their names, plus the biases selected by its bias mode
  • The bias mode used during saving must match the mode used during freezing
  • The original pretrained checkpoint is still needed to reconstruct the full model
  • Multiple task-specific LoRA checkpoints can share the same base model checkpoint

Step 6: Load and Merge for Inference

Load the pretrained base checkpoint and LoRA checkpoint separately, then switch to eval mode to merge LoRA weights into the base weights. This produces a model with identical architecture to the original, introducing zero additional inference latency.

Key considerations:

  • Load both checkpoints with strict=False, since the pretrained checkpoint has no LoRA parameters and the LoRA checkpoint contains only a subset of parameters
  • Load the pretrained checkpoint first, then the LoRA checkpoint
  • Calling model.eval() triggers automatic weight merging (W' = W + BA * scaling, where scaling = lora_alpha / r)
  • Calling model.train() un-merges the weights for continued training
  • Weight merging can be disabled by setting merge_weights=False on LoRA layers
