Principle:Hiyouga LLaMA Factory Low Rank Adaptation
| Knowledge Sources | |
|---|---|
| Domains | Natural Language Processing, Parameter-Efficient Fine-Tuning, Transfer Learning |
| Last Updated | 2026-02-06 19:00 GMT |
Overview
A parameter-efficient fine-tuning technique that adapts large pretrained models by injecting trainable low-rank decomposition matrices into existing weight layers, dramatically reducing the number of trainable parameters while preserving model quality.
Description
Low-Rank Adaptation (LoRA), introduced by Hu et al. (2021), is a parameter-efficient fine-tuning (PEFT) method based on the hypothesis that the weight updates during fine-tuning have low intrinsic rank. Instead of updating all model parameters, LoRA freezes the pretrained weights and injects small trainable rank-decomposition matrices alongside the original weight matrices. This enables fine-tuning models with billions of parameters using a fraction of the GPU memory.
LoRA and its variants occupy a central position in modern LLM fine-tuning because:
- Memory efficiency: Only the low-rank matrices require gradients and optimizer states, so the memory spent on those two items shrinks by orders of magnitude. The frozen base weights still have to be held in memory.
- No inference overhead: The adapter matrices can be merged back into the base weights after training, resulting in zero additional latency at inference.
- Modular adapters: Multiple task-specific adapters can be trained independently and swapped at inference time on a shared base model.
- Compatibility with quantization: LoRA can be applied to quantized models (QLoRA), enabling fine-tuning of very large models on consumer hardware.
The LLaMA-Factory framework supports several LoRA variants and related adapter methods:
- Standard LoRA: Trainable low-rank matrices $A$ and $B$ with configurable rank, alpha, and dropout.
- DoRA (Weight-Decomposed Low-Rank Adaptation): Decomposes weight updates into magnitude and direction components for improved training stability.
- rsLoRA (Rank-Stabilized LoRA): Adjusts the scaling factor to stabilize training across different rank values.
- PiSSA (Principal Singular Values and Singular Vectors Adaptation): Initializes LoRA matrices using the principal components of the pretrained weights via SVD.
- OFT (Orthogonal Fine-Tuning): An alternative adapter method that applies orthogonal transformations to preserve pretrained feature relationships.
Usage
Use LoRA when you want to:
- Fine-tune a large model with limited GPU memory (e.g., fine-tuning a 70B model on a single GPU with quantization).
- Train multiple task-specific adapters that share a common base model.
- Maintain the ability to merge adapters back into the base model for deployment.
- Apply fine-tuning to quantized models (QLoRA).
- Achieve performance competitive with full fine-tuning while training only a small fraction of parameters.
LoRA is appropriate for virtually all fine-tuning scenarios and is the recommended default in LLaMA-Factory when full-parameter training is infeasible.
Theoretical Basis
Low-Rank Weight Update
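For a pretrained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, LoRA parameterizes the weight update as a low-rank decomposition:
$$\Delta W = BA$$
where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ are trainable matrices with rank $r \ll \min(d, k)$. During training, $W_0$ is frozen and only $A$ and $B$ receive gradient updates.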
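The shapes involved can be checked with a few lines of NumPy. This is a minimal sketch: the sizes `d`, `k`, and `r` are illustrative assumptions rather than values from any real model, and random values stand in for trained factors.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 64, 48, 4  # illustrative sizes (assumption)

W0 = rng.standard_normal((d, k))  # frozen pretrained weight, d x k
B = rng.standard_normal((d, r))   # trainable factor, d x r
A = rng.standard_normal((r, k))   # trainable factor, r x k

delta_W = B @ A                   # full-size update assembled from the two factors
update_rank = np.linalg.matrix_rank(delta_W)
trainable = A.size + B.size       # parameters that would receive gradients

print(delta_W.shape, update_rank, trainable)
```

The update has the same shape as `W0`, but its rank can never exceed `r`, which is the restriction LoRA accepts in exchange for training far fewer parameters.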
Forward Pass
The modified forward pass for an input $x$ is:
$$h = W_0 x + \frac{\alpha}{r} BAx$$
where $\alpha$ is the LoRA scaling factor (set via lora_alpha) and $r$ is the rank. The ratio $\alpha / r$ controls the magnitude of the adapter's contribution relative to the pretrained weights. In LLaMA-Factory, the default scaling is $\alpha = 2r$.
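The forward pass, and the reason a trained adapter can be merged without adding inference cost, can be sketched in NumPy. The helper names `lora_forward` and `merge_lora` are hypothetical and not part of LLaMA-Factory or PEFT; the sizes and the choice of alpha equal to twice the rank are assumptions made for illustration.

```python
import numpy as np

def lora_forward(x, W0, A, B, alpha, r):
    """Compute W0 x + (alpha / r) * B (A x) without forming B @ A."""
    return W0 @ x + (alpha / r) * (B @ (A @ x))

def merge_lora(W0, A, B, alpha, r):
    """Fold the scaled adapter into a single dense weight matrix."""
    return W0 + (alpha / r) * (B @ A)

rng = np.random.default_rng(0)
d, k, r = 32, 16, 4          # illustrative sizes (assumption)
alpha = 2 * r                # assumed scaling choice, giving alpha / r = 2

W0 = rng.standard_normal((d, k))
A = rng.standard_normal((r, k))
B = rng.standard_normal((d, r))  # stand-in for trained values
x = rng.standard_normal((k, 5))  # batch of 5 input column vectors

h_adapter = lora_forward(x, W0, A, B, alpha, r)
h_merged = merge_lora(W0, A, B, alpha, r) @ x
print(np.allclose(h_adapter, h_merged))
```

Both paths give the same output up to floating-point error, so a deployment can keep the adapter separate (to swap tasks) or merge it (to keep a single matrix multiply per layer).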
Initialization
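Standard LoRA initializes $A$ with a random Gaussian distribution and $B$ with zeros, ensuring the adapter contribution is zero at the start of training:
$$A \sim \mathcal{N}(0, \sigma^2), \quad B = 0 \quad \Rightarrow \quad \Delta W = BA = 0$$
PiSSA initialization instead decomposes the pretrained weight using SVD and initializes the adapter with the principal components:
$$W_0 = U S V^\top, \quad B = U_{[:, :r]} S_{[:r, :r]}^{1/2}, \quad A = S_{[:r, :r]}^{1/2} V_{[:, :r]}^\top, \quad W_{\text{res}} = W_0 - BA$$
where the adapter $BA$ captures the top-$r$ singular components and the frozen residual $W_{\text{res}}$ captures the rest.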
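The two initialization schemes can be compared in a short NumPy sketch. The function names `standard_init` and `pissa_init` are hypothetical, the standard deviation is an arbitrary assumption, and the sketch ignores the $\alpha / r$ scaling factor that a real implementation has to account for.

```python
import numpy as np

def standard_init(d, k, r, rng, std=0.01):
    """Gaussian A and zero B, so B @ A is exactly zero at the start."""
    A = rng.normal(0.0, std, size=(r, k))
    B = np.zeros((d, r))
    return A, B

def pissa_init(W0, r):
    """Split W0 into a rank-r principal part (B @ A) and a frozen residual."""
    U, S, Vt = np.linalg.svd(W0, full_matrices=False)
    sqrt_S = np.sqrt(S[:r])
    B = U[:, :r] * sqrt_S            # scale each of the r columns
    A = sqrt_S[:, None] * Vt[:r, :]  # scale each of the r rows
    W_res = W0 - B @ A               # residual that stays frozen
    return A, B, W_res

rng = np.random.default_rng(0)
d, k, r = 32, 24, 4                  # illustrative sizes (assumption)
W0 = rng.standard_normal((d, k))

A_std, B_std = standard_init(d, k, r, rng)
A_p, B_p, W_res = pissa_init(W0, r)
singular_values = np.linalg.svd(W0, compute_uv=False)

print(np.allclose(B_std @ A_std, 0.0))
print(np.allclose(W_res + B_p @ A_p, W0))
```

In both cases the effective weight at step zero equals the pretrained weight. The difference is where the trainable part sits: at zero for standard LoRA, and on the largest singular directions for PiSSA.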
rsLoRA Scaling
Standard LoRA uses a fixed scaling of $\alpha / r$, which can cause training dynamics to vary with rank. rsLoRA (Rank-Stabilized LoRA) adjusts the scaling to:
$$\frac{\alpha}{\sqrt{r}}$$
This stabilizes the magnitude of the adapter output across different rank choices, making hyperparameter tuning more consistent.
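The effect of the two scaling rules can be seen by evaluating them at a few ranks. This is a minimal sketch; `lora_scale` is a hypothetical helper and the fixed alpha of 16 is an arbitrary assumption.

```python
import math

def lora_scale(alpha, r, use_rslora=False):
    """Return alpha / sqrt(r) for rsLoRA, otherwise alpha / r."""
    return alpha / math.sqrt(r) if use_rslora else alpha / r

alpha = 16  # held fixed while the rank changes (assumption)
for r in (8, 16, 64):
    standard = lora_scale(alpha, r)
    stabilized = lora_scale(alpha, r, use_rslora=True)
    print(f"r={r:3d}  alpha/r={standard:.3f}  alpha/sqrt(r)={stabilized:.3f}")
```

With alpha held fixed, the standard factor shrinks by 8x between rank 8 and rank 64, while the rsLoRA factor shrinks by only about 2.8x.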
DoRA Decomposition
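DoRA decomposes the weight update into magnitude and direction:
$$W' = m \, \frac{W_0 + BA}{\lVert W_0 + BA \rVert_c}$$
where $m$ is a learnable magnitude vector and $\lVert \cdot \rVert_c$ denotes the column-wise norm, so the fraction normalizes each column to unit length. This decomposition allows the model to independently adapt the scale and direction of weight transformations.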
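The magnitude and direction split can be sketched in NumPy. The helper `dora_weight` is hypothetical, the sizes are illustrative, and the sketch omits the $\alpha / r$ scaling of the low-rank term. It assumes the magnitude vector starts at the column norms of the pretrained weight, as described in the DoRA paper.

```python
import numpy as np

def dora_weight(W0, A, B, m):
    """Rescale each unit-norm column of (W0 + B @ A) by its entry in m."""
    V = W0 + B @ A                                       # unnormalized direction
    col_norm = np.linalg.norm(V, axis=0, keepdims=True)  # shape (1, k)
    return m * (V / col_norm)

rng = np.random.default_rng(0)
d, k, r = 16, 12, 2                                # illustrative sizes (assumption)
W0 = rng.standard_normal((d, k))
A = rng.standard_normal((r, k))
B_zero = np.zeros((d, r))
m = np.linalg.norm(W0, axis=0, keepdims=True)      # magnitude starts at the column norms of W0

W_init = dora_weight(W0, A, B_zero, m)             # reproduces W0 exactly
B_rand = rng.standard_normal((d, r))               # stand-in for trained values
W_adapted = dora_weight(W0, A, B_rand, m)
adapted_norms = np.linalg.norm(W_adapted, axis=0, keepdims=True)

print(np.allclose(W_init, W0), np.allclose(adapted_norms, m))
```

Changing `B` and `A` rotates the columns without changing their lengths, and changing `m` does the reverse, which is the independence the decomposition is meant to provide.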
Target Module Selection
LoRA adapters are typically applied to the linear projection layers in the transformer architecture. The lora_target parameter specifies which modules to adapt. The special value "all" automatically identifies all linear modules in the model (excluding certain forbidden modules such as vision encoders in multimodal models).
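The idea behind the "all" setting can be illustrated with a toy module listing. This is only a sketch of the selection logic described above, not LLaMA-Factory's implementation: the function `find_linear_targets`, the module names, and the forbidden keywords are all assumptions made for illustration.

```python
def find_linear_targets(named_modules, forbidden=("vision_tower", "lm_head")):
    """Collect the short names of linear modules outside any forbidden path."""
    targets = set()
    for name, kind in named_modules.items():
        if kind != "Linear":
            continue                              # skip norms, embeddings, etc.
        if any(key in name for key in forbidden):
            continue                              # skip excluded submodules
        targets.add(name.split(".")[-1])          # keep only the final path component
    return sorted(targets)

# Hypothetical module listing: full name -> layer type
toy_modules = {
    "model.layers.0.self_attn.q_proj": "Linear",
    "model.layers.0.self_attn.v_proj": "Linear",
    "model.layers.0.mlp.up_proj": "Linear",
    "model.layers.0.input_layernorm": "RMSNorm",
    "vision_tower.blocks.0.attn.qkv": "Linear",
    "lm_head": "Linear",
}

targets = find_linear_targets(toy_modules)
print(targets)
```

Selecting by short name means one entry such as `q_proj` matches that projection in every transformer layer.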
Parameter Count
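For a single adapted layer with input dimension $d_{\text{in}}$ and output dimension $d_{\text{out}}$, the number of trainable parameters is:
$$r \times (d_{\text{in}} + d_{\text{out}})$$
compared to $d_{\text{in}} \times d_{\text{out}}$ for full fine-tuning. For example, with $d_{\text{in}} = d_{\text{out}} = 4096$, a rank of $r = 8$ gives 65,536 trainable parameters against 16,777,216, a 256x reduction. At this layer size the reduction stays above 100x for ranks up to 20.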
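The arithmetic can be reproduced for a few ranks. This is a minimal sketch with hypothetical helper names; the square 4096 by 4096 layer is an illustrative assumption.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters in A (r x d_in) plus B (d_out x r)."""
    return r * (d_in + d_out)

def full_param_count(d_in, d_out):
    """Parameters in the dense weight matrix of the same layer."""
    return d_in * d_out

d = 4096  # illustrative hidden size (assumption)
for r in (8, 16, 64):
    lora = lora_param_count(d, d, r)
    full = full_param_count(d, d)
    print(f"r={r:3d}  lora={lora:>9,d}  full={full:,d}  reduction={full / lora:.0f}x")
```

The reduction factor falls as the rank grows, from 256x at rank 8 to 32x at rank 64 for this layer size, so the choice of rank trades adapter capacity against parameter savings.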