Principle:Hiyouga LLaMA Factory Low Rank Adaptation

Knowledge Sources	Hiyouga_LLaMA_Factory LoRA: Low-Rank Adaptation of Large Language Models
Domains	Natural Language Processing, Parameter-Efficient Fine-Tuning, Transfer Learning
Last Updated	2026-02-06 19:00 GMT

Overview

A parameter-efficient fine-tuning technique that adapts large pretrained models by injecting trainable low-rank decomposition matrices into existing weight layers, dramatically reducing the number of trainable parameters while preserving model quality.

Description

Low-Rank Adaptation (LoRA), introduced by Hu et al. (2021), is a parameter-efficient fine-tuning (PEFT) method based on the hypothesis that the weight updates during fine-tuning have low intrinsic rank. Instead of updating all model parameters, LoRA freezes the pretrained weights and injects small trainable rank-decomposition matrices alongside the original weight matrices. This enables fine-tuning models with billions of parameters using a fraction of the GPU memory.

LoRA and its variants occupy a central position in modern LLM fine-tuning because:

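Memory efficiency: Gradients and optimizer states are kept only for the low-rank matrices, which substantially reduces training memory. The frozen base weights must still be held in memory, which is why quantization (see below) is often combined with LoRA.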
No inference overhead: The adapter matrices can be merged back into the base weights after training, resulting in zero additional latency at inference.
Modular adapters: Multiple task-specific adapters can be trained independently and swapped at inference time on a shared base model.
Compatibility with quantization: LoRA can be applied to quantized models (QLoRA), enabling fine-tuning of very large models on consumer hardware.

The LLaMA-Factory framework supports several LoRA variants and related adapter methods:

Standard LoRA: Trainable low-rank matrices $A$ and $B$ with configurable rank, alpha, and dropout.
DoRA (Weight-Decomposed Low-Rank Adaptation): Decomposes each weight into magnitude and direction components and applies the low-rank update to the direction, with the aim of improving training stability.
rsLoRA (Rank-Stabilized LoRA): Adjusts the scaling factor to stabilize training across different rank values.
PiSSA (Principal Singular Values and Singular Vectors Adaptation): Initializes LoRA matrices using the principal components of the pretrained weights via SVD.
OFT (Orthogonal Fine-Tuning): An alternative adapter method that applies orthogonal transformations to preserve pretrained feature relationships.

Usage

Use LoRA when you want to:

Fine-tune a large model with limited GPU memory (e.g., fine-tuning a 70B model on a single GPU with quantization).
Train multiple task-specific adapters that share a common base model.
Maintain the ability to merge adapters back into the base model for deployment.
Apply fine-tuning to quantized models (QLoRA).
Achieve competitive performance with full fine-tuning while training only a small fraction of parameters.

LoRA suits most fine-tuning scenarios and is the recommended default in LLaMA-Factory when full-parameter training is infeasible.

Theoretical Basis

Low-Rank Weight Update

For a pretrained weight matrix $W_{0} \in ℝ^{d \times k}$ , LoRA parameterizes the weight update as a low-rank decomposition:

$W = W_{0} + Δ W = W_{0} + B A$

where $B \in ℝ^{d \times r}$ and $A \in ℝ^{r \times k}$ are trainable matrices and the rank satisfies $r ≪ \min (d, k)$, so $Δ W$ has rank at most $r$. During training, $W_{0}$ is frozen and only $A$ and $B$ receive gradient updates.

Forward Pass

The modified forward pass for an input $x$ is:

$h = W_{0} x + \frac{α}{r} B A x$

where $α$ is the LoRA scaling factor (the lora_alpha parameter) and $r$ is the rank. The ratio $\frac{α}{r}$ controls the magnitude of the adapter's contribution relative to the pretrained weights. In LLaMA-Factory, lora_alpha defaults to $α = 2 r$, so the default effective scale is $\frac{α}{r} = 2$.
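The forward pass can be sketched in dependency-free Python. This is a minimal illustration on toy matrices, not LLaMA-Factory's implementation: the helper names (`matvec`, `matmul`, `lora_forward`, `merge`) and all numeric values are made up for this example. It also shows why merging adds no inference cost: folding $\frac{α}{r} B A$ into $W_{0}$ gives the same output as running the adapter alongside the base weights.

```python
def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def matmul(P, Q):
    """Multiply matrices P (n x m) and Q (m x p), both lists of rows."""
    cols = list(zip(*Q))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in P]


def lora_forward(W0, A, B, x, alpha, r):
    """h = W0 x + (alpha / r) * B (A x); W0 is frozen, A and B are trainable."""
    scale = alpha / r
    base = matvec(W0, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]


def merge(W0, A, B, alpha, r):
    """Fold the adapter into the base weights: W = W0 + (alpha / r) * B A."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W0, BA)]


# Toy shapes: d = 2, k = 3, r = 1 (illustrative values only).
W0 = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # d x k, frozen
A = [[0.5, -0.5, 1.0]]                    # r x k, trainable
B = [[2.0], [4.0]]                        # d x r, trainable
x = [1.0, 2.0, 3.0]
r = 1
alpha = 2 * r  # mirrors the alpha = 2r default described above

h_adapter = lora_forward(W0, A, B, x, alpha, r)
h_merged = matvec(merge(W0, A, B, alpha, r), x)
print(h_adapter, h_merged)
```

Computing `B (A x)` rather than `(B A) x` keeps the intermediate result at size $r$, which is how the adapter stays cheap during training.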

Initialization

Standard LoRA initializes $A$ with a random Gaussian distribution and $B$ with zeros, ensuring the adapter contribution is zero at the start of training:

$Δ W_{init} = B_{0} A_{0} = 𝟎 \cdot A_{0} = 𝟎$
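The zero-update property at initialization can be checked directly. The sketch below uses only the Python standard library. The function name `init_lora`, the standard deviation `std=0.02`, and the seed are assumptions chosen for illustration, not values taken from LLaMA-Factory.

```python
import random


def init_lora(d, k, r, std=0.02, seed=0):
    """Return (A, B) with A ~ N(0, std^2) of shape r x k and B = 0 of shape d x r.

    std and seed are illustrative assumptions, not library defaults.
    """
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, std) for _ in range(k)] for _ in range(r)]
    B = [[0.0] * r for _ in range(d)]
    return A, B


def matmul(P, Q):
    """Multiply matrices P (n x m) and Q (m x p), both lists of rows."""
    cols = list(zip(*Q))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in P]


A0, B0 = init_lora(d=4, k=6, r=2)
delta_W = matmul(B0, A0)  # d x k matrix of zeros, whatever A0 contains
print(delta_W)
```

Because $B_{0}$ is zero, the adapted model reproduces the pretrained model exactly on the first forward pass, and training moves away from that point gradually.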

PiSSA initialization instead decomposes the pretrained weight using SVD and initializes the adapter with the principal components:

$W_{0} = U Σ V^{⊤} = U_{r} Σ_{r} V_{r}^{⊤} + U_{res} Σ_{res} V_{res}^{⊤}$

where the adapter captures the top-$r$ singular components and the frozen residual captures the rest. The split is exact: the two terms together reproduce $W_{0}$.
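In this page's notation ($B \in ℝ^{d \times r}$, $A \in ℝ^{r \times k}$), the split described in the PiSSA paper can be written as follows. This is a sketch that ignores the $\frac{α}{r}$ scaling factor; the subscript $0$ marks initial values and $W_{res}$ names the frozen residual.

$B_{0} = U_{r} Σ_{r}^{1/2}, \quad A_{0} = Σ_{r}^{1/2} V_{r}^{⊤}, \quad W_{res} = U_{res} Σ_{res} V_{res}^{⊤}$

$W_{res} + B_{0} A_{0} = U_{res} Σ_{res} V_{res}^{⊤} + U_{r} Σ_{r} V_{r}^{⊤} = W_{0}$

Unlike standard LoRA, $B_{0} A_{0}$ is nonzero here, but the model still starts from the pretrained function because the frozen matrix is $W_{res}$ rather than $W_{0}$.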

rsLoRA Scaling

Standard LoRA scales the adapter by $\frac{α}{r}$, so for a fixed $α$ the adapter's contribution shrinks as the rank grows, which can cause training dynamics to vary with rank. rsLoRA (Rank-Stabilized LoRA) adjusts the scaling to:

$\frac{α}{\sqrt{r}}$

This stabilizes the magnitude of the adapter output across different rank choices, making hyperparameter tuning more consistent.
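The difference between the two scaling rules is easy to see numerically. In the short sketch below, `lora_scale` is a hypothetical helper written for this page; its `use_rslora` argument borrows the name of the corresponding configuration flag, and `alpha = 16` is an arbitrary illustrative value.

```python
import math


def lora_scale(alpha, r, use_rslora=False):
    """Return alpha / sqrt(r) for rsLoRA, else the standard alpha / r."""
    return alpha / math.sqrt(r) if use_rslora else alpha / r


alpha = 16  # held fixed while the rank varies
ranks = [4, 16, 64, 256]
standard = {r: lora_scale(alpha, r) for r in ranks}
stabilized = {r: lora_scale(alpha, r, use_rslora=True) for r in ranks}

for r in ranks:
    print(f"r={r:>3}  alpha/r={standard[r]:<7}  alpha/sqrt(r)={stabilized[r]}")
```

Going from rank 4 to rank 256 with $α$ fixed, the standard factor falls by 64x while the rsLoRA factor falls by 8x.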

DoRA Decomposition

DoRA decomposes the adapted weight into a magnitude component and a direction component:

$W' = m \cdot \frac{W_{0} + B A}{‖ W_{0} + B A ‖_{c}}$

where $m$ is a learnable magnitude vector with one entry per column and $‖ \cdot ‖_{c}$ denotes the vector-wise norm of each column, so the fraction is a matrix with unit-norm columns (the direction). This decomposition allows the model to independently adapt the scale and direction of the weights.
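The formula can be traced on a small matrix with the sketch below. It is dependency-free and illustrative only: the helper names and numbers are made up, and initializing $m$ to the column norms of $W_{0}$ is an assumption based on the DoRA paper, not something stated on this page.

```python
import math


def matmul(P, Q):
    """Multiply matrices P (n x m) and Q (m x p), both lists of rows."""
    cols = list(zip(*Q))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in P]


def column_norms(W):
    """L2 norm of each column of W (one value per column)."""
    return [math.sqrt(sum(v * v for v in col)) for col in zip(*W)]


def dora_weight(W0, A, B, m):
    """W' = m * (W0 + B A) / ||W0 + B A||_c, applied column by column.

    Assumes no column of W0 + B A is all zeros (the norm would be 0).
    """
    BA = matmul(B, A)
    V = [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W0, BA)]
    norms = column_norms(V)
    return [[m_j * v / n_j for v, m_j, n_j in zip(row, m, norms)] for row in V]


W0 = [[3.0, 0.0], [4.0, 2.0]]  # d x k, column norms are 5 and 2
A0 = [[0.1, -0.2]]             # r x k, arbitrary nonzero values
B0 = [[0.0], [0.0]]            # d x r, zeros as in standard LoRA initialization
m0 = column_norms(W0)          # assumption: magnitude starts at ||W0||_c

W_init = dora_weight(W0, A0, B0, m0)               # reproduces W0
W_scaled = dora_weight(W0, A0, B0, [10.0, 4.0])    # doubling m doubles each column
print(W_init, W_scaled)
```

With $B = 𝟎$ and $m$ equal to the column norms of $W_{0}$, the adapted weight equals $W_{0}$. Changing only $m$ rescales columns without changing their direction.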

Target Module Selection

LoRA adapters are typically applied to the linear projection layers in the transformer architecture. The lora_target parameter specifies which modules to adapt. The special value "all" automatically identifies all linear modules in the model (excluding certain forbidden modules such as vision encoders in multimodal models).
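The snippet below sketches what a set of LoRA-related settings might look like, written as a Python dictionary. The key names follow the parameter names used on this page (lora_target, lora_alpha) and others commonly seen in LLaMA-Factory configurations (finetuning_type, lora_rank, lora_dropout); verify them against the installed version. The helpers `parse_lora_target` and `effective_alpha` are hypothetical and are not part of the framework. The module names `q_proj` and `v_proj` are examples from LLaMA-style architectures.

```python
# Illustrative settings only; key names should be checked against the
# installed LLaMA-Factory version.
lora_config = {
    "finetuning_type": "lora",
    "lora_target": "all",   # or a comma-separated list such as "q_proj,v_proj"
    "lora_rank": 8,
    "lora_alpha": None,     # left unset; this page states the default is 2 * rank
    "lora_dropout": 0.0,
}


def parse_lora_target(value):
    """Split a comma-separated target string into module names (hypothetical helper)."""
    return [name.strip() for name in value.split(",") if name.strip()]


def effective_alpha(config):
    """Return lora_alpha, falling back to 2 * lora_rank when it is unset (hypothetical helper)."""
    alpha = config.get("lora_alpha")
    return alpha if alpha is not None else 2 * config["lora_rank"]


print(parse_lora_target("q_proj, v_proj"))
print(effective_alpha(lora_config))
```

Targeting more modules raises the trainable parameter count roughly in proportion to the number of adapted layers, so an explicit module list is one way to trade adaptation capacity for memory.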

Parameter Count

For a single adapted layer with input dimension $k$ and output dimension $d$ , the number of trainable parameters is:

${params}_{LoRA} = r \times (d + k)$

compared to $d \times k$ for full fine-tuning. The reduction factor is $\frac{d k}{r (d + k)}$, which for $d = k = 4096$ equals $\frac{2048}{r}$: about 256x at $r = 8$ and 32x at $r = 64$. Larger layers or smaller ranks give larger reductions.
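The counts above can be checked with a few lines of Python. The function names are illustrative, and $d = k = 4096$ is taken from the example values in the text.

```python
def lora_params(d, k, r):
    """Trainable parameters of one LoRA-adapted layer: B is d x r, A is r x k."""
    return r * (d + k)


def full_params(d, k):
    """Trainable parameters when the whole d x k weight matrix is updated."""
    return d * k


d = k = 4096
for r in (8, 16, 32, 64):
    ratio = full_params(d, k) / lora_params(d, k, r)
    print(f"r={r:>2}: LoRA={lora_params(d, k, r):>7,}  full={full_params(d, k):,}  reduction={ratio:.0f}x")
```

These figures are per adapted layer. The model-wide share of trainable parameters is smaller still when only some modules are targeted, since untargeted layers contribute no trainable parameters.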
