Principle: Hugging Face TRL PEFT LoRA Configuration for SFT
| Knowledge Sources | |
|---|---|
| Domains | NLP, Training |
| Last Updated | 2026-02-06 17:00 GMT |
Overview
Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning, where small trainable adapter matrices are injected into frozen pretrained model layers to dramatically reduce the number of trainable parameters.
Description
Fine-tuning all parameters of a large language model is computationally expensive and requires storing a full copy of the model weights for each downstream task. LoRA (Low-Rank Adaptation), introduced by Hu et al. (2021), addresses this by freezing the pretrained model weights and injecting small, trainable low-rank decomposition matrices into selected layers. This reduces the number of trainable parameters by orders of magnitude while maintaining comparable performance to full fine-tuning.
The key insight is that the weight updates during fine-tuning have a low intrinsic rank. Instead of learning a full update matrix Delta W of shape (d x k), LoRA decomposes it into two smaller matrices:
- A of shape (r x k) (the down-projection)
- B of shape (d x r) (the up-projection)
where r << min(d, k) is the rank of the adaptation. During training, only A and B receive gradients; the original weight matrix W remains frozen.
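The frozen-plus-adapter forward pass can be sketched in NumPy. This is a minimal illustration of the formulation above, not the PEFT implementation; all variable names and dimensions are hypothetical:

```python
import numpy as np

d, k, r, alpha = 64, 32, 4, 8  # hypothetical layer dims, rank, and scaling

rng = np.random.default_rng(0)
W0 = rng.normal(size=(d, k))        # pretrained weight, frozen
A = rng.normal(size=(r, k)) * 0.01  # down-projection, trainable (Gaussian init)
B = np.zeros((d, r))                # up-projection, trainable (zero init)

def lora_forward(x):
    # h = W0 @ x + (alpha / r) * B @ A @ x; only A and B would receive gradients
    return W0 @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(k,))
# With B initialized to zero, the adapter contributes nothing,
# so the output matches the frozen layer exactly
assert np.allclose(lora_forward(x), W0 @ x)
```

Because B starts at zero, training begins from a model that is numerically identical to the frozen base.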
In the TRL SFT workflow, LoRA configuration is managed through the ModelConfig dataclass, which exposes LoRA-specific fields (use_peft, lora_r, lora_alpha, lora_dropout, etc.). The get_peft_config() utility function translates these settings into a LoraConfig object from the PEFT library, which is then passed to the SFTTrainer.
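That wiring can be sketched as follows. Treat this as a configuration sketch rather than a pinned recipe: TRL's API surface shifts between versions, the checkpoint name is a placeholder, and `dataset` stands in for any prepared `datasets.Dataset`:

```python
from trl import ModelConfig, SFTConfig, SFTTrainer, get_peft_config

model_args = ModelConfig(
    model_name_or_path="Qwen/Qwen2.5-0.5B",  # placeholder causal LM checkpoint
    use_peft=True,
    lora_r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    lora_target_modules=None,  # None lets PEFT pick its per-architecture default
)

# Translate ModelConfig fields into a peft.LoraConfig (None if use_peft=False)
peft_config = get_peft_config(model_args)

trainer = SFTTrainer(
    model=model_args.model_name_or_path,
    args=SFTConfig(output_dir="lora-sft-out"),
    train_dataset=dataset,  # placeholder: a prepared datasets.Dataset
    peft_config=peft_config,
)
trainer.train()
```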
Usage
Use LoRA configuration when:
- Fine-tuning large models (7B+ parameters) on limited GPU memory.
- Running QLoRA where the base model is quantized to 4-bit or 8-bit.
- Training multiple task-specific adapters that can be swapped at inference time.
- Needing faster training iteration since fewer parameters require gradient computation and optimizer state.
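For the QLoRA case above, the base model is loaded quantized while the adapters stay in higher precision. A hedged sketch using `BitsAndBytesConfig` from transformers (checkpoint name is a placeholder; requires a CUDA GPU with bitsandbytes installed):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 quantization from the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in bf16
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
)

# The 4-bit base model stays frozen; only the LoRA adapters attached on top
# (e.g., via the peft_config passed to SFTTrainer) are trained.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # placeholder checkpoint
    quantization_config=bnb_config,
)
```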
Theoretical Basis
LoRA Formulation: For a pretrained weight matrix W_0 in R^{d x k}, the modified forward pass is:
h = W_0 @ x + (alpha / r) * B @ A @ x
where:
- A in R^{r x k} is initialized with a random Gaussian distribution
- B in R^{d x r} is initialized to zero (so the adapter contributes nothing at initialization and the model starts out identical to the frozen base)
- alpha is a scaling hyperparameter that controls the magnitude of the adapter's contribution
- r is the rank of the decomposition
Scaling: The ratio alpha / r acts as a learning rate multiplier for the adapter. When use_rslora=True, Rank-Stabilized LoRA uses alpha / sqrt(r) instead, which provides more stable training across different rank values.
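The difference between the two scalings is easy to see numerically: with alpha fixed, the standard alpha / r factor shrinks linearly as the rank grows, while the rank-stabilized alpha / sqrt(r) factor decays much more slowly (plain arithmetic, no library assumptions):

```python
import math

alpha = 32
for r in (8, 16, 64):
    standard = alpha / r               # classic LoRA scaling
    stabilized = alpha / math.sqrt(r)  # rsLoRA scaling (use_rslora=True)
    print(f"r={r:3d}  alpha/r={standard:5.2f}  alpha/sqrt(r)={stabilized:5.2f}")
```

At r = 64 the classic factor has dropped to 0.5 while the rank-stabilized one is still 4.0, which is why rsLoRA behaves more consistently when sweeping over ranks.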
Parameter Efficiency: For a single linear layer of dimension d x k, LoRA reduces trainable parameters from d * k to r * (d + k). With typical values (e.g., d = k = 4096, r = 16), this is a reduction factor of 128x per layer.
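The arithmetic for that example is worth checking directly:

```python
d = k = 4096  # layer dimensions from the example above
r = 16        # LoRA rank

full = d * k        # trainable params when fully fine-tuning this layer
lora = r * (d + k)  # trainable params with a rank-r adapter

print(full, lora, full / lora)  # 16777216 131072 128.0
```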
DoRA Extension: Weight-Decomposed LoRA (DoRA) further decomposes weight updates into magnitude and direction components. The direction is handled by standard LoRA, while magnitude is learned by a separate scalar parameter, improving performance especially at low ranks.
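The magnitude/direction split can be illustrated in NumPy. This is an illustrative sketch of the decomposition, not PEFT's implementation (in practice DoRA is enabled via `LoraConfig(use_dora=True)`); all names and dimensions here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, r = 16, 8, 2
W0 = rng.normal(size=(d, k))        # frozen pretrained weight
A = rng.normal(size=(r, k)) * 0.01  # LoRA down-projection
B = np.zeros((d, r))                # LoRA up-projection (zero init)

# Direction component: the LoRA-updated weight, normalized column-wise
V = W0 + B @ A
direction = V / np.linalg.norm(V, axis=0, keepdims=True)

# Magnitude component: trainable per-column scale, initialized to base norms
m = np.linalg.norm(W0, axis=0, keepdims=True)

W_dora = m * direction
# At initialization (B = 0) the decomposition reproduces the base weights
assert np.allclose(W_dora, W0)
```

During training, m adjusts each column's magnitude independently of the LoRA-learned direction, which is what helps DoRA at low ranks.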
| Parameter | Default | Description |
|---|---|---|
| lora_r | 16 | Rank of the low-rank matrices |
| lora_alpha | 32 | Scaling factor (effective LR multiplier is alpha/r) |
| lora_dropout | 0.05 | Dropout probability applied to the adapter input |
| lora_target_modules | None | Which layers to inject adapters into (None = library default) |
| lora_task_type | "CAUSAL_LM" | Task type for the PEFT adapter |
| bias | "none" | Whether to train bias terms (always "none" in TRL's config) |