Principle: Hugging Face TRL PEFT LoRA Configuration for SFT
| Knowledge Sources | |
|---|---|
| Domains | NLP, Training |
| Last Updated | 2026-02-06 17:00 GMT |
Overview
Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning, where small trainable adapter matrices are injected into frozen pretrained model layers to dramatically reduce the number of trainable parameters.
Description
Fine-tuning all parameters of a large language model is computationally expensive and requires storing a full copy of the model weights for each downstream task. LoRA (Low-Rank Adaptation), introduced by Hu et al. (2021), addresses this by freezing the pretrained model weights and injecting small, trainable low-rank decomposition matrices into selected layers. This reduces the number of trainable parameters by orders of magnitude while maintaining comparable performance to full fine-tuning.
The key insight is that the weight updates during fine-tuning have a low intrinsic rank. Instead of learning a full update matrix Delta W of shape (d x k), LoRA decomposes it into two smaller matrices:
- A of shape (r x k) (the down-projection)
- B of shape (d x r) (the up-projection)
where r << min(d, k) is the rank of the adaptation. During training, only A and B receive gradients; the original weight matrix W remains frozen.
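The frozen-plus-adapter forward pass can be sketched in NumPy. This is a minimal illustration of the formulation above, not the PEFT implementation; all variable names and dimensions are hypothetical:

```python
import numpy as np

d, k, r, alpha = 64, 32, 4, 8  # hypothetical layer dims, rank, and scaling

rng = np.random.default_rng(0)
W0 = rng.normal(size=(d, k))        # pretrained weight, frozen
A = rng.normal(size=(r, k)) * 0.01  # down-projection, trainable (Gaussian init)
B = np.zeros((d, r))                # up-projection, trainable (zero init)

def lora_forward(x):
    # h = W0 @ x + (alpha / r) * B @ A @ x; only A and B would receive gradients
    return W0 @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=(k,))
# With B initialized to zero, the adapter contributes nothing,
# so the output matches the frozen layer exactly
assert np.allclose(lora_forward(x), W0 @ x)
```

Because B starts at zero, training begins from a model that is numerically identical to the frozen base.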
In the TRL SFT workflow, LoRA configuration is managed through the ModelConfig dataclass, which exposes LoRA-specific fields (use_peft, lora_r, lora_alpha, lora_dropout, etc.). The get_peft_config() utility function translates these settings into a LoraConfig object from the PEFT library, which is then passed to the SFTTrainer.
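That wiring can be sketched as follows. Treat this as a configuration sketch rather than a pinned recipe: TRL's API surface shifts between versions, the checkpoint name is a placeholder, and `dataset` stands in for any prepared `datasets.Dataset`:

```python
from trl import ModelConfig, SFTConfig, SFTTrainer, get_peft_config

model_args = ModelConfig(
    model_name_or_path="Qwen/Qwen2.5-0.5B",  # placeholder causal LM checkpoint
    use_peft=True,
    lora_r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    lora_target_modules=None,  # None lets PEFT pick its per-architecture default
)

# Translate ModelConfig fields into a peft.LoraConfig (None if use_peft=False)
peft_config = get_peft_config(model_args)

trainer = SFTTrainer(
    model=model_args.model_name_or_path,
    args=SFTConfig(output_dir="lora-sft-out"),
    train_dataset=dataset,  # placeholder: a prepared datasets.Dataset
    peft_config=peft_config,
)
trainer.train()
```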
Usage
Use LoRA configuration when:
- Fine-tuning large models (7B+ parameters) on limited GPU memory.
- Running QLoRA where the base model is quantized to 4-bit or 8-bit.
- Training multiple task-specific adapters that can be swapped at inference time.
- Needing faster training iteration since fewer parameters require gradient computation and optimizer state.
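For the QLoRA case above, the base model is loaded quantized while the adapters stay in higher precision. A hedged sketch using `BitsAndBytesConfig` from transformers (checkpoint name is a placeholder; requires a CUDA GPU with bitsandbytes installed):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 quantization from the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in bf16
    bnb_4bit_use_double_quant=True,         # quantize the quantization constants too
)

# The 4-bit base model stays frozen; only the LoRA adapters attached on top
# (e.g., via the peft_config passed to SFTTrainer) are trained.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # placeholder checkpoint
    quantization_config=bnb_config,
)
```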
Theoretical Basis
LoRA Formulation: For a pretrained weight matrix W_0 in R^{d x k}, the modified forward pass is:
h = W_0 @ x + (alpha / r) * B @ A @ x
where:
- A in R^{r x k} is initialized with a random Gaussian distribution
- B in R^{d x r} is initialized to zero (so the adapter contributes nothing at initialization and the model starts out identical to the frozen base)
- alpha is a scaling hyperparameter that controls the magnitude of the adapter's contribution
- r is the rank of the decomposition
Scaling: The ratio alpha / r acts as a learning rate multiplier for the adapter. When use_rslora=True, Rank-Stabilized LoRA uses alpha / sqrt(r) instead, which provides more stable training across different rank values.
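The difference between the two scalings is easy to see numerically: with alpha fixed, the standard alpha / r factor shrinks linearly as the rank grows, while the rank-stabilized alpha / sqrt(r) factor decays much more slowly (plain arithmetic, no library assumptions):

```python
import math

alpha = 32
for r in (8, 16, 64):
    standard = alpha / r               # classic LoRA scaling
    stabilized = alpha / math.sqrt(r)  # rsLoRA scaling (use_rslora=True)
    print(f"r={r:3d}  alpha/r={standard:5.2f}  alpha/sqrt(r)={stabilized:5.2f}")
```

At r = 64 the classic factor has dropped to 0.5 while the rank-stabilized one is still 4.0, which is why rsLoRA behaves more consistently when sweeping over ranks.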
Parameter Efficiency: For a single linear layer of dimension d x k, LoRA reduces trainable parameters from d * k to r * (d + k). With typical values (e.g., d = k = 4096, r = 16), this is a reduction factor of 128x per layer.
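The arithmetic for that example is worth checking directly:

```python
d = k = 4096  # layer dimensions from the example above
r = 16        # LoRA rank

full = d * k        # trainable params when fully fine-tuning this layer
lora = r * (d + k)  # trainable params with a rank-r adapter

print(full, lora, full / lora)  # 16777216 131072 128.0
```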
DoRA Extension: Weight-Decomposed LoRA (DoRA) further decomposes weight updates into magnitude and direction components. The direction is handled by standard LoRA, while magnitude is learned by a separate scalar parameter, improving performance especially at low ranks.
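The magnitude/direction split can be illustrated in NumPy. This is an illustrative sketch of the decomposition, not PEFT's implementation (in practice DoRA is enabled via `LoraConfig(use_dora=True)`); all names and dimensions here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, r = 16, 8, 2
W0 = rng.normal(size=(d, k))        # frozen pretrained weight
A = rng.normal(size=(r, k)) * 0.01  # LoRA down-projection
B = np.zeros((d, r))                # LoRA up-projection (zero init)

# Direction component: the LoRA-updated weight, normalized column-wise
V = W0 + B @ A
direction = V / np.linalg.norm(V, axis=0, keepdims=True)

# Magnitude component: trainable per-column scale, initialized to base norms
m = np.linalg.norm(W0, axis=0, keepdims=True)

W_dora = m * direction
# At initialization (B = 0) the decomposition reproduces the base weights
assert np.allclose(W_dora, W0)
```

During training, m adjusts each column's magnitude independently of the LoRA-learned direction, which is what helps DoRA at low ranks.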
| Parameter | Default | Description |
|---|---|---|
| lora_r | 16 | Rank of the low-rank matrices |
| lora_alpha | 32 | Scaling factor (effective LR multiplier is alpha/r) |
| lora_dropout | 0.05 | Dropout probability applied to the adapter input |
| lora_target_modules | None | Which layers to inject adapters into (None = library default) |
| lora_task_type | "CAUSAL_LM" | Task type for the PEFT adapter |
| bias | "none" | Whether to train bias terms (always "none" in TRL's config) |