
Principle:Huggingface Trl PEFT LoRA Configuration SFT

From Leeroopedia


Knowledge Sources
Domains NLP, Training
Last Updated 2026-02-06 17:00 GMT

Overview

Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning, where small trainable adapter matrices are injected into frozen pretrained model layers to dramatically reduce the number of trainable parameters.

Description

Fine-tuning all parameters of a large language model is computationally expensive and requires storing a full copy of the model weights for each downstream task. LoRA (Low-Rank Adaptation), introduced by Hu et al. (2021), addresses this by freezing the pretrained model weights and injecting small, trainable low-rank decomposition matrices into selected layers. This reduces the number of trainable parameters by orders of magnitude while maintaining comparable performance to full fine-tuning.

The key insight is that the weight updates during fine-tuning have a low intrinsic rank. Instead of learning a full update matrix Delta W of shape (d x k), LoRA decomposes it as the product B @ A of two smaller matrices:

  • A of shape (r x k) (the down-projection)
  • B of shape (d x r) (the up-projection)

where r << min(d, k) is the rank of the adaptation. During training, only A and B receive gradients; the original weight matrix W remains frozen.
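The decomposition above can be sketched with NumPy; the dimensions below are illustrative, not taken from any particular model:

```python
import numpy as np

d, k, r = 512, 256, 8       # illustrative layer dimensions and LoRA rank

W = np.random.randn(d, k)   # frozen pretrained weight, never updated
A = np.random.randn(r, k)   # trainable down-projection
B = np.zeros((d, r))        # trainable up-projection

delta_W = B @ A             # has the full (d x k) shape, but rank <= r
assert delta_W.shape == W.shape

full_params = d * k         # parameters updated by full fine-tuning
lora_params = r * (d + k)   # parameters updated by LoRA (A and B only)
```

Only `A` and `B` would receive gradients during training; `W` stays frozen.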

In the TRL SFT workflow, LoRA configuration is managed through the ModelConfig dataclass, which exposes LoRA-specific fields (use_peft, lora_r, lora_alpha, lora_dropout, etc.). The get_peft_config() utility function translates these settings into a LoraConfig object from the PEFT library, which is then passed to the SFTTrainer.
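A minimal sketch of this translation, mirroring (not importing) the LoRA-related fields of TRL's ModelConfig and the behavior of get_peft_config(); the real utilities live in the trl and peft libraries:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelConfig:
    # Subset of TRL's ModelConfig LoRA fields, with the documented defaults.
    use_peft: bool = False
    lora_r: int = 16
    lora_alpha: int = 32
    lora_dropout: float = 0.05
    lora_target_modules: Optional[list] = None
    lora_task_type: str = "CAUSAL_LM"

def get_peft_config(cfg: ModelConfig) -> Optional[dict]:
    """Sketch: returns None when PEFT is disabled, otherwise the keyword
    arguments that would construct a peft.LoraConfig."""
    if not cfg.use_peft:
        return None
    return {
        "r": cfg.lora_r,
        "lora_alpha": cfg.lora_alpha,
        "lora_dropout": cfg.lora_dropout,
        "target_modules": cfg.lora_target_modules,
        "task_type": cfg.lora_task_type,
        "bias": "none",
    }

peft_kwargs = get_peft_config(ModelConfig(use_peft=True, lora_r=8))
```

In the real workflow, the resulting LoraConfig object is passed to SFTTrainer via its peft_config argument.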

Usage

Use LoRA configuration when:

  • Fine-tuning large models (7B+ parameters) on limited GPU memory.
  • Running QLoRA where the base model is quantized to 4-bit or 8-bit.
  • Training multiple task-specific adapters that can be swapped at inference time.
  • Needing faster training iteration since fewer parameters require gradient computation and optimizer state.

Theoretical Basis

LoRA Formulation: For a pretrained weight matrix W_0 in R^{d x k}, the modified forward pass is:

h = W_0 @ x + (alpha / r) * B @ A @ x

where:

  • A in R^{r x k} is initialized from a random Gaussian distribution
  • B in R^{d x r} is initialized to zero (so the adapter contributes nothing at initialization, and the modified layer starts out identical to the frozen model)
  • alpha is a scaling hyperparameter that controls the magnitude of the adapter's contribution
  • r is the rank of the decomposition
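The forward pass and the zero-initialization property can be checked numerically; dimensions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 64, 32, 4, 8

W0 = rng.standard_normal((d, k))          # frozen pretrained weight
A = rng.standard_normal((r, k)) * 0.01    # Gaussian-initialized down-projection
B = np.zeros((d, r))                      # zero-initialized up-projection
x = rng.standard_normal(k)

h = W0 @ x + (alpha / r) * (B @ (A @ x))

# With B = 0 the adapter term vanishes, so h equals the frozen forward pass.
assert np.allclose(h, W0 @ x)
```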

Scaling: The ratio alpha / r acts as a learning rate multiplier for the adapter. When use_rslora=True, Rank-Stabilized LoRA uses alpha / sqrt(r) instead, which provides more stable training across different rank values.
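Comparing the two scaling rules for a fixed alpha makes the stabilization effect concrete:

```python
import math

alpha = 32
ranks = [4, 16, 64]

standard = {r: alpha / r for r in ranks}        # classic LoRA scaling
rs = {r: alpha / math.sqrt(r) for r in ranks}   # rank-stabilized (rsLoRA) scaling

# Classic scaling shrinks linearly with rank: 8.0 -> 2.0 -> 0.5.
# rsLoRA shrinks only with sqrt(r): 16.0 -> 8.0 -> 4.0,
# so the adapter's effective magnitude varies less as the rank changes.
```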

Parameter Efficiency: For a single linear layer of dimension d x k, LoRA reduces trainable parameters from d * k to r * (d + k). With typical values (e.g., d = k = 4096, r = 16), this is a reduction factor of 128x per layer.
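The arithmetic for the quoted example:

```python
d = k = 4096
r = 16

full = d * k              # trainable parameters under full fine-tuning
lora = r * (d + k)        # trainable parameters under LoRA
reduction = full / lora   # 4096*4096 / (16*8192) = 128.0
```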

DoRA Extension: Weight-Decomposed LoRA (DoRA) further decomposes weight updates into magnitude and direction components. The direction is handled by standard LoRA, while the magnitude is captured by a separate learnable vector, improving performance especially at low ranks.

Parameter            Default      Description
lora_r               16           Rank of the low-rank matrices
lora_alpha           32           Scaling factor (effective multiplier is alpha / r)
lora_dropout         0.05         Dropout probability applied to the adapter input
lora_target_modules  None         Which layers to inject adapters into (None = library default)
lora_task_type       "CAUSAL_LM"  Task type for the PEFT adapter
bias                 "none"       Whether to train bias terms (always "none" in TRL's config)
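Assuming the peft library is installed, the defaults above correspond roughly to the following LoraConfig (a sketch; the field names match peft's LoraConfig):

```python
from peft import LoraConfig

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,        # effective scaling = 32 / 16 = 2.0
    lora_dropout=0.05,
    target_modules=None,  # None lets peft pick defaults for the architecture
    task_type="CAUSAL_LM",
    bias="none",
)
```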
