Principle: LoRA Adapter Injection (PacktPublishing/LLM-Engineers-Handbook)
| Field | Value |
|---|---|
| Principle Name | LoRA Adapter Injection |
| Category | Low-Rank Adaptation for Parameter-Efficient Fine-tuning |
| Workflow | LLM_Finetuning |
| Repo | PacktPublishing/LLM-Engineers-Handbook |
| Implemented by | Implementation:PacktPublishing_LLM_Engineers_Handbook_FastLanguageModel_Get_Peft_Model |
Overview
Low-Rank Adaptation (LoRA) is a parameter-efficient fine-tuning technique that injects small, trainable low-rank decomposition matrices into existing model layers. This enables fine-tuning with a fraction of the total parameters (often <1%) while maintaining performance comparable to full fine-tuning. The original pre-trained weights remain frozen, and only the injected adapter weights are updated during training.
Theory
The Full Fine-tuning Problem
Full fine-tuning updates every weight in the model. For a 7B parameter model, this means:
- 7 billion trainable parameters requiring gradient computation and optimizer states.
- Optimizer states (e.g., AdamW) require 2x additional memory (first and second moments).
- Total memory for full fine-tuning can exceed 60-80 GB for a 7B model.
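The memory figure can be reproduced with back-of-envelope arithmetic. This is a sketch, not an exact accounting: the assumed byte counts (fp16 weights and gradients, fp32 AdamW moments) vary with precision and optimizer implementation.

```python
# Back-of-envelope memory estimate for full fine-tuning of a 7B model.
# Assumed byte counts per parameter (mixed-precision training with AdamW):
#   fp16 weights: 2, fp16 gradients: 2, fp32 first + second moments: 4 + 4.
params = 7e9
bytes_per_param = 2 + 2 + 8  # weights + gradients + optimizer moments
total_gb = params * bytes_per_param / 1e9
print(f"~{total_gb:.0f} GB")  # roughly 84 GB, consistent with "exceed 60-80 GB"
```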
LoRA Solution
LoRA hypothesizes that the weight updates during fine-tuning have a low intrinsic rank. Instead of updating the full weight matrix W, LoRA decomposes the update into two low-rank matrices:
W' = W + Delta_W = W + B * A
Where:
W in R^{d x k} -- original frozen weight matrix
B in R^{d x r} -- low-rank up-projection (trainable)
A in R^{r x k} -- low-rank down-projection (trainable)
r << min(d, k) -- rank (controls expressiveness vs. efficiency)
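The decomposition above can be sketched in a few lines of NumPy. The dimensions are illustrative; the point is the shapes and the fact that the merged weight W' = W + B * A has the same shape as W.

```python
import numpy as np

d, k, r = 64, 48, 4  # illustrative dimensions with r << min(d, k)

W = np.random.randn(d, k)   # original frozen weight matrix
B = np.zeros((d, r))        # trainable up-projection, initialized to zeros
A = np.random.randn(r, k)   # trainable down-projection, Gaussian init

delta_W = B @ A             # low-rank update, shape (d, k)
W_prime = W + delta_W       # adapted weight: W' = W + B * A

assert W_prime.shape == (d, k)
```

Because B starts at zero, delta_W is initially the zero matrix, so the adapted model is exactly the base model before any training step.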
Mathematical Basis
For a linear layer y = Wx, the LoRA-modified forward pass becomes:
y = Wx + (alpha / r) * BAx
Where:
- r (rank): Controls the expressiveness of the adapter. A higher rank means more trainable parameters but a more expressive update.
- alpha (lora_alpha): A scaling factor that controls the magnitude of the adapter's contribution; the effective scaling is alpha / r.
- Initialization: B is initialized to zeros and A with random Gaussian values, so BA = 0 at the start of training and the adapter initially leaves the base model's output unchanged.
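The modified forward pass can be sketched directly in NumPy. This keeps the adapter unmerged (as during training) and checks that the zero-initialized B makes the adapter a no-op at the start; the dimensions are illustrative.

```python
import numpy as np

d, k, r, alpha = 64, 48, 8, 16

W = np.random.randn(d, k)          # frozen base weight
B = np.zeros((d, r))               # trainable, zero-initialized
A = np.random.randn(r, k) * 0.01   # trainable, Gaussian-initialized
x = np.random.randn(k)

def lora_forward(x):
    # y = Wx + (alpha / r) * B A x
    return W @ x + (alpha / r) * (B @ (A @ x))

# At initialization B = 0, so the adapter contributes nothing:
assert np.allclose(lora_forward(x), W @ x)
```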
Parameter Efficiency
For a single weight matrix W in R^{d x k} with rank r:
| Approach | Trainable Parameters |
|---|---|
| Full fine-tuning | d * k |
| LoRA (rank r) | r * (d + k) |
For a typical transformer attention layer with d = k = 4096 and r = 32:
- Full fine-tuning: 16,777,216 parameters
- LoRA: 262,144 parameters (1.56% of full)
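The counts above follow directly from the two formulas in the table:

```python
d = k = 4096
r = 32

full = d * k            # full fine-tuning: every entry of W
lora = r * (d + k)      # LoRA: entries of B (d*r) plus entries of A (r*k)

print(full)                          # 16777216
print(lora)                          # 262144
print(f"{100 * lora / full:.2f}%")   # 1.56%
```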
Target Modules
LoRA adapters are typically injected into the attention and MLP projection layers of transformer models:
- Attention: q_proj, k_proj, v_proj, o_proj -- query, key, value, and output projections.
- MLP: up_proj, down_proj, gate_proj -- feed-forward network projections.
Injecting into all these modules provides comprehensive adaptation of the model's representational capacity.
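With the Hugging Face PEFT library, injecting adapters into all of these modules looks roughly like the following. This is a hedged sketch: the model name and hyperparameter values are illustrative, and the repository itself routes through Unsloth's FastLanguageModel.get_peft_model rather than calling PEFT directly.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative values; the repository's exact settings may differ.
config = LoraConfig(
    r=32,
    lora_alpha=32,
    lora_dropout=0.0,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "up_proj", "down_proj", "gate_proj",     # MLP projections
    ],
    bias="none",
    task_type="CAUSAL_LM",
)

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # illustrative base model
model = get_peft_model(model, config)
model.print_trainable_parameters()  # reports trainable vs. total parameter counts
```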
Dropout
lora_dropout applies dropout within the adapter path during training. Setting it to 0 (as in this repository) disables regularization on the adapter, which is common when the training data is sufficient and overfitting is not a concern.
When to Use
- When fine-tuning a large pre-trained model efficiently without modifying all parameters.
- When GPU memory is limited and full fine-tuning is infeasible.
- When you want to maintain multiple fine-tuned variants of the same base model (each adapter is only a few hundred MB).
- When quick experimentation with different fine-tuning configurations is needed.
When Not to Use
- When maximum fine-tuning quality is needed and full fine-tuning resources are available.
- When the task requires modifying the model architecture beyond linear layers.
- When the rank needed to capture the task is close to the full weight dimensions (negating efficiency gains).
Related Papers
- LoRA: Hu, E., Shen, Y., Wallis, P., et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models.
- QLoRA: Dettmers, T., et al. (2023). QLoRA: Efficient Finetuning of Quantized LLMs.
- PEFT: HuggingFace Parameter-Efficient Fine-Tuning library.