Principle: LoRA Adapter Injection (PacktPublishing/LLM-Engineers-Handbook)
| Field | Value |
|---|---|
| Principle Name | LoRA Adapter Injection |
| Category | Low-Rank Adaptation for Parameter-Efficient Fine-tuning |
| Workflow | LLM_Finetuning |
| Repo | PacktPublishing/LLM-Engineers-Handbook |
| Implemented by | Implementation:PacktPublishing_LLM_Engineers_Handbook_FastLanguageModel_Get_Peft_Model |
Overview
Low-Rank Adaptation (LoRA) is a parameter-efficient fine-tuning technique that injects small, trainable low-rank decomposition matrices into existing model layers. This enables fine-tuning with a fraction of the total parameters (often <1%) while maintaining performance comparable to full fine-tuning. The original pre-trained weights remain frozen, and only the injected adapter weights are updated during training.
Theory
The Full Fine-tuning Problem
Full fine-tuning updates every weight in the model. For a 7B parameter model, this means:
- 7 billion trainable parameters requiring gradient computation and optimizer states.
- Optimizer states (e.g., AdamW) require 2x additional memory (first and second moments).
- Total memory for full fine-tuning can exceed 60-80 GB for a 7B model.
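The memory figure can be reproduced with back-of-envelope arithmetic. This is a sketch, not an exact accounting: the assumed byte counts (fp16 weights and gradients, fp32 AdamW moments) vary with precision and optimizer implementation.

```python
# Back-of-envelope memory estimate for full fine-tuning of a 7B model.
# Assumed byte counts per parameter (mixed-precision training with AdamW):
#   fp16 weights: 2, fp16 gradients: 2, fp32 first + second moments: 4 + 4.
params = 7e9
bytes_per_param = 2 + 2 + 8  # weights + gradients + optimizer moments
total_gb = params * bytes_per_param / 1e9
print(f"~{total_gb:.0f} GB")  # roughly 84 GB, consistent with "exceed 60-80 GB"
```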
LoRA Solution
LoRA hypothesizes that the weight updates during fine-tuning have a low intrinsic rank. Instead of updating the full weight matrix W, LoRA decomposes the update into two low-rank matrices:
W' = W + Delta_W = W + B * A
Where:
W in R^{d x k} -- original frozen weight matrix
B in R^{d x r} -- low-rank up-projection (trainable)
A in R^{r x k} -- low-rank down-projection (trainable)
r << min(d, k) -- rank (controls expressiveness vs. efficiency)
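The decomposition above can be sketched in a few lines of NumPy. The dimensions are illustrative; the point is the shapes and the fact that the merged weight W' = W + B * A has the same shape as W.

```python
import numpy as np

d, k, r = 64, 48, 4  # illustrative dimensions with r << min(d, k)

W = np.random.randn(d, k)   # original frozen weight matrix
B = np.zeros((d, r))        # trainable up-projection, initialized to zeros
A = np.random.randn(r, k)   # trainable down-projection, Gaussian init

delta_W = B @ A             # low-rank update, shape (d, k)
W_prime = W + delta_W       # adapted weight: W' = W + B * A

assert W_prime.shape == (d, k)
```

Because B starts at zero, delta_W is initially the zero matrix, so the adapted model is exactly the base model before any training step.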
Mathematical Basis
For a linear layer y = Wx, the LoRA-modified forward pass becomes:
y = Wx + (alpha / r) * BAx
Where:
- r (rank): Controls the expressiveness of the adapter. A higher rank means more trainable parameters but a more expressive update.
- alpha (lora_alpha): A scaling factor that controls the magnitude of the adapter's contribution; the effective scaling is alpha / r.
- Initialization: B is initialized to zeros and A with random Gaussian values, so BA = 0 at the start of training and the adapter initially leaves the base model's output unchanged.
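The modified forward pass can be sketched directly in NumPy. This keeps the adapter unmerged (as during training) and checks that the zero-initialized B makes the adapter a no-op at the start; the dimensions are illustrative.

```python
import numpy as np

d, k, r, alpha = 64, 48, 8, 16

W = np.random.randn(d, k)          # frozen base weight
B = np.zeros((d, r))               # trainable, zero-initialized
A = np.random.randn(r, k) * 0.01   # trainable, Gaussian-initialized
x = np.random.randn(k)

def lora_forward(x):
    # y = Wx + (alpha / r) * B A x
    return W @ x + (alpha / r) * (B @ (A @ x))

# At initialization B = 0, so the adapter contributes nothing:
assert np.allclose(lora_forward(x), W @ x)
```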
Parameter Efficiency
For a single weight matrix W in R^{d x k} with rank r:
| Approach | Trainable Parameters |
|---|---|
| Full fine-tuning | d * k |
| LoRA (rank r) | r * (d + k) |
For a typical transformer attention layer with d = k = 4096 and r = 32:
- Full fine-tuning: 16,777,216 parameters
- LoRA: 262,144 parameters (1.56% of full)
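The counts above follow directly from the two formulas in the table:

```python
d = k = 4096
r = 32

full = d * k            # full fine-tuning: every entry of W
lora = r * (d + k)      # LoRA: entries of B (d*r) plus entries of A (r*k)

print(full)                          # 16777216
print(lora)                          # 262144
print(f"{100 * lora / full:.2f}%")   # 1.56%
```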
Target Modules
LoRA adapters are typically injected into the attention and MLP projection layers of transformer models:
- Attention: q_proj, k_proj, v_proj, o_proj -- query, key, value, and output projections.
- MLP: up_proj, down_proj, gate_proj -- feed-forward network projections.
Injecting into all these modules provides comprehensive adaptation of the model's representational capacity.
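With the Hugging Face PEFT library, injecting adapters into all of these modules looks roughly like the following. This is a hedged sketch: the model name and hyperparameter values are illustrative, and the repository itself routes through Unsloth's FastLanguageModel.get_peft_model rather than calling PEFT directly.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative values; the repository's exact settings may differ.
config = LoraConfig(
    r=32,
    lora_alpha=32,
    lora_dropout=0.0,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "up_proj", "down_proj", "gate_proj",     # MLP projections
    ],
    bias="none",
    task_type="CAUSAL_LM",
)

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # illustrative base model
model = get_peft_model(model, config)
model.print_trainable_parameters()  # reports trainable vs. total parameter counts
```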
Dropout
lora_dropout applies dropout within the adapter path during training. Setting it to 0 (as in this repository) disables regularization on the adapter, which is common when the training data is sufficient and overfitting is not a concern.
When to Use
- When fine-tuning a large pre-trained model efficiently without modifying all parameters.
- When GPU memory is limited and full fine-tuning is infeasible.
- When you want to maintain multiple fine-tuned variants of the same base model (each adapter is only a few hundred MB).
- When quick experimentation with different fine-tuning configurations is needed.
When Not to Use
- When maximum fine-tuning quality is needed and full fine-tuning resources are available.
- When the task requires modifying the model architecture beyond linear layers.
- When the rank needed to capture the task is close to the full weight dimensions (negating efficiency gains).
Related Papers
- LoRA: Hu, E., Shen, Y., Wallis, P., et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models.
- QLoRA: Dettmers, T., et al. (2023). QLoRA: Efficient Finetuning of Quantized LLMs.
- PEFT: HuggingFace Parameter-Efficient Fine-Tuning library.