
Principle:PacktPublishing LLM Engineers Handbook LoRA Adapter Injection

From Leeroopedia


Field            Value
Principle Name   LoRA Adapter Injection
Category         Low-Rank Adaptation for Parameter-Efficient Fine-tuning
Workflow         LLM_Finetuning
Repo             PacktPublishing/LLM-Engineers-Handbook
Implemented by   Implementation:PacktPublishing_LLM_Engineers_Handbook_FastLanguageModel_Get_Peft_Model

Overview

Low-Rank Adaptation (LoRA) is a parameter-efficient fine-tuning technique that injects small, trainable low-rank decomposition matrices into existing model layers. This enables fine-tuning with a fraction of the total parameters (often <1%) while maintaining performance comparable to full fine-tuning. The original pre-trained weights remain frozen, and only the injected adapter weights are updated during training.

Theory

The Full Fine-tuning Problem

Full fine-tuning updates every weight in the model. For a 7B parameter model, this means:

  • 7 billion trainable parameters requiring gradient computation and optimizer states.
  • Optimizer states (e.g., AdamW) require 2x additional memory (first and second moments).
  • Total memory for full fine-tuning can exceed 60-80 GB for a 7B model.

LoRA Solution

LoRA hypothesizes that the weight updates during fine-tuning have a low intrinsic rank. Instead of updating the full weight matrix W, LoRA decomposes the update into two low-rank matrices:

W' = W + Delta_W = W + B * A

Where:
  W in R^{d x k}       -- original frozen weight matrix
  B in R^{d x r}       -- low-rank up-projection (trainable)
  A in R^{r x k}       -- low-rank down-projection (trainable)
  r << min(d, k)       -- rank (controls expressiveness vs. efficiency)
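The decomposition above can be sketched in a few lines of NumPy. The dimensions below are illustrative, chosen only to make the rank bound visible:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 512, 256, 8              # illustrative sizes with r << min(d, k)

W = rng.standard_normal((d, k))    # frozen pre-trained weight, d x k
B = rng.standard_normal((d, r))    # trainable up-projection, d x r
A = rng.standard_normal((r, k))    # trainable down-projection, r x k

delta_W = B @ A                    # low-rank update, d x k
W_prime = W + delta_W              # W' = W + BA

assert W_prime.shape == (d, k)
# the update can never exceed rank r, no matter how B and A are trained
assert np.linalg.matrix_rank(delta_W) == r
```

Note that only B and A (d*r + r*k entries) receive gradients; W itself is never touched during training.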

Mathematical Basis

For a linear layer y = Wx, the LoRA-modified forward pass becomes:

y = Wx + (alpha / r) * BAx

Where:

  • r (rank): Controls the expressiveness of the adapter. Higher rank = more parameters but more expressive.
  • alpha (lora_alpha): A scaling factor that controls the magnitude of the adapter's contribution. The effective scaling is alpha / r.
  • B is initialized to zeros and A with small random Gaussian values, so Delta_W = BA = 0 at initialization and the adapted model's output initially matches the base model exactly.
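A minimal NumPy sketch of this modified forward pass (sizes are illustrative): because B starts at zero, the LoRA branch contributes nothing until training updates it.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, r = 64, 32, 4
alpha = 8                               # lora_alpha; effective scale is alpha / r

W = rng.standard_normal((d, k))         # frozen weight
A = rng.standard_normal((r, k)) * 0.01  # small Gaussian init
B = np.zeros((d, r))                    # zero init => Delta_W = 0 at start

x = rng.standard_normal(k)

y_base = W @ x
y_lora = W @ x + (alpha / r) * (B @ (A @ x))

# at initialization the adapter contributes nothing
assert np.allclose(y_base, y_lora)
```

Computing B @ (A @ x) rather than (B @ A) @ x keeps the forward pass cheap: two skinny matrix-vector products instead of materializing the full d x k update.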

Parameter Efficiency

For a single weight matrix W in R^{d x k} with rank r:

Approach Trainable Parameters
Full fine-tuning d * k
LoRA (rank r) r * (d + k)

For a typical transformer attention layer with d = k = 4096 and r = 32:

  • Full fine-tuning: 16,777,216 parameters
  • LoRA: 262,144 parameters (1.56% of full)
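The arithmetic behind those two numbers:

```python
d = k = 4096
r = 32

full = d * k            # full fine-tuning: every entry of W
lora = r * (d + k)      # LoRA: entries of B (d*r) plus entries of A (r*k)

assert full == 16_777_216
assert lora == 262_144
print(f"LoRA trains {100 * lora / full:.2f}% of the layer's parameters")
# -> LoRA trains 1.56% of the layer's parameters
```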

Target Modules

LoRA adapters are typically injected into the attention and MLP projection layers of transformer models:

  • Attention: q_proj, k_proj, v_proj, o_proj -- query, key, value, and output projections.
  • MLP: up_proj, down_proj, gate_proj -- feed-forward network projections.

Injecting into all these modules provides comprehensive adaptation of the model's representational capacity.
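As a concrete illustration, targeting all seven projection types can be expressed with the Hugging Face `peft` library's `LoraConfig`. This is a hedged sketch: the hyperparameter values below are common choices, not necessarily the handbook's exact settings.

```python
# Sketch using the Hugging Face `peft` library; values are illustrative,
# not the repository's exact configuration.
from peft import LoraConfig

config = LoraConfig(
    r=32,                # rank of the low-rank decomposition
    lora_alpha=32,       # scaling factor; effective scale is alpha / r
    lora_dropout=0.0,    # no adapter dropout
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections
    ],
    task_type="CAUSAL_LM",
)
```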

Dropout

lora_dropout applies dropout to the adapter output during training. Setting it to 0 (as in this repository) means no regularization on the adapter, which is common when the training data is sufficient and overfitting is not a concern.

When to Use

  • When fine-tuning a large pre-trained model efficiently without modifying all parameters.
  • When GPU memory is limited and full fine-tuning is infeasible.
  • When you want to maintain multiple fine-tuned variants of the same base model (each adapter checkpoint is small, typically tens to a few hundred megabytes, versus a full model copy).
  • When quick experimentation with different fine-tuning configurations is needed.

When Not to Use

  • When maximum fine-tuning quality is needed and full fine-tuning resources are available.
  • When the task requires modifying the model architecture beyond linear layers.
  • When the rank needed to capture the task is close to the full weight dimensions (negating efficiency gains).
