Principle: Axolotl LoRA Adapter Injection
| Knowledge Sources | |
|---|---|
| Domains | Parameter_Efficient_Finetuning, Model_Architecture, Memory_Optimization |
| Last Updated | 2026-02-06 23:00 GMT |
Overview
A parameter-efficient fine-tuning technique that injects trainable low-rank decomposition matrices alongside frozen pre-trained model weights.
Description
LoRA (Low-Rank Adaptation) injects small, trainable matrices into specific layers of a pre-trained model while keeping the original weights frozen. Instead of fine-tuning all model parameters (which for a 7B model means 7 billion trainable parameters), LoRA adds pairs of low-rank matrices (A and B) to targeted layers, reducing the trainable parameter count to typically 0.1-1% of the original.
The key insight is that weight updates during fine-tuning have a low intrinsic rank. By decomposing the update matrix into two smaller matrices (rank decomposition), LoRA achieves comparable quality to full fine-tuning with dramatically fewer trainable parameters and lower memory requirements.
In Axolotl, LoRA injection is handled by the load_lora function which creates a LoraConfig from the YAML configuration and wraps the model using HuggingFace PEFT's get_peft_model.
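A minimal sketch of the corresponding YAML, using Axolotl's lora_* configuration keys (the values and target modules shown are illustrative, not recommendations):

```yaml
adapter: lora            # or qlora to combine with a 4-bit quantized base
lora_r: 16               # rank of the A/B matrices
lora_alpha: 32           # scaling factor (update is scaled by alpha / r)
lora_dropout: 0.05
lora_target_modules:     # which layers receive the injected adapters
  - q_proj
  - v_proj
```

Axolotl reads these fields into a PEFT LoraConfig and wraps the base model with get_peft_model, so only the injected A/B matrices are trainable.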
Usage
Use LoRA adapter injection when:
- Fine-tuning large models with limited GPU memory
- Using QLoRA (combined with 4-bit quantization)
- Training task-specific adapters that can be swapped at inference
- Requiring multiple specialized models from a single base model
Theoretical Basis
For a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, LoRA adds a low-rank update:

$$W = W_0 + \Delta W = W_0 + BA$$

Where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, with rank $r \ll \min(d, k)$.
Key parameters:
- Rank (r): Controls adapter capacity. Typical values: 8-64
- Alpha (α): Scaling factor; the LoRA update is multiplied by $\alpha / r$. Controls update magnitude
- Target modules: Which layers receive LoRA injection (attention, MLP, etc.)
- Dropout: Applied to LoRA layers for regularization
Forward pass:
```python
# Pseudo-code for the LoRA forward pass
# W_0 is frozen; only A and B receive gradients during training.
# B is initialized to zero and A to small random values, so the
# update B @ A is zero at initialization and training starts
# exactly from the base model.
h = W_0 @ x + (alpha / r) * (B @ (A @ x))
```
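The forward pass above can be checked with a runnable NumPy sketch. The dimensions, weights, and inputs here are toy stand-ins, not real model tensors; the point is that with B initialized to zero, the adapted layer reproduces the frozen layer's output exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 64, 32, 8, 16

W_0 = rng.normal(size=(d, k))            # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(r, k))  # trainable, random init
B = np.zeros((d, r))                     # trainable, zero init
x = rng.normal(size=(k,))

def lora_forward(x):
    # Base path plus the scaled low-rank update; computing
    # B @ (A @ x) avoids materializing the d x k matrix B @ A.
    return W_0 @ x + (alpha / r) * (B @ (A @ x))

# At initialization B = 0, so the LoRA model matches the base model.
assert np.allclose(lora_forward(x), W_0 @ x)
```

Once training updates B away from zero, the same forward pass applies the learned low-rank correction on top of the frozen weights.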
Memory savings:
- Full fine-tuning: $d \times k$ trainable parameters per layer
- LoRA: $r \times (d + k)$ trainable parameters per layer (A plus B)
- For $d = k = 4096$, $r = 16$: 131,072 vs 16,777,216 parameters, a ~99.2% reduction
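The arithmetic behind these counts, for a single $d \times k$ projection (a short illustrative script, not Axolotl code):

```python
# Trainable-parameter counts for one d x k weight matrix with LoRA rank r
d = k = 4096
r = 16

full = d * k        # full fine-tuning: every entry of W_0 trains
lora = r * (d + k)  # LoRA: A is r x k, B is d x r

reduction = 100 * (1 - lora / full)
print(full, lora, round(reduction, 1))  # 16777216 131072 99.2
```

Because the LoRA count grows with $r(d + k)$ rather than $dk$, the savings shrink as the rank grows, but remain large for typical ranks of 8-64.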