# Principle: LMSYS FastChat LoRA Adapter Injection
| Knowledge Sources | |
|---|---|
| Domains | NLP, Training, Parameter-Efficient Fine-Tuning |
| Last Updated | 2026-02-07 14:00 GMT |
## Overview
Injecting Low-Rank Adaptation (LoRA) adapter layers into a pretrained language model, enabling parameter-efficient fine-tuning by training only a small number of low-rank decomposition matrices while keeping the base model weights frozen.
## Description
LoRA (Low-Rank Adaptation of Large Language Models), introduced by Hu et al. (2021), is a parameter-efficient fine-tuning method that avoids modifying the original pretrained weights. Instead, it injects small trainable rank-decomposition matrices alongside existing weight matrices, typically in the attention layers. This approach reduces the number of trainable parameters by orders of magnitude while achieving performance comparable to full fine-tuning.
The LoRA adapter injection process in FastChat involves several key decisions:
- Rank Selection (r) -- The rank `r` of the decomposition matrices determines the expressivity of the adaptation. FastChat defaults to `r=8`. Lower ranks reduce memory and compute but may underfit; higher ranks increase capacity but approach the cost of full fine-tuning.
- Alpha Scaling -- The `lora_alpha` parameter controls the scaling factor applied to the LoRA output. The effective scaling is `alpha / r`. FastChat defaults to `lora_alpha=16`, giving an effective scaling of `16/8 = 2.0`. This hyperparameter adjusts how much the adaptation affects the output relative to the frozen base weights.
- Target Module Selection -- LoRA adapters can be attached to any linear layer. FastChat targets the attention query and value projection matrices: `["q_proj", "v_proj"]`. These are the most impactful modules for adaptation based on empirical studies (Hu et al., 2021). Other common choices include adding `k_proj`, `o_proj`, or MLP layers.
- Dropout -- A dropout rate of `0.05` is applied to the LoRA layers to regularize the adaptation and prevent overfitting, which is particularly important when fine-tuning on small datasets.
- Task Type -- Set to `"CAUSAL_LM"` to indicate that the model is a causal (autoregressive) language model. This affects how the PEFT library handles the output head.
- Bias Handling -- The `bias` parameter (default `"none"`) controls whether bias terms are trained alongside the LoRA adapters. Options are `"none"` (no bias training), `"all"` (train all biases), and `"lora_only"` (train only the biases in LoRA-modified layers).
- Preparation for k-bit Training -- When using QLoRA (`q_lora=True`), the model must be prepared for quantized training via `prepare_model_for_kbit_training()`. This function freezes the base model, casts layer norms to float32 for numerical stability, and optionally enables gradient checkpointing.
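The decisions above map directly onto a single PEFT configuration object. The following is a minimal sketch using the Hugging Face `peft` library with FastChat's default values; the model name is a placeholder, not a FastChat-specific identifier.

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

# FastChat's default LoRA hyperparameters (see the list above)
lora_config = LoraConfig(
    r=8,                                  # rank of the decomposition matrices
    lora_alpha=16,                        # effective scaling = alpha / r = 2.0
    target_modules=["q_proj", "v_proj"],  # attention query/value projections
    lora_dropout=0.05,                    # regularizes the adapter layers
    bias="none",                          # do not train bias terms
    task_type=TaskType.CAUSAL_LM,
)

# Wrap a pretrained causal LM; only the injected adapters are trainable.
base_model = AutoModelForCausalLM.from_pretrained("model-name-here")  # placeholder
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # reports trainable vs. total parameter counts
```

`get_peft_model` walks the module tree, replaces each targeted linear layer with a LoRA-wrapped version, and freezes everything else.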
## Usage
Use this pattern when:
- Fine-tuning a large language model with limited GPU memory or compute budget.
- You want to maintain the base model weights unchanged and only store small adapter files.
- Multiple task-specific adaptations of the same base model are needed (each adapter is only a few MB).
- Combining with QLoRA for maximum memory efficiency on consumer hardware.
Do not use this pattern when:
- Full fine-tuning is feasible and maximum adaptation quality is required.
- The model architecture does not have identifiable linear projection layers for targeting.
## Theoretical Basis
Low-Rank Decomposition: For a pretrained weight matrix W_0 of dimension d x d, LoRA models the weight update as a low-rank product:
W = W_0 + delta_W = W_0 + B * A
where B is a d x r matrix and A is an r x d matrix, with rank r << d. The number of trainable parameters per adapted layer is 2 * d * r, compared to d * d for full fine-tuning.
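As a quick check on these counts, the per-layer savings can be computed directly (a pure-Python sketch; `d` and `r` follow the definitions above):

```python
def lora_param_counts(d: int, r: int) -> tuple[int, int]:
    """Trainable parameters for one adapted d x d layer: LoRA vs. full fine-tuning."""
    lora = 2 * d * r   # B (d x r) plus A (r x d)
    full = d * d       # a dense weight update
    return lora, full

lora, full = lora_param_counts(d=4096, r=8)
print(lora, full)       # 65536 16777216
print(lora / full)      # ~0.0039, i.e. ~0.4% of the dense update
```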
Initialization: Matrix A is initialized with a random Gaussian distribution and B is initialized to zero, so at the start of training delta_W = B * A = 0 and the model behaves identically to the pretrained model.
Forward Pass: For input x, the adapted layer computes:
h = W_0 @ x + (alpha / r) * B @ A @ x
The scaling factor alpha / r controls the magnitude of the adaptation relative to the frozen weights. With FastChat's defaults (alpha=16, r=8), this scaling is 2.0.
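The initialization and forward pass can be demonstrated numerically. This NumPy sketch uses toy dimensions chosen for illustration, keeping the `alpha / r = 2.0` ratio of FastChat's defaults:

```python
import numpy as np

d, r = 16, 4          # toy dimensions; FastChat uses r=8 on 4096-wide projections
alpha = 2 * r         # mirrors FastChat's alpha=16, r=8 ratio of 2.0
rng = np.random.default_rng(0)

W0 = rng.normal(size=(d, d))   # frozen pretrained weight
A = rng.normal(size=(r, d))    # Gaussian-initialized
B = np.zeros((d, r))           # zero-initialized, so delta_W = B @ A = 0

x = rng.normal(size=d)
h = W0 @ x + (alpha / r) * (B @ (A @ x))

# At initialization the adapter contributes nothing:
assert np.allclose(h, W0 @ x)

# Once training updates B, the adapter shifts the output:
B = rng.normal(size=(d, r))
h_adapted = W0 @ x + (alpha / r) * (B @ (A @ x))
assert not np.allclose(h_adapted, W0 @ x)
```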
Parameter Efficiency: For a 7B parameter model with LoRA applied to q_proj and v_proj (each 4096 x 4096 for LLaMA-7B) across 32 layers:
Trainable params = 32 layers * 2 modules * 2 * 4096 * 8 = 4,194,304
Percentage of total = 4.19M / 6.74B ~ 0.06%
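The arithmetic above can be reproduced directly (a pure-Python check; the layer count and hidden size are the LLaMA-7B values stated in the text):

```python
layers = 32        # transformer blocks in LLaMA-7B
modules = 2        # q_proj and v_proj per block
d, r = 4096, 8     # hidden size and LoRA rank

trainable = layers * modules * 2 * d * r
print(f"{trainable:,}")             # 4,194,304
print(f"{trainable / 6.74e9:.2%}")  # 0.06%
```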
Gradient Checkpointing with k-bit Training: When prepare_model_for_kbit_training() is called with use_gradient_checkpointing=True, intermediate activations are recomputed during the backward pass rather than stored in memory. This trades compute time for memory, enabling training of larger models or with larger batch sizes.
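This preparation step can be sketched as follows, assuming a base model loaded in 4-bit via `bitsandbytes`; the model name is a placeholder and the snippet is a setup fragment, not FastChat's exact training code:

```python
from peft import prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the base model quantized to 4 bits (the QLoRA setting); name is a placeholder.
bnb_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "model-name-here", quantization_config=bnb_config
)

# Freezes the base weights, casts layer norms to float32 for stability,
# and enables gradient checkpointing to trade compute time for memory.
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
```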