# Principle: LMSYS FastChat LoRA Adapter Injection
| Knowledge Sources | |
|---|---|
| Domains | NLP, Training, Parameter-Efficient Fine-Tuning |
| Last Updated | 2026-02-07 14:00 GMT |
## Overview
Injecting Low-Rank Adaptation (LoRA) adapter layers into a pretrained language model, enabling parameter-efficient fine-tuning by training only a small number of low-rank decomposition matrices while keeping the base model weights frozen.
## Description
LoRA (Low-Rank Adaptation of Large Language Models), introduced by Hu et al. (2021), is a parameter-efficient fine-tuning method that avoids modifying the original pretrained weights. Instead, it injects small trainable rank-decomposition matrices alongside existing weight matrices, typically in the attention layers. This approach reduces the number of trainable parameters by orders of magnitude while achieving performance comparable to full fine-tuning.
The LoRA adapter injection process in FastChat involves several key decisions:
- Rank Selection (r) -- The rank `r` of the decomposition matrices determines the expressivity of the adaptation. FastChat defaults to `r=8`. Lower ranks reduce memory and compute but may underfit; higher ranks increase capacity but approach the cost of full fine-tuning.
- Alpha Scaling -- The `lora_alpha` parameter controls the scaling factor applied to the LoRA output. The effective scaling is `alpha / r`. FastChat defaults to `lora_alpha=16`, giving an effective scaling of `16/8 = 2.0`. This hyperparameter adjusts how much the adaptation affects the output relative to the frozen base weights.
- Target Module Selection -- LoRA adapters can be attached to any linear layer. FastChat targets the attention query and value projection matrices: `["q_proj", "v_proj"]`. These are the most impactful modules for adaptation based on empirical studies (Hu et al., 2021). Other common choices include adding `k_proj`, `o_proj`, or MLP layers.
- Dropout -- A dropout rate of `0.05` is applied to the LoRA layers to regularize the adaptation and prevent overfitting, which is particularly important when fine-tuning on small datasets.
- Task Type -- Set to `"CAUSAL_LM"` to indicate that the model is a causal (autoregressive) language model. This affects how the PEFT library handles the output head.
- Bias Handling -- The `bias` parameter (default `"none"`) controls whether bias terms are trained alongside the LoRA adapters. Options are `"none"` (no bias training), `"all"` (train all biases), and `"lora_only"` (train only the biases in LoRA-modified layers).
- Preparation for k-bit Training -- When using QLoRA (`q_lora=True`), the model must be prepared for quantized training via `prepare_model_for_kbit_training()`. This function freezes the base model, casts layer norms to float32 for numerical stability, and optionally enables gradient checkpointing.
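The decisions above map directly onto a single PEFT configuration object. The following is a minimal sketch using the Hugging Face `peft` library with FastChat's default values; the model name is a placeholder, not a FastChat-specific identifier.

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

# FastChat's default LoRA hyperparameters (see the list above)
lora_config = LoraConfig(
    r=8,                                  # rank of the decomposition matrices
    lora_alpha=16,                        # effective scaling = alpha / r = 2.0
    target_modules=["q_proj", "v_proj"],  # attention query/value projections
    lora_dropout=0.05,                    # regularizes the adapter layers
    bias="none",                          # do not train bias terms
    task_type=TaskType.CAUSAL_LM,
)

# Wrap a pretrained causal LM; only the injected adapters are trainable.
base_model = AutoModelForCausalLM.from_pretrained("model-name-here")  # placeholder
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # reports trainable vs. total parameter counts
```

`get_peft_model` walks the module tree, replaces each targeted linear layer with a LoRA-wrapped version, and freezes everything else.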
## Usage
Use this pattern when:
- Fine-tuning a large language model with limited GPU memory or compute budget.
- You want to maintain the base model weights unchanged and only store small adapter files.
- Multiple task-specific adaptations of the same base model are needed (each adapter is only a few MB).
- Combining with QLoRA for maximum memory efficiency on consumer hardware.
Do not use this pattern when:
- Full fine-tuning is feasible and maximum adaptation quality is required.
- The model architecture does not have identifiable linear projection layers for targeting.
## Theoretical Basis
Low-Rank Decomposition: For a pretrained weight matrix W_0 of dimension d x d, LoRA models the weight update as a low-rank product:
W = W_0 + delta_W = W_0 + B * A
where B is a d x r matrix and A is an r x d matrix, with rank r << d. The number of trainable parameters per adapted layer is 2 * d * r, compared to d * d for full fine-tuning.
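As a quick check on these counts, the per-layer savings can be computed directly (a pure-Python sketch; `d` and `r` follow the definitions above):

```python
def lora_param_counts(d: int, r: int) -> tuple[int, int]:
    """Trainable parameters for one adapted d x d layer: LoRA vs. full fine-tuning."""
    lora = 2 * d * r   # B (d x r) plus A (r x d)
    full = d * d       # a dense weight update
    return lora, full

lora, full = lora_param_counts(d=4096, r=8)
print(lora, full)       # 65536 16777216
print(lora / full)      # ~0.0039, i.e. ~0.4% of the dense update
```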
Initialization: Matrix A is initialized with a random Gaussian distribution and B is initialized to zero, so at the start of training delta_W = B * A = 0 and the model behaves identically to the pretrained model.
Forward Pass: For input x, the adapted layer computes:
h = W_0 @ x + (alpha / r) * B @ A @ x
The scaling factor alpha / r controls the magnitude of the adaptation relative to the frozen weights. With FastChat's defaults (alpha=16, r=8), this scaling is 2.0.
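The initialization and forward pass can be demonstrated numerically. This NumPy sketch uses toy dimensions chosen for illustration, keeping the `alpha / r = 2.0` ratio of FastChat's defaults:

```python
import numpy as np

d, r = 16, 4          # toy dimensions; FastChat uses r=8 on 4096-wide projections
alpha = 2 * r         # mirrors FastChat's alpha=16, r=8 ratio of 2.0
rng = np.random.default_rng(0)

W0 = rng.normal(size=(d, d))   # frozen pretrained weight
A = rng.normal(size=(r, d))    # Gaussian-initialized
B = np.zeros((d, r))           # zero-initialized, so delta_W = B @ A = 0

x = rng.normal(size=d)
h = W0 @ x + (alpha / r) * (B @ (A @ x))

# At initialization the adapter contributes nothing:
assert np.allclose(h, W0 @ x)

# Once training updates B, the adapter shifts the output:
B = rng.normal(size=(d, r))
h_adapted = W0 @ x + (alpha / r) * (B @ (A @ x))
assert not np.allclose(h_adapted, W0 @ x)
```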
Parameter Efficiency: For a 7B parameter model with LoRA applied to q_proj and v_proj (each 4096 x 4096 for LLaMA-7B) across 32 layers:
Trainable params = 32 layers * 2 modules * 2 * 4096 * 8 = 4,194,304
Percentage of total = 4.19M / 6.74B ~ 0.06%
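The arithmetic above can be reproduced directly (a pure-Python check; the layer count and hidden size are the LLaMA-7B values stated in the text):

```python
layers = 32        # transformer blocks in LLaMA-7B
modules = 2        # q_proj and v_proj per block
d, r = 4096, 8     # hidden size and LoRA rank

trainable = layers * modules * 2 * d * r
print(f"{trainable:,}")             # 4,194,304
print(f"{trainable / 6.74e9:.2%}")  # 0.06%
```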
Gradient Checkpointing with k-bit Training: When prepare_model_for_kbit_training() is called with use_gradient_checkpointing=True, intermediate activations are recomputed during the backward pass rather than stored in memory. This trades compute time for memory, enabling training of larger models or with larger batch sizes.
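This preparation step can be sketched as follows, assuming a base model loaded in 4-bit via `bitsandbytes`; the model name is a placeholder and the snippet is a setup fragment, not FastChat's exact training code:

```python
from peft import prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the base model quantized to 4 bits (the QLoRA setting); name is a placeholder.
bnb_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(
    "model-name-here", quantization_config=bnb_config
)

# Freezes the base weights, casts layer norms to float32 for stability,
# and enables gradient checkpointing to trade compute time for memory.
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
```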