Workflow:LLMBook zh LLMBook zh github io LoRA Finetuning
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Fine_Tuning, Parameter_Efficient |
| Last Updated | 2026-02-08 04:30 GMT |
Overview
End-to-end Low-Rank Adaptation (LoRA) fine-tuning workflow for parameter-efficient training of large language models, from adapter injection through training to weight merging.
Description
This workflow implements parameter-efficient fine-tuning using Low-Rank Adaptation (LoRA). Instead of updating all model parameters, LoRA injects small trainable low-rank matrices into the linear layers of a frozen base model. For each target weight matrix W of dimension d_in x d_out, two small matrices A (d_in x r) and B (r x d_out) are added, where the rank r is much smaller than both d_in and d_out. During training, only A and B are updated while the original weights remain frozen, which typically reduces the trainable parameters to less than 1% of the total. After training, the LoRA adapter weights can be merged back into the base model, so the deployed model carries no extra inference overhead. The workflow uses the HuggingFace PEFT library for adapter management and supports DeepSpeed ZeRO-3 for distributed training.
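To make the parameter saving concrete, the following sketch counts trainable parameters for a single weight matrix with and without LoRA. The function name and the example sizes (a 4096 x 4096 projection with r = 8) are illustrative assumptions, not values taken from the workflow.

```python
def lora_param_counts(d_in: int, d_out: int, r: int) -> tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_in * d_out        # every entry of W is trainable in full fine-tuning
    lora = r * (d_in + d_out)  # A is d_in x r, B is r x d_out
    return full, lora


# Illustrative sizes only: one 4096 x 4096 projection adapted with rank 8.
full, lora = lora_param_counts(d_in=4096, d_out=4096, r=8)
print(f"full: {full:,}  lora: {lora:,}  fraction: {lora / full:.4f}")
```

For these sizes the adapter has 65,536 parameters against 16,777,216 in the full matrix, roughly 0.4% for that layer. The model-wide fraction is usually lower still, because only some layers receive adapters.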
Usage
Execute this workflow when you need to fine-tune a large language model but have limited GPU memory or want to train multiple task-specific adapters that share a common base model. LoRA is particularly effective when you cannot afford full fine-tuning (e.g., 7B+ parameter models on consumer GPUs) or when you want to maintain separate lightweight adapters for different downstream tasks.
Execution Steps
Step 1: LoRA Layer Design
Understand the low-rank adapter architecture that will be injected into the model. Each LoRA-modified linear layer extends the standard nn.Linear with two additional small matrices: A (input to low-rank space) and B (low-rank space to output). Matrix A is initialized with a normal distribution (std=0.02) and B is initialized to zeros, ensuring the adapter initially produces zero output and does not disturb the pre-trained weights. A dropout layer is applied to the input before passing through A for regularization.
Key considerations:
- The rank r controls the expressiveness vs. efficiency tradeoff (typical values: 8-64)
- Zero initialization of B ensures training starts from the pre-trained model's behavior
- The forward pass computes: output = W*x + B(A(dropout(x)))
- Dropout is applied only to the LoRA path, not the frozen weight path
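The layer described in this step can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the workflow's actual source: the class name `LoRALinear`, the attribute names, and the default values of `r` and `lora_dropout` are hypothetical. Only the initialization (normal with std=0.02 for A, zeros for B) and the forward formula come from the text above. The sketch applies no alpha/r scaling, matching the formula in this step.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRALinear(nn.Linear):
    """nn.Linear with a frozen weight plus a trainable low-rank update."""

    def __init__(self, in_features, out_features, r=8, lora_dropout=0.05, bias=True):
        super().__init__(in_features, out_features, bias=bias)
        self.lora_dropout = nn.Dropout(lora_dropout)
        # A maps the input into the rank-r space; B maps it back to the output space.
        self.lora_A = nn.Linear(in_features, r, bias=False)
        self.lora_B = nn.Linear(r, out_features, bias=False)
        nn.init.normal_(self.lora_A.weight, std=0.02)
        nn.init.zeros_(self.lora_B.weight)
        # Freeze the pre-trained weight and bias; only A and B receive gradients.
        self.weight.requires_grad_(False)
        if self.bias is not None:
            self.bias.requires_grad_(False)

    def forward(self, x):
        frozen = F.linear(x, self.weight, self.bias)             # W*x (+ bias)
        update = self.lora_B(self.lora_A(self.lora_dropout(x)))  # B(A(dropout(x)))
        return frozen + update


layer = LoRALinear(16, 32, r=4)
x = torch.randn(2, 16)
# B starts at zero, so the layer initially reproduces the frozen linear map.
print(torch.allclose(layer(x), F.linear(x, layer.weight, layer.bias)))
```

Because B is zero at initialization, the adapter contributes nothing on the first forward pass even with dropout active, which is the property the zero-initialization bullet relies on.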
Step 2: PEFT Configuration
Configure the LoRA adapter using the PEFT library's LoraConfig. Specify the task type (causal language modeling), rank r, scaling factor alpha, and dropout probability. Apply the configuration to the base model using get_peft_model, which automatically identifies target linear layers and injects LoRA adapters while freezing all original parameters.
Key considerations:
- The alpha parameter controls the scaling of LoRA outputs (effective scaling = alpha/r)
- If target_modules is not set, PEFT falls back to per-architecture defaults; for LLaMA-style models these are the attention query and value projections (q_proj, v_proj)
- The PEFT wrapper handles parameter freezing and gradient routing automatically
- TaskType.CAUSAL_LM ensures proper configuration for decoder-only models
Step 3: LoRA Training
Train the model with only the LoRA adapter parameters being updated. The training loop is identical to standard fine-tuning but operates on far fewer parameters. The HuggingFace Trainer manages the optimization process, and only the small adapter weights receive gradient updates. Checkpoints save only the adapter weights, not the full model.
Key considerations:
- Only adapter parameters (typically less than 1% of total) are trainable
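- Training needs much less GPU memory than full fine-tuning because gradients and optimizer states are kept only for the adapter parameters; per-step speedups are usually more modest, since the forward and backward passes still run through the full model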
- BF16 mixed precision is used for additional memory savings
- Each checkpoint contains only the small adapter weights
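The sketch below illustrates two points from this step on a toy model: the optimizer only ever sees adapter parameters, and an adapter-only checkpoint is just the subset of the state dict belonging to the adapter. It uses a plain PyTorch loop rather than the HuggingFace Trainer, and the `ToyAdapted` class, the `adapter_state_dict` helper, and all sizes are hypothetical stand-ins chosen for illustration.

```python
import torch
import torch.nn as nn


class ToyAdapted(nn.Module):
    """Stand-in for a PEFT-wrapped model: frozen base plus two LoRA matrices."""

    def __init__(self, d=64, r=4):
        super().__init__()
        self.base = nn.Linear(d, d)
        self.base.requires_grad_(False)                       # frozen base weights
        self.lora_A = nn.Parameter(torch.randn(r, d) * 0.02)  # normal init
        self.lora_B = nn.Parameter(torch.zeros(d, r))         # zero init

    def forward(self, x):
        return self.base(x) + x @ self.lora_A.T @ self.lora_B.T


def adapter_state_dict(model):
    """Keep only adapter tensors, mimicking an adapter-only checkpoint."""
    return {k: v for k, v in model.state_dict().items() if "lora_" in k}


model = ToyAdapted()
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-3)  # base weights never enter the optimizer

base_before = model.base.weight.clone()
x, target = torch.randn(8, 64), torch.randn(8, 64)
for _ in range(2):
    loss = nn.functional.mse_loss(model(x), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(sorted(adapter_state_dict(model)))            # adapter tensors only
print(torch.equal(model.base.weight, base_before))  # base weights are untouched
```

Since optimizer state is allocated per trainable parameter, restricting the optimizer to the adapter tensors is where most of the memory saving comes from.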
Step 4: Adapter Merging
After training completes, merge the learned LoRA adapter weights back into the base model to produce a standalone model. Load each checkpoint as a PEFT model, call merge_and_unload to arithmetically fold the adapter into the frozen base weights, and save the resulting full model. With the shapes given in the Description, the merge is W_merged = W + (alpha/r) * A*B; in PyTorch's (d_out x d_in) weight layout the same update is written as B*A. The merged model has the same architecture as the original and requires no special runtime support.
Key considerations:
- DeepSpeed ZeRO-3 must be disabled before merging: ZeRO-3 shards parameters across ranks, whereas merging needs every weight fully materialized
- Each checkpoint directory is processed independently
- The merged model is saved alongside the tokenizer for self-contained deployment
- After merging, the model has zero inference overhead compared to the original architecture
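The merging procedure above could be scripted roughly as follows. This is a sketch under assumptions: the function names, the `merged` output subdirectory, and the idea of loading the tokenizer from the base model are all hypothetical choices, and checkpoints are assumed to follow the Trainer's `checkpoint-<step>` directory naming. It should be run in a plain (non-ZeRO-3) process.

```python
from pathlib import Path


def find_checkpoint_dirs(output_dir):
    """Return checkpoint-<step> subdirectories of output_dir, ordered by step."""
    dirs = [
        p for p in Path(output_dir).glob("checkpoint-*")
        if p.is_dir() and p.name.split("-")[-1].isdigit()
    ]
    return sorted(dirs, key=lambda p: int(p.name.split("-")[-1]))


def merge_checkpoint(base_model_name, checkpoint_dir, merged_dir):
    """Fold one adapter checkpoint into the base weights and save a full model."""
    # Imported lazily so the directory helper above works without these packages.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = AutoModelForCausalLM.from_pretrained(base_model_name)
    model = PeftModel.from_pretrained(base, str(checkpoint_dir))
    merged = model.merge_and_unload()  # returns the base model with adapters folded in
    merged.save_pretrained(str(merged_dir))
    # Save the tokenizer next to the weights so the directory is self-contained.
    AutoTokenizer.from_pretrained(base_model_name).save_pretrained(str(merged_dir))


def merge_all(base_model_name, output_dir):
    """Process every checkpoint directory independently."""
    for ckpt in find_checkpoint_dirs(output_dir):
        merge_checkpoint(base_model_name, ckpt, ckpt / "merged")
```

Each merged directory can then be loaded with `AutoModelForCausalLM.from_pretrained` like any ordinary model, without PEFT installed.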