Workflow:Lm sys FastChat LoRA QLoRA Finetuning
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Fine_Tuning, Parameter_Efficient |
| Last Updated | 2026-02-07 04:00 GMT |
Overview
End-to-end process for parameter-efficient fine-tuning of LLaMA-based models using LoRA or QLoRA (4-bit quantization) with DeepSpeed, enabling training of large models on consumer-grade GPUs.
Description
This workflow covers the procedure for training LoRA (Low-Rank Adaptation) adapters on top of a frozen base model. When QLoRA mode is enabled, the base model weights are quantized to 4-bit NormalFloat format using BitsAndBytes, dramatically reducing memory requirements. The LoRA adapter injects small trainable rank-decomposition matrices into the model's attention layers (q_proj, v_proj by default). Only these adapter weights are trained, representing less than 1% of total parameters. The workflow uses DeepSpeed ZeRO-2 for efficient distributed training and saves only the adapter weights for deployment.
Usage
Execute this workflow when you need to fine-tune a large language model but have limited GPU resources (e.g., a single GPU with 16-24GB VRAM). QLoRA enables training of 7B+ parameter models on consumer GPUs. This is also appropriate when you want to maintain the base model unchanged and create lightweight, swappable adapters for different tasks.
Execution Steps
Step 1: Environment Setup
Install FastChat with training dependencies along with PEFT and BitsAndBytes libraries. For QLoRA, bitsandbytes>=0.39.0 and transformers>=4.30.0 are required. DeepSpeed is needed for the distributed training launcher.
Key considerations:
- QLoRA requires bitsandbytes with CUDA support
- Flash Attention can be optionally enabled for LLaMA models
- ZeRO-3 is incompatible with QLoRA; use ZeRO-2 instead
- ZeRO-3 does work with standard LoRA (without quantization)
Step 2: Data Preparation
Prepare conversation data in ShareGPT JSON format, identical to the full SFT workflow. The same data preprocessing pipeline applies: the conversations array with alternating human/gpt turns, cleaned and split to fit within the model's max sequence length.
Key considerations:
- Same data format as full SFT: JSON with "conversations" field
- Use data/dummy_conversation.json for testing
- The preprocessing and tokenization logic is shared with the full SFT train.py
Step 3: Model Loading with Quantization
Load the base model with optional 4-bit quantization. In QLoRA mode, the BitsAndBytesConfig applies NF4 quantization with double quantization for additional memory savings. The compute dtype is set based on the training precision (fp16/bf16). Device mapping is configured for distributed training compatibility.
Key considerations:
- 4-bit NF4 quantization with double quantization is the default QLoRA configuration
- Device mapping must be set correctly for DDP (one GPU per rank)
- FSDP and ZeRO-3 are incompatible with QLoRA; a warning is logged
- Without QLoRA, the model loads in full precision with standard LoRA
Step 4: LoRA Adapter Injection
Configure and inject LoRA adapters into the model. The LoRA configuration specifies rank (r), alpha scaling factor, dropout, and target modules. For QLoRA, the model is first prepared for k-bit training which handles gradient computation through quantized weights. The PEFT library wraps the model with trainable adapter layers.
Key considerations:
- Default target modules: q_proj and v_proj (attention query and value projections)
- Default hyperparameters: r=8, alpha=16, dropout=0.05
- prepare_model_for_kbit_training handles quantized gradient flow in QLoRA mode
- Flash Attention compatibility requires casting norm and embedding layers to compute dtype
- Gradient checkpointing requires enabling input_require_grads on the model
Step 5: Distributed Training with DeepSpeed
Launch the training loop using HuggingFace Trainer with DeepSpeed ZeRO-2 backend. The tokenizer and data module are configured identically to the full SFT workflow. Training uses the same supervised data module with conversation preprocessing and target masking.
Key considerations:
- Use the DeepSpeed ZeRO-2 config from playground/deepspeed_config_s2.json
- Standard hyperparameters: lr 2e-5, 3 epochs, cosine scheduler, warmup 0.03
- The pad token is set to the unknown token
- Checkpoint resumption is automatic if existing checkpoints are detected
- Training can be launched via: deepspeed fastchat/train/train_lora.py --deepspeed ...
Step 6: Adapter Saving
Save only the LoRA adapter weights. In ZeRO-3 mode, the consolidated 16-bit state dict is gathered across ranks. In other modes, the LoRA parameters are extracted from the full model state using a custom extraction function that handles DeepSpeed ZeRO parameter partitioning. Only rank 0 performs the save.
Key considerations:
- Only adapter weights (lora_ parameters) are saved, not the full base model
- ZeRO-3 uses the engine's internal consolidated state dict
- The bias saving strategy is configurable (none, all, lora_only)
- The saved adapter can be loaded with PEFT's from_pretrained for inference