Workflow:Lm sys FastChat LoRA QLoRA Finetuning

Knowledge Sources	FastChat PEFT Documentation BitsAndBytes QLoRA Paper
Domains	LLMs, Fine_Tuning, Parameter_Efficient
Last Updated	2026-02-07 04:00 GMT

Overview

End-to-end process for parameter-efficient fine-tuning of LLaMA-based models using LoRA or QLoRA (4-bit quantization) with DeepSpeed, enabling training of large models on consumer-grade GPUs.

Description

This workflow covers the procedure for training LoRA (Low-Rank Adaptation) adapters on top of a frozen base model. When QLoRA mode is enabled, the base model weights are quantized to 4-bit NormalFloat format using BitsAndBytes, dramatically reducing memory requirements. The LoRA adapter injects small trainable rank-decomposition matrices into the model's attention layers (q_proj, v_proj by default). Only these adapter weights are trained, representing less than 1% of total parameters. The workflow uses DeepSpeed ZeRO-2 for efficient distributed training and saves only the adapter weights for deployment.

Usage

Execute this workflow when you need to fine-tune a large language model but have limited GPU resources (e.g., a single GPU with 16-24GB VRAM). QLoRA enables training of 7B+ parameter models on consumer GPUs. This is also appropriate when you want to maintain the base model unchanged and create lightweight, swappable adapters for different tasks.

Execution Steps

Step 1: Environment Setup

Install FastChat with training dependencies along with PEFT and BitsAndBytes libraries. For QLoRA, bitsandbytes>=0.39.0 and transformers>=4.30.0 are required. DeepSpeed is needed for the distributed training launcher.

Key considerations:

QLoRA requires bitsandbytes with CUDA support
Flash Attention can be optionally enabled for LLaMA models
ZeRO-3 is incompatible with QLoRA; use ZeRO-2 instead
ZeRO-3 does work with standard LoRA (without quantization)

Step 2: Data Preparation

Prepare conversation data in ShareGPT JSON format, identical to the full SFT workflow. The same data preprocessing pipeline applies: the conversations array with alternating human/gpt turns, cleaned and split to fit within the model's max sequence length.

Key considerations:

Same data format as full SFT: JSON with "conversations" field
Use data/dummy_conversation.json for testing
The preprocessing and tokenization logic is shared with the full SFT train.py

Step 3: Model Loading with Quantization

Load the base model with optional 4-bit quantization. In QLoRA mode, the BitsAndBytesConfig applies NF4 quantization with double quantization for additional memory savings. The compute dtype is set based on the training precision (fp16/bf16). Device mapping is configured for distributed training compatibility.

Key considerations:

4-bit NF4 quantization with double quantization is the default QLoRA configuration
Device mapping must be set correctly for DDP (one GPU per rank)
FSDP and ZeRO-3 are incompatible with QLoRA; a warning is logged
Without QLoRA, the model loads in full precision with standard LoRA

Step 4: LoRA Adapter Injection

Configure and inject LoRA adapters into the model. The LoRA configuration specifies rank (r), alpha scaling factor, dropout, and target modules. For QLoRA, the model is first prepared for k-bit training which handles gradient computation through quantized weights. The PEFT library wraps the model with trainable adapter layers.

Key considerations:

Default target modules: q_proj and v_proj (attention query and value projections)
Default hyperparameters: r=8, alpha=16, dropout=0.05
prepare_model_for_kbit_training handles quantized gradient flow in QLoRA mode
Flash Attention compatibility requires casting norm and embedding layers to compute dtype
Gradient checkpointing requires enabling input_require_grads on the model

Step 5: Distributed Training with DeepSpeed

Launch the training loop using HuggingFace Trainer with DeepSpeed ZeRO-2 backend. The tokenizer and data module are configured identically to the full SFT workflow. Training uses the same supervised data module with conversation preprocessing and target masking.

Key considerations:

Use the DeepSpeed ZeRO-2 config from playground/deepspeed_config_s2.json
Standard hyperparameters: lr 2e-5, 3 epochs, cosine scheduler, warmup 0.03
The pad token is set to the unknown token
Checkpoint resumption is automatic if existing checkpoints are detected
Training can be launched via: deepspeed fastchat/train/train_lora.py --deepspeed ...

Step 6: Adapter Saving

Save only the LoRA adapter weights. In ZeRO-3 mode, the consolidated 16-bit state dict is gathered across ranks. In other modes, the LoRA parameters are extracted from the full model state using a custom extraction function that handles DeepSpeed ZeRO parameter partitioning. Only rank 0 performs the save.

Key considerations:

Only adapter weights (lora_ parameters) are saved, not the full base model
ZeRO-3 uses the engine's internal consolidated state dict
The bias saving strategy is configurable (none, all, lora_only)
The saved adapter can be loaded with PEFT's from_pretrained for inference

Execution Diagram

GitHub URL

Workflow Repository