Workflow:Unslothai Unsloth QLoRA SFT Finetuning
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Fine_Tuning, QLoRA |
| Last Updated | 2026-02-07 09:00 GMT |
Overview
End-to-end process for parameter-efficient fine-tuning of large language models using 4-bit QLoRA with Unsloth's optimized training pipeline and TRL's SFTTrainer.
Description
This workflow outlines the standard procedure for fine-tuning Large Language Models on custom datasets using consumer-grade hardware. It leverages Unsloth's unified model loader to automatically detect architecture and apply 4-bit NormalFloat quantization via bitsandbytes, then injects Low-Rank Adapters (LoRA) into the frozen base model's attention and feedforward layers. Training is performed through TRL's SFTTrainer with Unsloth's custom gradient checkpointing that reduces VRAM by 30%. The process covers data preparation with chat templates, model loading with quantization, LoRA adapter injection, supervised fine-tuning, and saving or merging the trained adapter weights back into the base model.
Key capabilities:
- 2-5x training speedup through custom Triton GPU kernels for RoPE, normalization, and cross-entropy
- 70% less VRAM usage enabling 7B+ models on single consumer GPUs
- Support for 40+ model families (Llama, Mistral, Gemma, Qwen, Cohere, Granite, Falcon, etc.)
- 4-bit, 8-bit, 16-bit, and full fine-tuning modes
- Padding-free sequence packing for efficient batching
Usage
Execute this workflow when you have a structured dataset (instruction-tuning, conversational, or ShareGPT format) and need to adapt a base or instruct model to follow domain-specific instructions, but have limited GPU resources (e.g., less than 24GB VRAM). This is the primary entry point for most Unsloth users.
Execution Steps
Step 1: Data Preparation
Load and format the training dataset into the structured prompt format expected by the target model. This involves loading data from HuggingFace Hub (or local files), mapping fields to a consistent template (such as instruction/input/output or multi-turn conversation), and applying the model's chat template for proper tokenization boundaries. Unsloth provides built-in chat template support for 40+ model families and utilities like ShareGPT standardization and response-only training masking.
Key considerations:
- Choose the correct chat template for your model family (e.g., llama-3.1, chatml, gemma, etc.)
- Ensure all examples follow a consistent schema with proper EOS tokens
- Use ShareGPT standardization for multi-turn conversation data
- For raw text (not instruction data), use the RawTextDataLoader utility
- Consider enabling padding-free packing for datasets with variable-length sequences
Step 2: Model Loading
Initialize the language model through Unsloth's unified loader, which auto-detects the model architecture from over 1,250 known mappings and applies the appropriate optimization backend. The loader handles 4-bit NormalFloat quantization via bitsandbytes, configures RoPE scaling for extended context lengths, patches attention layers with optimized Triton kernels, and sets up the tokenizer with any necessary fixes.
Key considerations:
- Choose quantization mode: 4-bit (default, lowest VRAM), 8-bit, 16-bit, or full fine-tuning
- Set max_seq_length according to your data and GPU capacity (RoPE scaling handles extension automatically)
- The loader automatically maps model names to the correct architecture-specific optimization module
- Pre-quantized 4-bit models from unsloth/ on HuggingFace Hub download 4x faster
Step 3: LoRA Adapter Injection
Inject low-rank adapter matrices into the frozen base model's linear layers. Only these small adapter weights are trained, dramatically reducing memory requirements and training time while preserving the base model's capabilities. Unsloth's gradient checkpointing variant further reduces VRAM by offloading intermediate activations.
What happens:
- Original weight matrix W remains frozen
- Two small matrices A and B are added: W' = W + BA
- Target modules typically include q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- Only A and B are updated during training (typically less than 1% of total parameters)
- Unsloth's gradient checkpointing mode uses 30% less VRAM than standard checkpointing
Step 4: Training Configuration
Configure the SFTTrainer with training hyperparameters optimized for QLoRA fine-tuning. This includes setting batch size, gradient accumulation, learning rate schedule, precision mode (bf16/fp16 auto-detection), and the optimizer. Unsloth patches the trainer to use its optimized training loop with fused cross-entropy loss and efficient LoRA weight updates.
Key considerations:
- Use adamw_8bit optimizer for memory-efficient training
- Enable bf16 on Ampere+ GPUs (auto-detected via is_bfloat16_supported)
- Adjust gradient_accumulation_steps to simulate larger effective batch sizes
- The UnslothTrainer patches TRL's SFTTrainer for compatibility with gradient checkpointing and packed sequences
- Optionally enable packing for 5x faster training on short sequences
Step 5: Model Training
Execute the supervised fine-tuning training loop. The trainer iterates over the formatted dataset, computing loss only on assistant response tokens (when using train_on_responses_only), and updating only the LoRA adapter weights. Unsloth's custom Triton kernels accelerate the forward and backward passes through fused operations for RoPE, normalization, activation functions, and cross-entropy loss.
What happens:
- Forward pass uses Triton-accelerated kernels for 2-5x speedup
- Cross-entropy loss is computed in chunks to avoid materializing the full logit tensor
- Backward pass uses custom autograd with fused LoRA weight updates
- Gradient checkpointing trades compute for memory, enabling longer context windows
- Training metrics (loss, learning rate) are logged at each step
Step 6: Model Saving
Save the trained model in one of several formats. The LoRA adapter can be saved standalone for later loading, or merged back into the base model at full 16-bit precision. Merging dequantizes the 4-bit base weights, adds the LoRA deltas, and saves the resulting full-precision model. The merged model can then be used directly with any HuggingFace-compatible inference framework.
Save options:
- LoRA adapter only (smallest, requires base model at inference time)
- Merged 16-bit (full model with LoRA weights baked in)
- Merged 4-bit (quantized merged model)
- Push to HuggingFace Hub for sharing and deployment