Workflow:Unslothai Unsloth QLoRA SFT Finetuning

Knowledge Sources	Unsloth Unsloth Docs Fine-tuning Guide
Domains	LLMs, Fine_Tuning, QLoRA
Last Updated	2026-02-07 09:00 GMT

Overview

End-to-end process for parameter-efficient fine-tuning of large language models using 4-bit QLoRA with Unsloth's optimized training pipeline and TRL's SFTTrainer.

Description

This workflow outlines the standard procedure for fine-tuning Large Language Models on custom datasets using consumer-grade hardware. It leverages Unsloth's unified model loader to automatically detect architecture and apply 4-bit NormalFloat quantization via bitsandbytes, then injects Low-Rank Adapters (LoRA) into the frozen base model's attention and feedforward layers. Training is performed through TRL's SFTTrainer with Unsloth's custom gradient checkpointing that reduces VRAM by 30%. The process covers data preparation with chat templates, model loading with quantization, LoRA adapter injection, supervised fine-tuning, and saving or merging the trained adapter weights back into the base model.

Key capabilities:

2-5x training speedup through custom Triton GPU kernels for RoPE, normalization, and cross-entropy
70% less VRAM usage enabling 7B+ models on single consumer GPUs
Support for 40+ model families (Llama, Mistral, Gemma, Qwen, Cohere, Granite, Falcon, etc.)
4-bit, 8-bit, 16-bit, and full fine-tuning modes
Padding-free sequence packing for efficient batching

Usage

Execute this workflow when you have a structured dataset (instruction-tuning, conversational, or ShareGPT format) and need to adapt a base or instruct model to follow domain-specific instructions, but have limited GPU resources (e.g., less than 24GB VRAM). This is the primary entry point for most Unsloth users.

Execution Steps

Step 1: Data Preparation

Load and format the training dataset into the structured prompt format expected by the target model. This involves loading data from HuggingFace Hub (or local files), mapping fields to a consistent template (such as instruction/input/output or multi-turn conversation), and applying the model's chat template for proper tokenization boundaries. Unsloth provides built-in chat template support for 40+ model families and utilities like ShareGPT standardization and response-only training masking.

Key considerations:

Choose the correct chat template for your model family (e.g., llama-3.1, chatml, gemma, etc.)
Ensure all examples follow a consistent schema with proper EOS tokens
Use ShareGPT standardization for multi-turn conversation data
For raw text (not instruction data), use the RawTextDataLoader utility
Consider enabling padding-free packing for datasets with variable-length sequences

Step 2: Model Loading

Initialize the language model through Unsloth's unified loader, which auto-detects the model architecture from over 1,250 known mappings and applies the appropriate optimization backend. The loader handles 4-bit NormalFloat quantization via bitsandbytes, configures RoPE scaling for extended context lengths, patches attention layers with optimized Triton kernels, and sets up the tokenizer with any necessary fixes.

Key considerations:

Choose quantization mode: 4-bit (default, lowest VRAM), 8-bit, 16-bit, or full fine-tuning
Set max_seq_length according to your data and GPU capacity (RoPE scaling handles extension automatically)
The loader automatically maps model names to the correct architecture-specific optimization module
Pre-quantized 4-bit models from unsloth/ on HuggingFace Hub download 4x faster

Step 3: LoRA Adapter Injection

Inject low-rank adapter matrices into the frozen base model's linear layers. Only these small adapter weights are trained, dramatically reducing memory requirements and training time while preserving the base model's capabilities. Unsloth's gradient checkpointing variant further reduces VRAM by offloading intermediate activations.

What happens:

Original weight matrix W remains frozen
Two small matrices A and B are added: W' = W + BA
Target modules typically include q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Only A and B are updated during training (typically less than 1% of total parameters)
Unsloth's gradient checkpointing mode uses 30% less VRAM than standard checkpointing

Step 4: Training Configuration

Configure the SFTTrainer with training hyperparameters optimized for QLoRA fine-tuning. This includes setting batch size, gradient accumulation, learning rate schedule, precision mode (bf16/fp16 auto-detection), and the optimizer. Unsloth patches the trainer to use its optimized training loop with fused cross-entropy loss and efficient LoRA weight updates.

Key considerations:

Use adamw_8bit optimizer for memory-efficient training
Enable bf16 on Ampere+ GPUs (auto-detected via is_bfloat16_supported)
Adjust gradient_accumulation_steps to simulate larger effective batch sizes
The UnslothTrainer patches TRL's SFTTrainer for compatibility with gradient checkpointing and packed sequences
Optionally enable packing for 5x faster training on short sequences

Step 5: Model Training

Execute the supervised fine-tuning training loop. The trainer iterates over the formatted dataset, computing loss only on assistant response tokens (when using train_on_responses_only), and updating only the LoRA adapter weights. Unsloth's custom Triton kernels accelerate the forward and backward passes through fused operations for RoPE, normalization, activation functions, and cross-entropy loss.

What happens:

Forward pass uses Triton-accelerated kernels for 2-5x speedup
Cross-entropy loss is computed in chunks to avoid materializing the full logit tensor
Backward pass uses custom autograd with fused LoRA weight updates
Gradient checkpointing trades compute for memory, enabling longer context windows
Training metrics (loss, learning rate) are logged at each step

Step 6: Model Saving

Save the trained model in one of several formats. The LoRA adapter can be saved standalone for later loading, or merged back into the base model at full 16-bit precision. Merging dequantizes the 4-bit base weights, adds the LoRA deltas, and saves the resulting full-precision model. The merged model can then be used directly with any HuggingFace-compatible inference framework.

Save options:

LoRA adapter only (smallest, requires base model at inference time)
Merged 16-bit (full model with LoRA weights baked in)
Merged 4-bit (quantized merged model)
Push to HuggingFace Hub for sharing and deployment

Execution Diagram

GitHub URL

Workflow Repository