Workflow:Huggingface Peft QLoRA SFT Finetuning

Knowledge Sources	Huggingface PEFT PEFT Documentation TRL Documentation
Domains	LLMs, Fine_Tuning, SFT, Quantization
Last Updated	2026-02-07 06:00 GMT

Overview

End-to-end supervised fine-tuning (SFT) of a causal language model using QLoRA (4-bit quantized LoRA) with TRL's SFTTrainer, supporting single-GPU, multi-GPU DDP, DeepSpeed, and FSDP configurations.

Description

This workflow covers the production-ready approach to supervised fine-tuning of large language models using PEFT. It combines 4-bit NF4 quantization with LoRA adapters (QLoRA) to enable training of 7B+ parameter models on consumer GPUs with limited VRAM. The TRL library's SFTTrainer provides a high-level abstraction that handles chat template formatting, dataset preprocessing, and PEFT integration automatically. The workflow supports scaling from single-GPU to multi-GPU setups via DDP, DeepSpeed ZeRO Stage 3, or FSDP.

Usage

Execute this workflow when you need to perform instruction-tuning or chat fine-tuning of a large language model on a conversational dataset, and you want the simplest, most production-ready training pipeline. This is the recommended workflow for fine-tuning models like Llama-2, Mistral, or similar architectures on instruction-following datasets with limited GPU resources (as low as 16GB VRAM for 7B models).

Execution Steps

Step 1: Configure Training Arguments

Define the training configuration using SFTConfig (from TRL) which extends the standard TrainingArguments with SFT-specific options. Specify the dataset text field, maximum sequence length, packing strategy, and output directory. Also define model arguments (model name, quantization settings, LoRA parameters) and data arguments (dataset name, chat template format).

Key considerations:

SFTConfig inherits all TrainingArguments and adds SFT-specific fields
Choose between packing (concatenating short examples) and standard padding
Set max_seq_length based on the model's context window and your data
Arguments can be specified via CLI, YAML config, or Python dataclasses

Step 2: Load and Quantize the Base Model

Load the pre-trained model with 4-bit quantization using BitsAndBytesConfig. Configure NF4 quantization type with double quantization and bfloat16 compute dtype for optimal memory-quality tradeoff. Load the tokenizer and configure special tokens for the chosen chat template format (e.g., ChatML, Zephyr). Resize model embeddings if new special tokens are added.

Key considerations:

NF4 with double quantization typically offers the best memory-quality tradeoff
Flash Attention 2 can be enabled for compatible models to speed up training
The tokenizer's chat template determines how conversations are formatted
Model embedding layer may need resizing when adding special tokens

Step 3: Configure and Apply LoRA

Create a LoraConfig specifying the adapter parameters: rank, alpha, dropout, target modules, and task type. The SFTTrainer will apply the adapter automatically when the peft_config is passed to it. This step does not require manually calling get_peft_model; the trainer handles the wrapping internally.

Key considerations:

SFTTrainer accepts peft_config directly and handles adapter application
Target modules should match the base model architecture
The LoRA config can optionally enable DoRA for improved quality
Bias can be set to "none", "all", or "lora_only" depending on the use case

Step 4: Prepare the Dataset

Load the training dataset from the Hugging Face Hub or disk. If using a chat-formatted dataset, apply the tokenizer's chat template to convert conversations into the expected token format. Split into training and evaluation sets. The SFTTrainer handles tokenization internally when given raw text data.

Key considerations:

Chat template formatting converts multi-turn conversations to a single string
The dataset can be loaded from the Hub, disk, or provided as a DataFrame
SFTTrainer handles tokenization and data collation automatically
Set append_concat_token and add_special_tokens flags as needed

Step 5: Train with SFTTrainer

Instantiate SFTTrainer with the model, tokenizer, datasets, training arguments, and PEFT config. The trainer handles the full training loop including optimizer setup, gradient accumulation, mixed precision, logging, evaluation, and checkpointing. For multi-GPU setups, launch with the appropriate Accelerate configuration (DDP, DeepSpeed, or FSDP).

Key considerations:

For single GPU, launch directly with python
For multi-GPU DDP, use torchrun or accelerate launch
For DeepSpeed ZeRO-3, provide an Accelerate DeepSpeed config
For FSDP, provide an Accelerate FSDP config
Gradient checkpointing reduces memory at the cost of compute

Step 6: Save the Trained Adapter

Save the final adapter checkpoint after training completes. The SFTTrainer saves only the adapter weights by default. For FSDP training, additional handling is needed to gather the full state dict before saving. The adapter can be pushed to the Hugging Face Hub for sharing.

Key considerations:

Only adapter weights are saved (a few MB vs the full model)
FSDP requires special handling with FullStateDictConfig for saving
The training state (optimizer, scheduler) can be saved separately for resumption
Push to Hub for easy sharing and versioning

Execution Diagram

GitHub URL

Workflow Repository