Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Workflow:Huggingface Peft QLoRA SFT Finetuning

From Leeroopedia
Revision as of 11:03, 16 February 2026 by Admin (talk | contribs) (Auto-imported from workflows/Huggingface_Peft_QLoRA_SFT_Finetuning.md)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)


Knowledge Sources
Domains LLMs, Fine_Tuning, SFT, Quantization
Last Updated 2026-02-07 06:00 GMT

Overview

End-to-end supervised fine-tuning (SFT) of a causal language model using QLoRA (4-bit quantized LoRA) with TRL's SFTTrainer, supporting single-GPU, multi-GPU DDP, DeepSpeed, and FSDP configurations.

Description

This workflow covers the production-ready approach to supervised fine-tuning of large language models using PEFT. It combines 4-bit NF4 quantization with LoRA adapters (QLoRA) to enable training of 7B+ parameter models on consumer GPUs with limited VRAM. The TRL library's SFTTrainer provides a high-level abstraction that handles chat template formatting, dataset preprocessing, and PEFT integration automatically. The workflow supports scaling from single-GPU to multi-GPU setups via DDP, DeepSpeed ZeRO Stage 3, or FSDP.

Usage

Execute this workflow when you need to perform instruction-tuning or chat fine-tuning of a large language model on a conversational dataset, and you want the simplest, most production-ready training pipeline. This is the recommended workflow for fine-tuning models like Llama-2, Mistral, or similar architectures on instruction-following datasets with limited GPU resources (as low as 16GB VRAM for 7B models).

Execution Steps

Step 1: Configure Training Arguments

Define the training configuration using SFTConfig (from TRL) which extends the standard TrainingArguments with SFT-specific options. Specify the dataset text field, maximum sequence length, packing strategy, and output directory. Also define model arguments (model name, quantization settings, LoRA parameters) and data arguments (dataset name, chat template format).

Key considerations:

  • SFTConfig inherits all TrainingArguments and adds SFT-specific fields
  • Choose between packing (concatenating short examples) and standard padding
  • Set max_seq_length based on the model's context window and your data
  • Arguments can be specified via CLI, YAML config, or Python dataclasses

Step 2: Load and Quantize the Base Model

Load the pre-trained model with 4-bit quantization using BitsAndBytesConfig. Configure NF4 quantization type with double quantization and bfloat16 compute dtype for optimal memory-quality tradeoff. Load the tokenizer and configure special tokens for the chosen chat template format (e.g., ChatML, Zephyr). Resize model embeddings if new special tokens are added.

Key considerations:

  • NF4 with double quantization typically offers the best memory-quality tradeoff
  • Flash Attention 2 can be enabled for compatible models to speed up training
  • The tokenizer's chat template determines how conversations are formatted
  • Model embedding layer may need resizing when adding special tokens

Step 3: Configure and Apply LoRA

Create a LoraConfig specifying the adapter parameters: rank, alpha, dropout, target modules, and task type. The SFTTrainer will apply the adapter automatically when the peft_config is passed to it. This step does not require manually calling get_peft_model; the trainer handles the wrapping internally.

Key considerations:

  • SFTTrainer accepts peft_config directly and handles adapter application
  • Target modules should match the base model architecture
  • The LoRA config can optionally enable DoRA for improved quality
  • Bias can be set to "none", "all", or "lora_only" depending on the use case

Step 4: Prepare the Dataset

Load the training dataset from the Hugging Face Hub or disk. If using a chat-formatted dataset, apply the tokenizer's chat template to convert conversations into the expected token format. Split into training and evaluation sets. The SFTTrainer handles tokenization internally when given raw text data.

Key considerations:

  • Chat template formatting converts multi-turn conversations to a single string
  • The dataset can be loaded from the Hub, disk, or provided as a DataFrame
  • SFTTrainer handles tokenization and data collation automatically
  • Set append_concat_token and add_special_tokens flags as needed

Step 5: Train with SFTTrainer

Instantiate SFTTrainer with the model, tokenizer, datasets, training arguments, and PEFT config. The trainer handles the full training loop including optimizer setup, gradient accumulation, mixed precision, logging, evaluation, and checkpointing. For multi-GPU setups, launch with the appropriate Accelerate configuration (DDP, DeepSpeed, or FSDP).

Key considerations:

  • For single GPU, launch directly with python
  • For multi-GPU DDP, use torchrun or accelerate launch
  • For DeepSpeed ZeRO-3, provide an Accelerate DeepSpeed config
  • For FSDP, provide an Accelerate FSDP config
  • Gradient checkpointing reduces memory at the cost of compute

Step 6: Save the Trained Adapter

Save the final adapter checkpoint after training completes. The SFTTrainer saves only the adapter weights by default. For FSDP training, additional handling is needed to gather the full state dict before saving. The adapter can be pushed to the Hugging Face Hub for sharing.

Key considerations:

  • Only adapter weights are saved (a few MB vs the full model)
  • FSDP requires special handling with FullStateDictConfig for saving
  • The training state (optimizer, scheduler) can be saved separately for resumption
  • Push to Hub for easy sharing and versioning

Execution Diagram

GitHub URL

Workflow Repository