Workflow: HuggingFace TRL Supervised Fine-Tuning
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Fine-Tuning, NLP |
| Last Updated | 2026-02-06 16:00 GMT |
Overview
End-to-end process for supervised fine-tuning (SFT) of pretrained language models on instruction-following or domain-specific datasets using the TRL library.
Description
This workflow covers the standard procedure for adapting a pretrained causal language model to follow instructions or perform domain-specific tasks. It uses TRL's SFTTrainer, which extends the HuggingFace Transformers Trainer with features tailored for language model fine-tuning: chat template application, sequence packing, completion-only loss masking, and vision-language model support. The workflow supports both full-parameter fine-tuning and parameter-efficient methods (LoRA, QLoRA) to accommodate various hardware constraints.
Usage
Execute this workflow when you have a conversational or instruction-tuning dataset and need to adapt a base language model to follow instructions, generate in a specific style, or specialize in a domain. This is typically the first training step before applying preference optimization (DPO) or reinforcement learning (GRPO, RLOO).
Execution Steps
Step 1: Environment and Argument Configuration
Configure the training run by defining model parameters, dataset sources, and training hyperparameters. TRL supports configuration via CLI arguments, YAML config files, or direct Python instantiation. Key decisions include the base model, output directory, learning rate, batch size, and whether to use PEFT (LoRA).
Key considerations:
- Use TrlParser to combine YAML config files with CLI overrides
- Set learning rate to ~2e-5 for full fine-tuning, ~2e-4 for LoRA
- Configure gradient accumulation to achieve effective batch sizes of 16-64
- Enable gradient_checkpointing for memory reduction on large models
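The gradient-accumulation guidance above is simple arithmetic. The variable names below mirror the HuggingFace TrainingArguments fields of the same names, but the values are illustrative, not recommendations for any particular model:

```python
# Effective batch size = per-device batch * accumulation steps * number of GPUs.
# Names mirror TrainingArguments fields; values are illustrative.
per_device_train_batch_size = 4
gradient_accumulation_steps = 8
num_gpus = 1

effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus
)
print(effective_batch_size)  # 32, inside the suggested 16-64 range
```

If memory is tight, shrink the per-device batch and raise the accumulation steps to keep the effective batch size constant.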
Step 2: Model Loading
Load the pretrained causal language model with appropriate dtype, attention implementation, and optional quantization. The loader automatically detects whether the model is a text-only or vision-language architecture and selects the correct model class.
Key considerations:
- Use bfloat16 dtype for modern GPUs (Ampere+)
- Enable 4-bit quantization via BitsAndBytesConfig for QLoRA setups
- Set attn_implementation to "flash_attention_2" when available
- The model class is auto-detected: AutoModelForCausalLM for text, AutoModelForImageTextToText for VLMs
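A plain-Python sketch of the text-versus-VLM detection idea. This mirrors the behavior described above, not TRL's actual implementation; the config dicts are toy stand-ins for real transformers configs:

```python
def pick_model_class(config: dict) -> str:
    """Choose a model class name from a (toy) model config dict.

    TRL inspects the real transformers config object; this stand-in
    keys off the fact that VLM configs carry a nested vision config.
    """
    if "vision_config" in config:
        return "AutoModelForImageTextToText"
    return "AutoModelForCausalLM"

text_cls = pick_model_class({"model_type": "llama"})
vlm_cls = pick_model_class({"model_type": "qwen2_vl", "vision_config": {}})
print(text_cls)  # AutoModelForCausalLM
print(vlm_cls)   # AutoModelForImageTextToText
```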
Step 3: PEFT Configuration (Optional)
If using parameter-efficient fine-tuning, configure LoRA adapters that inject low-rank trainable matrices into the frozen base model. This reduces memory usage and training time while preserving the base model's capabilities.
Key considerations:
- Typical LoRA rank (r) is 8-64; alpha is usually 2x the rank
- Target modules default to attention layers (q_proj, v_proj, k_proj, o_proj)
- QLoRA combines 4-bit quantization with LoRA for maximum memory efficiency
- The base model weights remain frozen; only adapter parameters are trained
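The memory savings follow from simple parameter counting: for a weight of shape (d_out, d_in), full fine-tuning updates d_out * d_in values, while LoRA trains only the two low-rank factors. The 4096 dimension below is illustrative (a typical attention projection width):

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    # LoRA injects A (r x d_in) and B (d_out x r); only these are trained.
    return r * d_in + d_out * r

d_in = d_out = 4096          # illustrative projection width
full = d_in * d_out          # params updated by full fine-tuning
lora = lora_param_count(d_in, d_out, r=16)

print(full, lora)  # 16777216 vs 131072 -> under 1% of the full matrix
```

This is the per-matrix count; the total saving depends on how many target modules the adapter is attached to.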
Step 4: Dataset Loading and Preparation
Load the training dataset and apply chat template formatting. TRL's SFTTrainer accepts datasets in conversational format (list of message dicts with role/content) or plain text format. The trainer automatically tokenizes and formats the data.
Key considerations:
- Conversational format: each example has a messages field with role/content dicts
- Prompt-completion format: separate prompt and completion fields
- Enable packing to concatenate short examples into full-length sequences, reducing padding waste
- Use completion_only_loss or assistant_only_loss to mask prompt tokens from the loss computation
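The two dataset layouts above can be shown side by side as plain Python dicts (the question text is made up; the field names match the formats described):

```python
# Conversational format: a "messages" list of role/content dicts.
conversational_example = {
    "messages": [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
    ]
}

# The same data in prompt-completion format: separate fields.
prompt_completion_example = {
    "prompt": "What is the capital of France?",
    "completion": "The capital of France is Paris.",
}

roles = [m["role"] for m in conversational_example["messages"]]
print(roles)  # ['user', 'assistant']
```

For conversational data, the chat template turns the messages list into a single token sequence; for prompt-completion data, the two fields are concatenated.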
Step 5: Trainer Initialization
Create the SFTTrainer instance with the loaded model, processed dataset, training configuration, and optional PEFT config. The trainer sets up the data collator, optimizer, and learning rate scheduler automatically.
Key considerations:
- The trainer handles chat template application internally via apply_chat_template
- Data collators are auto-selected: DataCollatorForLanguageModeling for text, DataCollatorForVisionLanguageModeling for VLMs
- If PEFT config is provided, the model is wrapped with adapter layers automatically
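A toy sketch of the prompt masking a collator performs, assuming -100 as the ignore index (PyTorch's default for cross-entropy). This stands in for the collator behavior described above, not TRL's actual code; the token IDs are arbitrary:

```python
IGNORE_INDEX = -100  # PyTorch's default ignore_index for cross-entropy

def mask_prompt_labels(input_ids: list[int], prompt_len: int) -> list[int]:
    """Copy input_ids into labels, masking the prompt so only completion
    tokens contribute to the loss. A toy stand-in for collator logic."""
    return [
        IGNORE_INDEX if i < prompt_len else tok
        for i, tok in enumerate(input_ids)
    ]

labels = mask_prompt_labels([101, 102, 103, 201, 202], prompt_len=3)
print(labels)  # [-100, -100, -100, 201, 202]
```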
Step 6: Training Execution
Run the training loop, which performs forward passes, loss computation, backpropagation, and optimizer steps across all training examples. The trainer logs metrics including loss, learning rate, token accuracy, and entropy.
Key considerations:
- Loss is computed as token-level cross-entropy on non-masked positions
- Optional DFT loss (Dynamic Fine-Tuning) can be enabled for alternative loss computation
- Evaluation runs at configured intervals to monitor overfitting
- Gradient checkpointing trades compute for memory when enabled
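The loss computation in the first consideration can be written out in miniature: mean negative log-likelihood over positions whose label is not the ignore index. The probabilities below are made up to keep the arithmetic clean:

```python
import math

def masked_token_ce(token_log_probs, labels, ignore_index=-100):
    """Mean negative log-likelihood over non-masked positions.

    token_log_probs[i] is log p(labels[i]) at position i; in a real
    trainer these come from the model's logits via log-softmax.
    """
    kept = [lp for lp, y in zip(token_log_probs, labels) if y != ignore_index]
    return -sum(kept) / len(kept)

# Two prompt tokens are masked; loss averages over three completion tokens.
log_probs = [math.log(p) for p in (0.9, 0.8, 0.5, 0.25, 0.125)]
labels = [-100, -100, 7, 8, 9]
loss = masked_token_ce(log_probs, labels)
print(round(loss, 4))  # 1.3863, i.e. 2*ln(2)
```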
Step 7: Model Saving and Distribution
Save the trained model (or LoRA adapters) to disk and optionally push to the HuggingFace Hub. If using PEFT, only the adapter weights are saved; the base model reference is preserved for later merging.
Key considerations:
- With PEFT, the saved checkpoint is very small (adapter weights only)
- Use push_to_hub to share the model on HuggingFace Hub
- Model cards are auto-generated with training metadata
- The saved model can be loaded directly for inference or used as input for DPO/GRPO training
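The "very small" claim about PEFT checkpoints is easy to quantify with rough arithmetic. Both parameter counts below are illustrative (a 7B-class base model and a LoRA adapter in the tens of millions of parameters), assuming 2 bytes per parameter for bf16 storage:

```python
def checkpoint_size_mb(num_params: int, bytes_per_param: int = 2) -> float:
    # bf16/fp16 checkpoints store 2 bytes per parameter.
    return num_params * bytes_per_param / 1024**2

full_model_params = 7_000_000_000  # e.g. a 7B base model (illustrative)
adapter_params = 20_000_000        # rough LoRA adapter size (illustrative)

full_mb = checkpoint_size_mb(full_model_params)
adapter_mb = checkpoint_size_mb(adapter_params)
print(round(full_mb), round(adapter_mb))  # 13351 vs 38
```

Three orders of magnitude smaller, which is why adapter checkpoints are cheap to version and push to the Hub.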