# Workflow: Hiyouga LLaMA-Factory DPO Preference Alignment
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Fine_Tuning, RLHF, DPO |
| Last Updated | 2026-02-06 19:00 GMT |
## Overview
End-to-end process for aligning a supervised fine-tuned language model with human preferences using Direct Preference Optimization (DPO) and its variants.
## Description
This workflow implements preference-based alignment without requiring a separate reward model. Starting from an SFT-tuned model, DPO directly optimizes the policy to prefer human-chosen responses over rejected ones using a classification-style loss derived from the RLHF objective. LLaMA-Factory supports multiple DPO loss variants, including standard sigmoid DPO, ORPO (odds ratio preference optimization, which combines the SFT and alignment objectives), SimPO (simple preference optimization with length normalization), and IPO. The workflow handles reference model management, pairwise data processing, and preference loss computation.
## Usage
Execute this workflow after completing supervised fine-tuning when you have a dataset of paired preferences (chosen vs. rejected responses to the same prompt) and want to improve the model's alignment with human values. This is simpler and more stable than PPO-based RLHF, requiring no reward model training.
## Execution Steps
### Step 1: Configuration
Define the DPO training job with a YAML configuration specifying the SFT-tuned model as the base, the preference dataset, DPO-specific hyperparameters (beta, loss type), and standard training settings. The configuration must set `stage: dpo` to route to the DPO workflow.
Key considerations:
- The base model should already be SFT-tuned (either a merged model or base + LoRA adapter)
- Set `pref_beta` (typically 0.1) to control the strength of the KL penalty
- Choose `pref_loss`: `sigmoid` (standard DPO), `orpo`, `simpo`, `ipo`, or other variants
- For ORPO/SimPO, no reference model is needed (self-referencing)
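A minimal configuration sketch, expressed here as a Python dict mirroring the YAML keys. The keys `stage`, `pref_beta`, and `pref_loss` come from the text above; the model path, dataset name, template, and training values are illustrative placeholders, not prescribed settings.

```python
# Illustrative DPO configuration; values marked "placeholder" are assumptions.
dpo_config = {
    # Model: start from an SFT-tuned checkpoint (merged model or base + adapter).
    "model_name_or_path": "path/to/sft-model",  # placeholder
    "stage": "dpo",                             # routes to the DPO workflow
    "finetuning_type": "lora",                  # placeholder: LoRA-based DPO
    # Data: a paired-preference dataset (chosen/rejected per prompt).
    "dataset": "dpo_en_demo",                   # placeholder dataset name
    # DPO-specific hyperparameters.
    "pref_beta": 0.1,        # strength of the KL penalty
    "pref_loss": "sigmoid",  # or "orpo", "simpo", "ipo"
    # Standard training settings (illustrative values).
    "learning_rate": 5.0e-6,
    "num_train_epochs": 1.0,
    "output_dir": "saves/dpo",
}

def validate(cfg: dict) -> None:
    """Minimal sanity checks before launching training."""
    assert cfg["stage"] == "dpo", "stage must be 'dpo' for this workflow"
    assert cfg["pref_loss"] in {"sigmoid", "orpo", "simpo", "ipo"}
    assert 0.0 < cfg["pref_beta"] <= 1.0

validate(dpo_config)
```

Serializing this dict to YAML yields a file that can be passed to the training launcher.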
### Step 2: Argument Parsing and Validation
Parse the YAML configuration into argument objects and validate DPO-specific settings. The parser ensures the dataset contains paired preference data (chosen/rejected fields), validates the loss type selection, and configures reference model behavior based on the chosen DPO variant.
What happens:
- Arguments are parsed into ModelArguments, DataArguments, Seq2SeqTrainingArguments, and FinetuningArguments
- The finetuning stage is set to "dpo", which selects the pairwise data processor
- Reference model configuration is determined (separate model for sigmoid/IPO, disabled for ORPO/SimPO)
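The reference-model decision described above can be sketched as a small predicate; this is a simplified illustration of the routing logic, not the library's actual code.

```python
def needs_reference_model(pref_loss: str) -> bool:
    """Return True when the loss variant requires a frozen reference model.

    Sigmoid DPO and IPO compare policy log-probs against a reference;
    ORPO and SimPO are self-referencing, so the reference model is disabled.
    """
    return pref_loss in {"sigmoid", "ipo"}
```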
### Step 3: Pairwise Data Loading
Load preference datasets containing paired responses and process them into the pairwise format required for DPO training. Each example consists of a prompt with both a chosen (preferred) and rejected (dispreferred) response, tokenized with proper label masking.
Key considerations:
- Data must contain chosen and rejected response pairs for each prompt
- Both Alpaca-style and ShareGPT-style preference formats are supported
- The pairwise processor creates concatenated sequences with separate label masks for chosen and rejected
- DPO demo data is provided at `data/dpo_en_demo.json` and `data/dpo_zh_demo.json`
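The pairwise processing can be illustrated with a toy record and a label-masking helper. The `instruction`/`chosen`/`rejected` field names and the masking scheme follow common convention; the exact schema of the demo files and the library's real tokenization may differ.

```python
# A toy preference record (field names are a common convention, not the
# verified schema of the demo files).
record = {
    "instruction": "Explain gradient accumulation in one sentence.",
    "chosen": "It sums gradients over several micro-batches before stepping.",
    "rejected": "It is a kind of optimizer.",
}

IGNORE_INDEX = -100  # conventional label value excluded from the loss

def build_pair(prompt_ids, chosen_ids, rejected_ids):
    """Build (input_ids, labels) for the chosen and rejected sequences.

    Prompt tokens are masked with IGNORE_INDEX so the loss (and the DPO
    log-probabilities) only cover the response tokens.
    """
    def one(resp_ids):
        input_ids = prompt_ids + resp_ids
        labels = [IGNORE_INDEX] * len(prompt_ids) + resp_ids
        return input_ids, labels
    return one(chosen_ids), one(rejected_ids)

# Tiny fake token ids, just to show the shapes.
(chosen_inp, chosen_lab), (rejected_inp, rejected_lab) = build_pair(
    [1, 2, 3], [10, 11], [20]
)
```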
### Step 4: Model and Reference Model Loading
Load the SFT-tuned model (the policy to be optimized) and create a reference model copy. The reference model provides the baseline log-probabilities needed for the DPO loss computation. For variants like ORPO and SimPO that are self-referencing, the reference model step is skipped.
What happens:
- The policy model is loaded with adapter configuration (LoRA or full)
- For standard DPO: a separate reference model is created by either loading a frozen copy or using the adapter's base model
- The reference model remains frozen throughout training
- Both models share the same tokenizer
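A toy sketch of the "frozen copy" idea: in practice the reference is a separately loaded frozen model or the LoRA adapter's base model, but the essential property — same weights, no gradient updates — can be shown with a stand-in class.

```python
import copy

class TinyModel:
    """Stand-in for a policy model: parameters plus per-parameter trainable flags."""
    def __init__(self, params):
        self.params = dict(params)
        self.trainable = {name: True for name in self.params}

def make_reference(policy: TinyModel) -> TinyModel:
    """Create a frozen copy of the policy to serve as the DPO reference.

    This toy version deep-copies the policy and freezes every parameter;
    the reference then stays unchanged for the whole training run.
    """
    ref = copy.deepcopy(policy)
    for name in ref.trainable:
        ref.trainable[name] = False
    return ref

policy = TinyModel({"w": 0.5})
ref = make_reference(policy)
```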
### Step 5: DPO Training
Execute the preference optimization loop using the CustomDPOTrainer. The trainer computes log-probabilities for both chosen and rejected sequences under both the policy and reference models, then optimizes the preference loss to increase the gap between chosen and rejected response likelihoods.
What happens:
- For each batch, forward passes compute log-probs for chosen and rejected under both policy and reference models
- The DPO loss is computed based on the chosen variant:
  - Sigmoid DPO: `-log(sigmoid(beta * (log_ratio_chosen - log_ratio_rejected)))`
  - ORPO: combines the SFT loss with an odds ratio preference loss
  - SimPO: length-normalized preference optimization without a reference model
- Auxiliary losses (SFT loss on chosen responses) can be added with configurable weight
- Training uses the same infrastructure as SFT (mixed precision, gradient accumulation, logging)
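The sigmoid DPO loss above can be computed directly from the four summed response log-probabilities. This is a minimal scalar sketch of the formula, not the batched trainer implementation; the argument names are illustrative.

```python
import math

def dpo_sigmoid_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard sigmoid DPO loss from summed response log-probabilities.

    log_ratio_* is the policy log-prob minus the reference log-prob; the
    chosen/rejected margin, scaled by beta, passes through -log(sigmoid(.)).
    """
    log_ratio_chosen = pol_chosen - ref_chosen
    log_ratio_rejected = pol_rejected - ref_rejected
    margin = beta * (log_ratio_chosen - log_ratio_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# When the policy prefers the chosen response more strongly than the
# reference does, the margin is positive and the loss drops below log(2).
loss = dpo_sigmoid_loss(-10.0, -14.0, -11.0, -12.0, beta=0.1)
```

At initialization the policy equals the reference, both log-ratios are zero, and the loss starts at exactly `log(2)`.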
### Step 6: Save Aligned Model
Save the preference-aligned model weights or adapter. The output represents a model that has been both SFT-tuned and preference-aligned, ready for deployment or further optimization stages.
Key considerations:
- For LoRA DPO: only the updated adapter weights are saved
- For full DPO: the complete model weights are saved
- The aligned model can be used directly for inference or exported/merged
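The adapter-vs-full branching can be sketched as a small save routine. The file names and JSON serialization here are illustrative only; real checkpoints use safetensors weights plus tokenizer and config files.

```python
import json
import os
import tempfile

def save_aligned_model(output_dir, finetuning_type, adapter_state, full_state):
    """Save only what changed: adapter weights for LoRA DPO, all weights for full DPO.

    File names and JSON format are placeholders for illustration.
    """
    os.makedirs(output_dir, exist_ok=True)
    if finetuning_type == "lora":
        path = os.path.join(output_dir, "adapter_model.json")  # placeholder name
        state = adapter_state
    else:
        path = os.path.join(output_dir, "model_state.json")    # placeholder name
        state = full_state
    with open(path, "w") as f:
        json.dump(state, f)
    return path

with tempfile.TemporaryDirectory() as d:
    saved = save_aligned_model(d, "lora", {"lora_A": [0.1]}, {"w": [1.0]})
```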