# Workflow: Hiyouga LLaMA-Factory DPO Preference Alignment
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Fine_Tuning, RLHF, DPO |
| Last Updated | 2026-02-06 19:00 GMT |
## Overview
End-to-end process for aligning a supervised fine-tuned language model with human preferences using Direct Preference Optimization (DPO) and its variants.
## Description
This workflow implements preference-based alignment without requiring a separate reward model. Starting from an SFT-tuned model, DPO directly optimizes the policy to prefer human-chosen responses over rejected ones using a classification-style loss derived from the RLHF objective. LLaMA-Factory supports multiple DPO loss variants, including standard sigmoid DPO, ORPO (odds ratio preference optimization, which combines the SFT and alignment objectives), SimPO (simple preference optimization with length normalization), and IPO. The workflow handles reference model management, pairwise data processing, and preference loss computation.
## Usage
Execute this workflow after completing supervised fine-tuning when you have a dataset of paired preferences (chosen vs. rejected responses to the same prompt) and want to improve the model's alignment with human values. This is simpler and more stable than PPO-based RLHF, requiring no reward model training.
## Execution Steps
### Step 1: Configuration
Define the DPO training job with a YAML configuration specifying the SFT-tuned model as the base, the preference dataset, DPO-specific hyperparameters (beta, loss type), and standard training settings. The configuration must set `stage: dpo` to route to the DPO workflow.
Key considerations:
- The base model should already be SFT-tuned (either a merged model or base + LoRA adapter)
- Set `pref_beta` (typically 0.1) to control the strength of the KL penalty
- Choose `pref_loss`: `sigmoid` (standard DPO), `orpo`, `simpo`, `ipo`, or other variants
- For ORPO/SimPO, no reference model is needed (self-referencing)
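A minimal configuration sketch, expressed here as a Python dict mirroring the YAML keys. The keys `stage`, `pref_beta`, and `pref_loss` come from the text above; the model path, dataset name, template, and training values are illustrative placeholders, not prescribed settings.

```python
# Illustrative DPO configuration; values marked "placeholder" are assumptions.
dpo_config = {
    # Model: start from an SFT-tuned checkpoint (merged model or base + adapter).
    "model_name_or_path": "path/to/sft-model",  # placeholder
    "stage": "dpo",                             # routes to the DPO workflow
    "finetuning_type": "lora",                  # placeholder: LoRA-based DPO
    # Data: a paired-preference dataset (chosen/rejected per prompt).
    "dataset": "dpo_en_demo",                   # placeholder dataset name
    # DPO-specific hyperparameters.
    "pref_beta": 0.1,        # strength of the KL penalty
    "pref_loss": "sigmoid",  # or "orpo", "simpo", "ipo"
    # Standard training settings (illustrative values).
    "learning_rate": 5.0e-6,
    "num_train_epochs": 1.0,
    "output_dir": "saves/dpo",
}

def validate(cfg: dict) -> None:
    """Minimal sanity checks before launching training."""
    assert cfg["stage"] == "dpo", "stage must be 'dpo' for this workflow"
    assert cfg["pref_loss"] in {"sigmoid", "orpo", "simpo", "ipo"}
    assert 0.0 < cfg["pref_beta"] <= 1.0

validate(dpo_config)
```

Serializing this dict to YAML yields a file that can be passed to the training launcher.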
### Step 2: Argument Parsing and Validation
Parse the YAML configuration into argument objects and validate DPO-specific settings. The parser ensures the dataset contains paired preference data (chosen/rejected fields), validates the loss type selection, and configures reference model behavior based on the chosen DPO variant.
What happens:
- Arguments are parsed into ModelArguments, DataArguments, Seq2SeqTrainingArguments, and FinetuningArguments
- The finetuning stage is set to "dpo", which selects the pairwise data processor
- Reference model configuration is determined (separate model for sigmoid/IPO, disabled for ORPO/SimPO)
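The reference-model decision described above can be sketched as a small predicate; this is a simplified illustration of the routing logic, not the library's actual code.

```python
def needs_reference_model(pref_loss: str) -> bool:
    """Return True when the loss variant requires a frozen reference model.

    Sigmoid DPO and IPO compare policy log-probs against a reference;
    ORPO and SimPO are self-referencing, so the reference model is disabled.
    """
    return pref_loss in {"sigmoid", "ipo"}
```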
### Step 3: Pairwise Data Loading
Load preference datasets containing paired responses and process them into the pairwise format required for DPO training. Each example consists of a prompt with both a chosen (preferred) and rejected (dispreferred) response, tokenized with proper label masking.
Key considerations:
- Data must contain chosen and rejected response pairs for each prompt
- Both Alpaca-style and ShareGPT-style preference formats are supported
- The pairwise processor creates concatenated sequences with separate label masks for chosen and rejected
- DPO demo data is provided at `data/dpo_en_demo.json` and `data/dpo_zh_demo.json`
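The pairwise processing can be illustrated with a toy record and a label-masking helper. The `instruction`/`chosen`/`rejected` field names and the masking scheme follow common convention; the exact schema of the demo files and the library's real tokenization may differ.

```python
# A toy preference record (field names are a common convention, not the
# verified schema of the demo files).
record = {
    "instruction": "Explain gradient accumulation in one sentence.",
    "chosen": "It sums gradients over several micro-batches before stepping.",
    "rejected": "It is a kind of optimizer.",
}

IGNORE_INDEX = -100  # conventional label value excluded from the loss

def build_pair(prompt_ids, chosen_ids, rejected_ids):
    """Build (input_ids, labels) for the chosen and rejected sequences.

    Prompt tokens are masked with IGNORE_INDEX so the loss (and the DPO
    log-probabilities) only cover the response tokens.
    """
    def one(resp_ids):
        input_ids = prompt_ids + resp_ids
        labels = [IGNORE_INDEX] * len(prompt_ids) + resp_ids
        return input_ids, labels
    return one(chosen_ids), one(rejected_ids)

# Tiny fake token ids, just to show the shapes.
(chosen_inp, chosen_lab), (rejected_inp, rejected_lab) = build_pair(
    [1, 2, 3], [10, 11], [20]
)
```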
### Step 4: Model and Reference Model Loading
Load the SFT-tuned model (the policy to be optimized) and create a reference model copy. The reference model provides the baseline log-probabilities needed for the DPO loss computation. For variants like ORPO and SimPO that are self-referencing, the reference model step is skipped.
What happens:
- The policy model is loaded with adapter configuration (LoRA or full)
- For standard DPO: a separate reference model is created by either loading a frozen copy or using the adapter's base model
- The reference model remains frozen throughout training
- Both models share the same tokenizer
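A toy sketch of the "frozen copy" idea: in practice the reference is a separately loaded frozen model or the LoRA adapter's base model, but the essential property — same weights, no gradient updates — can be shown with a stand-in class.

```python
import copy

class TinyModel:
    """Stand-in for a policy model: parameters plus per-parameter trainable flags."""
    def __init__(self, params):
        self.params = dict(params)
        self.trainable = {name: True for name in self.params}

def make_reference(policy: TinyModel) -> TinyModel:
    """Create a frozen copy of the policy to serve as the DPO reference.

    This toy version deep-copies the policy and freezes every parameter;
    the reference then stays unchanged for the whole training run.
    """
    ref = copy.deepcopy(policy)
    for name in ref.trainable:
        ref.trainable[name] = False
    return ref

policy = TinyModel({"w": 0.5})
ref = make_reference(policy)
```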
### Step 5: DPO Training
Execute the preference optimization loop using the CustomDPOTrainer. The trainer computes log-probabilities for both chosen and rejected sequences under both the policy and reference models, then optimizes the preference loss to increase the gap between chosen and rejected response likelihoods.
What happens:
- For each batch, forward passes compute log-probs for chosen and rejected under both policy and reference models
- The DPO loss is computed based on the chosen variant:
  - Sigmoid DPO: `-log(sigmoid(beta * (log_ratio_chosen - log_ratio_rejected)))`
  - ORPO: combines the SFT loss with an odds ratio preference loss
  - SimPO: length-normalized preference optimization without a reference model
- Auxiliary losses (SFT loss on chosen responses) can be added with configurable weight
- Training uses the same infrastructure as SFT (mixed precision, gradient accumulation, logging)
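The sigmoid DPO loss above can be computed directly from the four summed response log-probabilities. This is a minimal scalar sketch of the formula, not the batched trainer implementation; the argument names are illustrative.

```python
import math

def dpo_sigmoid_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard sigmoid DPO loss from summed response log-probabilities.

    log_ratio_* is the policy log-prob minus the reference log-prob; the
    chosen/rejected margin, scaled by beta, passes through -log(sigmoid(.)).
    """
    log_ratio_chosen = pol_chosen - ref_chosen
    log_ratio_rejected = pol_rejected - ref_rejected
    margin = beta * (log_ratio_chosen - log_ratio_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# When the policy prefers the chosen response more strongly than the
# reference does, the margin is positive and the loss drops below log(2).
loss = dpo_sigmoid_loss(-10.0, -14.0, -11.0, -12.0, beta=0.1)
```

At initialization the policy equals the reference, both log-ratios are zero, and the loss starts at exactly `log(2)`.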
### Step 6: Save Aligned Model
Save the preference-aligned model weights or adapter. The output represents a model that has been both SFT-tuned and preference-aligned, ready for deployment or further optimization stages.
Key considerations:
- For LoRA DPO: only the updated adapter weights are saved
- For full DPO: the complete model weights are saved
- The aligned model can be used directly for inference or exported/merged
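The adapter-vs-full branching can be sketched as a small save routine. The file names and JSON serialization here are illustrative only; real checkpoints use safetensors weights plus tokenizer and config files.

```python
import json
import os
import tempfile

def save_aligned_model(output_dir, finetuning_type, adapter_state, full_state):
    """Save only what changed: adapter weights for LoRA DPO, all weights for full DPO.

    File names and JSON format are placeholders for illustration.
    """
    os.makedirs(output_dir, exist_ok=True)
    if finetuning_type == "lora":
        path = os.path.join(output_dir, "adapter_model.json")  # placeholder name
        state = adapter_state
    else:
        path = os.path.join(output_dir, "model_state.json")    # placeholder name
        state = full_state
    with open(path, "w") as f:
        json.dump(state, f)
    return path

with tempfile.TemporaryDirectory() as d:
    saved = save_aligned_model(d, "lora", {"lora_A": [0.1]}, {"w": [1.0]})
```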