Workflow:Huggingface Trl Direct Preference Optimization

Knowledge Sources	HuggingFace TRL TRL DPO Trainer Docs DPO Paper
Domains	LLMs, Preference_Optimization, RLHF
Last Updated	2026-02-06 16:00 GMT

Overview

End-to-end process for aligning language models with human preferences using Direct Preference Optimization (DPO), which learns from preference pairs without requiring a separate reward model.

Description

This workflow implements offline preference optimization where the model learns directly from pairs of chosen and rejected responses. DPO reformulates the RLHF objective as a classification problem on preference pairs, eliminating the need for a separate reward model or online generation. TRL's DPOTrainer supports multiple loss variants including sigmoid (standard DPO), IPO, RPO, and f-divergence regularization (reverse KL, Jensen-Shannon, alpha-divergence). It requires a policy model and either a separate reference model or uses the PEFT base model as an implicit reference.

Usage

Execute this workflow after supervised fine-tuning when you have a preference dataset containing chosen and rejected response pairs. This is appropriate when you want to align model outputs with human preferences without the complexity of online RL methods (GRPO, PPO), or when you have a static preference dataset and want a simpler training pipeline.

Execution Steps

Step 1: Environment and Argument Configuration

Configure the DPO training run by specifying the policy model (typically an SFT-trained model), dataset source, and DPO-specific hyperparameters. Key DPO parameters include the beta coefficient (KL penalty strength) and the loss type.

Key considerations:

Learning rate is much lower than SFT: ~5e-7 for full fine-tuning, ~5e-6 for LoRA
beta controls the deviation from the reference model (default: 0.1)
loss_type selects the DPO variant: sigmoid (standard), ipo, hinge, exo, rpo_bwd, etc.
Set remove_unused_columns to False to preserve all dataset fields

Step 2: Policy Model Loading

Load the fine-tuned policy model that will be optimized. This is typically the output of an SFT training run. The model is loaded with appropriate dtype and optional quantization for memory efficiency.

Key considerations:

Use the SFT-trained model checkpoint as the starting point
Apply the same dtype and quantization settings used during SFT
If using DeepSpeed, handle bias buffer dtype mismatches for DDP compatibility

Step 3: Reference Model Setup

Set up the reference model that provides the baseline log-probabilities for the KL divergence constraint. The reference model is the frozen copy of the policy model before DPO training begins.

Key considerations:

With PEFT/LoRA: the reference model is implicit (the frozen base model), so set ref_model to None
Without PEFT: load a separate copy of the SFT model as the reference
precompute_ref_log_probs can cache reference outputs to avoid recomputation each epoch
Reference model stays frozen throughout training

Step 4: Preference Dataset Loading

Load the preference dataset containing prompt-chosen-rejected triples. Each example must have a prompt with corresponding chosen (preferred) and rejected (dispreferred) responses.

Key considerations:

Standard format: prompt, chosen, and rejected fields
Conversational format uses lists of message dicts with role/content
The trainer tokenizes and creates paired batches internally via DataCollatorForPreference
max_length controls the maximum combined sequence length (prompt + response)
truncation_mode can be "keep_end" (default) or "keep_start"

Step 5: Trainer Initialization and Training

Create the DPOTrainer with policy model, reference model, preference dataset, and configuration. The trainer handles tokenization, paired batch construction, and the preference optimization loop.

Key considerations:

The trainer computes log-probabilities for both chosen and rejected under policy and reference models
Loss is computed as: -log_sigmoid(beta * (log_ratio_chosen - log_ratio_rejected))
disable_dropout is True by default to stabilize training
Metrics include reward accuracy, chosen/rejected rewards, and reward margins

Step 6: Evaluation and Model Saving

Evaluate the trained model on a held-out preference dataset to verify alignment improvement, then save the model. Track reward margins and accuracy as key quality indicators.

Key considerations:

Monitor rewards/margins (should increase) and rewards/accuracies (should approach 1.0)
Save the model or LoRA adapters for downstream use
The aligned model can be used directly for inference or as input for further training stages

Execution Diagram

GitHub URL

Workflow Repository