Principle:Huggingface Trl DPO Argument Configuration
| Knowledge Sources | |
|---|---|
| Domains | NLP, RLHF |
| Last Updated | 2026-02-06 17:00 GMT |
Overview
Configuring hyperparameters for offline preference optimization controls how a policy model learns from human preference pairs without requiring a separate reward model.
Description
Direct Preference Optimization (DPO) replaces the traditional RLHF pipeline (reward model training followed by PPO) with a single supervised-style objective. The configuration for DPO training extends standard supervised fine-tuning parameters with preference-specific hyperparameters that govern how the model learns from chosen/rejected response pairs.
The key configuration dimensions are:
- Loss variant selection: DPO supports multiple loss formulations beyond the original sigmoid loss, including IPO, hinge (SLiC), robust DPO, EXO, NCA, BCO, SPPO, AOT, DiscoPOP, and APO variants. Each variant modifies how the preference signal is converted into a gradient update. Multiple loss types can be combined (as in MPO) with configurable weights.
- KL divergence control (beta): The beta parameter controls the strength of the KL-divergence penalty between the policy and reference model. Higher beta values constrain the policy to stay closer to the reference distribution, while lower values allow more aggressive optimization toward the preference signal. Typical values range from 0.1 to 0.5.
- f-divergence regularization: Beyond standard reverse KL divergence, DPO supports Jensen-Shannon divergence and alpha-divergence as alternative regularization functions for computing the divergence between policy and reference model distributions.
- Label smoothing: Borrowed from the cDPO report and the Robust DPO paper, this parameter (between 0.0 and 0.5) accounts for noise in preference labels by softening the binary preference signal.
- Reference model management: Configuration includes whether to precompute reference model log probabilities (an up-front pass over the dataset that removes the need to keep the reference model loaded during training, reducing memory use), whether to synchronize the reference model with the policy (TR-DPO), and the synchronization schedule.
- Sequence length management: The max_length parameter caps the total length of prompt plus completion, while truncation_mode controls whether to keep the start or end of overlong sequences.
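The dimensions above can be collected into a single settings object and sanity-checked before a run. The following is a minimal sketch, not TRL's own validation: check_dpo_settings is a hypothetical helper, the dictionary keys mirror field names as they appear in recent TRL releases (an assumption to verify against the installed version), and the "keep_start"/"keep_end" values and the positivity check on beta are likewise assumptions.

```python
# Hypothetical settings dict. Key names are assumed to match DPO config fields
# in recent TRL releases; verify them against the version you have installed.
dpo_settings = {
    "loss_type": "sigmoid",             # loss variant
    "beta": 0.1,                        # KL penalty strength
    "label_smoothing": 0.0,             # must lie in [0.0, 0.5]
    "precompute_ref_log_probs": False,  # reference model management
    "max_length": 1024,                 # cap on prompt + completion length
    "truncation_mode": "keep_end",      # assumed values: keep_start / keep_end
}


def check_dpo_settings(settings):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    if settings.get("beta", 0.1) <= 0:
        problems.append("beta must be positive")
    smoothing = settings.get("label_smoothing", 0.0)
    if not 0.0 <= smoothing <= 0.5:
        problems.append("label_smoothing must lie in [0.0, 0.5]")
    if settings.get("truncation_mode", "keep_end") not in ("keep_start", "keep_end"):
        problems.append("truncation_mode must be 'keep_start' or 'keep_end'")
    max_length = settings.get("max_length")
    if max_length is not None and max_length <= 0:
        problems.append("max_length must be positive when set")
    return problems


print(check_dpo_settings(dpo_settings))
```

If TRL is installed, a dictionary like this would typically be unpacked into the DPO configuration class alongside the usual training arguments; the helper only catches obvious range errors before that point.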
Usage
Use DPO argument configuration whenever you need to:
- Set up a DPO training run with specific hyperparameters
- Experiment with different loss variants (sigmoid, IPO, hinge, etc.)
- Tune the KL penalty strength via beta
- Configure memory-efficient training with precomputed reference log probabilities
- Combine multiple loss functions for MPO-style training
- Load configurations from YAML files or command-line arguments via TrlParser
Theoretical Basis
The DPO objective derives from the closed-form solution to the KL-constrained RLHF problem. Given a preference dataset of (prompt, chosen, rejected) triplets, the standard sigmoid DPO loss is:
L_DPO(pi_theta; pi_ref) = -E[ log sigma( beta * ( log(pi_theta(y_w|x) / pi_ref(y_w|x)) - log(pi_theta(y_l|x) / pi_ref(y_l|x)) ) ) ]
Where:
- pi_theta is the policy model being optimized
- pi_ref is the frozen reference model
- y_w and y_l are the chosen (winning) and rejected (losing) responses
- beta controls the deviation from the reference model
- sigma is the sigmoid function
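To make the formula concrete, here is a minimal sketch of the sigmoid loss for a single triplet, using only the standard library. It assumes each input is a summed sequence log-probability log pi(y|x); the function names are illustrative and are not TRL's API.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_sigmoid_loss(policy_chosen_logp, policy_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Sigmoid DPO loss for one (prompt, chosen, rejected) triplet."""
    # log(pi_theta(y_w|x) / pi_ref(y_w|x))
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    # log(pi_theta(y_l|x) / pi_ref(y_l|x))
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = chosen_logratio - rejected_logratio
    return -log_sigmoid(beta * logits)


# When the policy equals the reference, logits are 0 and the loss is log(2).
print(dpo_sigmoid_loss(-10.0, -12.0, -10.0, -12.0))
# Raising the chosen log-prob and lowering the rejected one reduces the loss.
print(dpo_sigmoid_loss(-8.0, -14.0, -10.0, -12.0))
```

In a real training loop the same expression is applied to batched tensors and averaged, which corresponds to the expectation in the formula.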
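The beta parameter is the coefficient on the KL penalty in the RLHF reward-KL tradeoff. It acts as a temperature in the closed-form optimal policy, which is proportional to pi_ref(y|x) * exp(r(x, y) / beta). As beta approaches 0, the KL constraint vanishes and the policy is free to move arbitrarily far from the reference; as beta approaches infinity, the policy remains fixed at the reference.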
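The two limits can be checked numerically on a toy problem. The sketch below assumes the commonly cited closed form of the KL-constrained optimum, pi*(y|x) proportional to pi_ref(y|x) * exp(r(x, y) / beta), over three candidate responses; the reference probabilities and rewards are made-up illustrative values.

```python
import math


def kl_regularized_optimum(ref_probs, rewards, beta):
    """Return pi*(y) proportional to pi_ref(y) * exp(r(y) / beta)."""
    # Work in log space and subtract the max score for numerical stability.
    scores = [math.log(p) + r / beta for p, r in zip(ref_probs, rewards)]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


ref_probs = [0.5, 0.3, 0.2]  # hypothetical reference distribution
rewards = [0.0, 1.0, 2.0]    # hypothetical rewards for the same responses

for beta in (0.05, 0.5, 50.0):
    optimum = kl_regularized_optimum(ref_probs, rewards, beta)
    print(beta, [round(p, 3) for p in optimum])
```

With a small beta almost all probability mass moves to the highest-reward response, while with a large beta the result stays close to the reference distribution.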
Label smoothing modifies the loss to be robust to noisy preferences:
L_smooth = -(1 - epsilon) * log sigma(beta * logits) - epsilon * log sigma(-beta * logits)
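where epsilon is the label_smoothing parameter and logits denotes the difference of log-ratios that appears inside the sigmoid of L_DPO, that is, log(pi_theta(y_w|x) / pi_ref(y_w|x)) - log(pi_theta(y_l|x) / pi_ref(y_l|x)). Setting epsilon to 0 recovers the standard sigmoid loss.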
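A minimal sketch of the smoothed loss follows, taking the log-ratio difference (the quantity written as logits in the formula) as a precomputed scalar. The function names are illustrative, not TRL's API.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def smoothed_dpo_loss(logits, beta=0.1, epsilon=0.0):
    """Label-smoothed sigmoid loss; epsilon is the assumed label-flip rate."""
    return (-(1.0 - epsilon) * log_sigmoid(beta * logits)
            - epsilon * log_sigmoid(-beta * logits))


for eps in (0.0, 0.1, 0.5):
    print(eps, smoothed_dpo_loss(4.0, beta=0.1, epsilon=eps))
```

At epsilon = 0.5 the two terms are weighted equally, so the loss no longer distinguishes chosen from rejected; this is one way to see why the parameter is restricted to values of at most 0.5.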
IPO uses a squared loss instead of the sigmoid:
L_IPO = (logits - 1/(2*tau))^2
where tau corresponds to the beta parameter.
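The squared form can be sketched in a few lines. As with the other examples, logits is taken to be the precomputed difference of policy-to-reference log-ratios for the chosen and rejected responses, and ipo_loss is an illustrative name rather than a TRL function.

```python
def ipo_loss(logits, tau=0.1):
    """Squared IPO loss; tau plays the role of the beta parameter."""
    target = 1.0 / (2.0 * tau)
    return (logits - target) ** 2


for logits in (0.0, 5.0, 10.0):
    print(logits, ipo_loss(logits, tau=0.1))
```

Unlike the sigmoid loss, which keeps decreasing as logits grows, this loss is minimized when logits equals 1/(2*tau) and penalizes overshooting that target as much as undershooting it.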