Principle:OpenRLHF OpenRLHF Iterative DPO
| Knowledge Sources | |
|---|---|
| Domains | Alignment, Data_Processing |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Iterative DPO is an alignment technique that builds on-policy preference pairs: it samples several responses per prompt, scores them, and keeps the best and worst as the chosen and rejected examples for the next round of DPO training.
Description
Iterative DPO extends standard DPO to work with on-policy data. For each prompt, multiple responses are generated from the current policy, scored by a reward model, and the highest-scoring becomes "chosen" while the lowest-scoring becomes "rejected." These synthetic preference pairs are used for a round of DPO training, and the process repeats.
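The pair-construction step of one round can be sketched as follows. This is a minimal illustration, not OpenRLHF's actual code: `build_preference_pairs`, `generate`, and `score` are hypothetical names, with `generate` standing in for sampling from the current policy and `score` for the reward model. Skipping prompts whose responses all score the same is an assumption made here so that every emitted pair carries a preference signal.

```python
from typing import Callable, Dict, List


def build_preference_pairs(
    prompts: List[str],
    generate: Callable[[str, int], List[str]],  # hypothetical: sample n responses from the policy
    score: Callable[[str, str], float],         # hypothetical: reward model score for (prompt, response)
    n_samples: int = 4,
) -> List[Dict[str, str]]:
    """Build one round of (prompt, chosen, rejected) pairs from on-policy samples."""
    pairs: List[Dict[str, str]] = []
    for prompt in prompts:
        responses = generate(prompt, n_samples)
        if not responses:
            continue  # nothing was sampled for this prompt
        scores = [score(prompt, response) for response in responses]
        best = max(range(len(responses)), key=scores.__getitem__)
        worst = min(range(len(responses)), key=scores.__getitem__)
        if scores[best] == scores[worst]:
            continue  # assumption: drop prompts where all responses tie
        pairs.append(
            {
                "prompt": prompt,
                "chosen": responses[best],     # highest-scoring response
                "rejected": responses[worst],  # lowest-scoring response
            }
        )
    return pairs
```

The returned pairs would then feed a round of DPO training, after which the updated policy is used as `generate` for the next round.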
Usage
Use Iterative DPO when DPO is preferred over PPO but a static, off-policy preference dataset is insufficient. Each round trains on pairs built from the current model's own generations, so the model improves progressively on data that tracks its own behavior.
Theoretical Basis
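For each prompt $x$:
- Generate $N$ responses: $y_1, \dots, y_N \sim \pi_\theta(\cdot \mid x)$
- Score each: $r_i = R(x, y_i)$
- Select chosen: $y_w = \arg\max_{y_i} R(x, y_i)$
- Select rejected: $y_l = \arg\min_{y_i} R(x, y_i)$
- DPO train on pairs $(x, y_w, y_l)$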
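For reference, the final step minimizes the standard DPO objective over the selected pairs. Writing $x$ for the prompt, $y_w$ for the chosen response, and $y_l$ for the rejected response, the loss takes the usual form below. Here $\beta$ is the DPO temperature and $\pi_{\mathrm{ref}}$ is the frozen reference policy. Whether $\pi_{\mathrm{ref}}$ is reset to the latest policy at the start of each round is not stated here and should be treated as an implementation choice.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

Minimizing this loss raises the policy's log-probability of $y_w$ relative to the reference while lowering that of $y_l$, which is how the reward model's ranking reaches the policy without a separate RL step.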