Principle:OpenRLHF OpenRLHF Iterative DPO

Knowledge Sources	Direct Preference Optimization, RLHF Workflow
Domains	Alignment, Data_Processing
Last Updated	2026-02-07 00:00 GMT

Overview

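Iterative DPO is an alignment technique that builds on-policy preference pairs: it generates multiple responses per prompt, scores them with a reward model, and uses the best and worst responses as the chosen and rejected examples for a round of DPO retraining.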

Description

Iterative DPO extends standard DPO to work with on-policy data. For each prompt, multiple responses are generated from the current policy and scored by a reward model; the highest-scoring response becomes "chosen" and the lowest-scoring response becomes "rejected." These synthetic preference pairs are used for one round of DPO training, and the process then repeats with responses sampled from the updated policy.
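The loop described above can be sketched in Python as follows. This is a minimal illustration, not OpenRLHF's implementation: `generate`, `score`, and `train_dpo` are hypothetical callables standing in for the policy sampler, the reward model, and one DPO training round, and `n_samples` and `n_rounds` are assumed parameters. Skipping prompts whose rewards all tie is also an assumption of this sketch.

```python
from typing import Callable, Dict, List

# Hypothetical interfaces (assumptions for illustration only):
Generate = Callable[[str, int], List[str]]         # (prompt, n) -> n sampled responses
Score = Callable[[str, str], float]                # (prompt, response) -> scalar reward
TrainDPO = Callable[[List[Dict[str, str]]], None]  # runs one DPO round on the pairs


def iterative_dpo(
    prompts: List[str],
    generate: Generate,
    score: Score,
    train_dpo: TrainDPO,
    n_samples: int = 4,
    n_rounds: int = 3,
) -> List[List[Dict[str, str]]]:
    """Run n_rounds of generate -> score -> select -> DPO train.

    Returns the preference pairs built in each round.
    """
    history = []
    for _ in range(n_rounds):
        pairs = []
        for prompt in prompts:
            # Sample from the *current* policy so the data stays on-policy.
            responses = generate(prompt, n_samples)
            if len(responses) < 2:
                continue  # need at least two responses to form a pair
            rewards = [score(prompt, r) for r in responses]
            best = max(range(len(responses)), key=rewards.__getitem__)
            worst = min(range(len(responses)), key=rewards.__getitem__)
            if best == worst:
                continue  # all rewards tie: no preference signal for this prompt
            pairs.append(
                {
                    "prompt": prompt,
                    "chosen": responses[best],
                    "rejected": responses[worst],
                }
            )
        # train_dpo is expected to update the policy that generate samples from.
        train_dpo(pairs)
        history.append(pairs)
    return history
```

Passing the components as callables keeps the sketch independent of any particular generation or training backend; in a real pipeline each would wrap the corresponding model call.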

Usage

Use Iterative DPO when DPO is preferred over PPO but a static, off-policy preference dataset is insufficient. Each round retrains the model on preference pairs built from its own generations, so the model improves progressively.

Theoretical Basis

For each prompt $x$:

Generate $N$ responses: $y_{1}, \ldots, y_{N} \sim \pi_{\theta}(\cdot \mid x)$
Score each: $r_{i} = R(x, y_{i})$
Select chosen: $y_{w} = y_{i^{*}}$, where $i^{*} = \arg\max_{i} r_{i}$
Select rejected: $y_{l} = y_{j^{*}}$, where $j^{*} = \arg\min_{i} r_{i}$
Train with DPO on the resulting $(x, y_{w}, y_{l})$ pairs
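The final step trains on each selected pair with the DPO objective, which this page does not spell out. As a sketch, the standard per-pair DPO loss is $-\log \sigma\big(\beta\,[(\log \pi_{\theta}(y_{w} \mid x) - \log \pi_{\mathrm{ref}}(y_{w} \mid x)) - (\log \pi_{\theta}(y_{l} \mid x) - \log \pi_{\mathrm{ref}}(y_{l} \mid x))]\big)$, computed from sequence log-probabilities under the policy and a frozen reference model $\pi_{\mathrm{ref}}$. The function name, argument names, and the default `beta=0.1` below are illustrative assumptions, not values taken from OpenRLHF.

```python
import math


def dpo_pair_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed default; beta scales the log-ratio margin
) -> float:
    """Standard DPO loss for a single (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response
    under the policy or the frozen reference model.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) equals softplus(-margin); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and reference agree on both responses, the margin is zero and the loss equals $\log 2$; the loss falls below that as the policy raises the chosen response's likelihood relative to the rejected one.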

Related Pages

Implemented By

Implementation:OpenRLHF_OpenRLHF_Iterative_dpo_processor
