Workflow:OpenRLHF OpenRLHF Iterative DPO
| Knowledge Sources | |
|---|---|
| Domains | LLMs, RLHF, DPO, Online_Learning |
| Last Updated | 2026-02-07 10:00 GMT |
Overview
Online iterative DPO pipeline that generates on-policy data each iteration, scores and ranks responses to create preference pairs, and trains with DPO.
Description
This workflow extends standard offline DPO to an online iterative setting. In each iteration, the current policy generates multiple candidate responses per prompt, a reward model scores them, and the best and worst responses are paired to create fresh preference data. The policy is then trained using the DPO objective on these newly created preference pairs, with the previous iteration model serving as the reference. This online approach addresses the distribution shift problem of offline DPO by ensuring training data always comes from the current policy.
Usage
Use this workflow when you want the alignment benefits of DPO but need to address the mismatch between the training data distribution and the current policy. Iterative DPO is suitable when you have a reward model but want a simpler training loop than PPO, while still benefiting from on-policy data generation. It tends to help most when the initial preference dataset does not closely match the model's own output distribution; in that setting it can converge faster and reach better results than offline DPO.
Execution Steps
Step 1: Prepare initial models and data
Load the SFT-trained policy model and the reward model. Prepare the prompt dataset for candidate generation. Initialize the reference model (same as the initial policy for the first iteration).
Key considerations:
- The reference model is updated to the previous iteration's policy at each round
- The reward model remains fixed across all iterations
- Prompt diversity is critical for broad coverage
Step 2: Generate candidate responses
Using vLLM batch inference, generate multiple candidate responses per prompt (typically N=8) with high-temperature sampling. The candidates provide the raw material for constructing preference pairs.
Key considerations:
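- Higher N values tend to widen the reward gap between the best and worst candidate, giving cleaner preference pairs, but require proportionally more generation and scoring compute
- A temperature of 1.0 is a typical choice for diverse samples
- Each prompt yields a single preference pair, so generate for at least as many prompts as the number of training pairs needed per iteration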
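The sampling step can be sketched roughly as follows. This is a minimal illustration, not the OpenRLHF implementation: `generate_fn` is a hypothetical callable standing in for a batched vLLM call, and `sample_candidates` and `prompts_needed` are names invented for this example.

```python
from typing import Callable, Dict, List

# Hypothetical signature: (prompt, n, temperature) -> list of n completions.
# In practice this would wrap a batched vLLM call; it is a plain callable
# here so the sketch stays self-contained.
GenerateFn = Callable[[str, int, float], List[str]]


def sample_candidates(
    prompts: List[str],
    generate_fn: GenerateFn,
    n: int = 8,
    temperature: float = 1.0,
) -> List[Dict[str, str]]:
    """Return one flat record per (prompt, candidate) combination."""
    records = []
    for prompt in prompts:
        completions = generate_fn(prompt, n, temperature)
        if len(completions) != n:
            raise ValueError(f"expected {n} completions, got {len(completions)}")
        for text in completions:
            records.append({"prompt": prompt, "response": text})
    return records


def prompts_needed(train_batch_size: int, steps_per_iteration: int) -> int:
    """Each prompt yields at most one pair, so one prompt is needed per pair."""
    return train_batch_size * steps_per_iteration
```

The flat record layout is a design choice for this sketch: it lets the scoring step treat every candidate uniformly and regroup by prompt afterwards.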
Step 3: Score and create preference pairs
Run batch reward model inference on all candidates. Apply the iterative DPO post-processor to select the highest-scored response as "chosen" and the lowest-scored as "rejected" for each prompt, forming fresh preference pairs.
Key considerations:
- The iter_dpo post-processor creates paired chosen/rejected data from scored candidates
- Only the most and least preferred responses are selected per prompt
- The quality gap between chosen and rejected affects training signal strength
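The best-versus-worst selection described above might look like the sketch below. It is an illustration of the idea rather than the actual iter_dpo post-processor; the record keys (`prompt`, `response`, `reward`) and the `min_gap` filter are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, List


def build_preference_pairs(scored: List[Dict], min_gap: float = 0.0) -> List[Dict]:
    """Pick the highest- and lowest-reward response for each prompt.

    `scored` holds records with "prompt", "response", and "reward" keys.
    Prompts whose best-minus-worst reward gap does not exceed `min_gap`
    are skipped (an assumption of this sketch), since they carry little
    training signal.
    """
    by_prompt = defaultdict(list)
    for rec in scored:
        by_prompt[rec["prompt"]].append(rec)

    pairs = []
    for prompt, candidates in by_prompt.items():
        best = max(candidates, key=lambda r: r["reward"])
        worst = min(candidates, key=lambda r: r["reward"])
        if best["reward"] - worst["reward"] <= min_gap:
            continue  # all candidates scored (nearly) the same
        pairs.append(
            {
                "prompt": prompt,
                "chosen": best["response"],
                "rejected": worst["response"],
            }
        )
    return pairs
```

Raising `min_gap` trades dataset size for a stronger per-pair signal; with the default of 0.0 only exact ties are dropped.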
Step 4: Train with DPO objective
Train the policy model on the newly created preference pairs using the DPO objective. Use the previous iteration's model as the reference for the KL constraint. Apply standard DPO hyperparameters (beta, learning rate).
Key considerations:
- The reference model should be the model from the previous iteration
- Typical DPO hyperparameters are a reasonable starting point (beta=0.1, lr=5e-7)
- Training for 1 epoch per iteration helps limit overfitting on generated data
- For larger models, ZeRO-3 is often needed to fit both the policy and the reference model in GPU memory
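For reference, the per-pair DPO objective can be written in a few lines. The sketch below uses plain Python floats for clarity and assumes the inputs are summed sequence log-probabilities of each response under the policy and the reference model; a real trainer would compute the same quantity on batched tensors.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one preference pair.

    loss = -log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r)))
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)
```

At the start of each iteration the policy and reference are identical, so both log-ratios are zero and the loss starts at log 2 (about 0.693); it falls as the policy shifts probability toward the chosen response relative to the reference.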
Step 5: Iterate or terminate
Check whether the maximum number of iterations has been reached. If not, set both the policy model and the reference model to the newly trained checkpoint and return to Step 2.
Key considerations:
- Typical iteration counts range from 3 to 5 for iterative DPO
- Monitor reward scores on held-out prompts across iterations
- The reference model update is critical: it prevents the policy from diverging too far from the previous step
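The outer loop, including the reference update and held-out monitoring, can be sketched as below. Everything here is illustrative: checkpoints are represented as plain strings, `train_one_round` is a hypothetical callable covering Steps 2 to 4, and `eval_reward` is a hypothetical callable returning the mean reward on held-out prompts.

```python
from typing import Callable, List, Tuple


def run_iterative_dpo(
    initial_policy: str,
    max_iterations: int,
    train_one_round: Callable[[str, str], str],
    eval_reward: Callable[[str], float],
) -> Tuple[str, List[float]]:
    """Run the generate/score/train cycle for a fixed number of iterations.

    Returns the final checkpoint and the held-out reward after each round.
    """
    policy = initial_policy
    reference = initial_policy  # first iteration: reference equals the SFT policy
    reward_history: List[float] = []

    for _ in range(max_iterations):
        # Steps 2-4: sample from `policy`, build pairs, train against `reference`.
        new_policy = train_one_round(policy, reference)
        reward_history.append(eval_reward(new_policy))

        # Step 5: both roles move to the newly trained checkpoint.
        policy = new_policy
        reference = new_policy

    return policy, reward_history
```

Inspecting `reward_history` after the run is one simple way to notice when held-out reward has plateaued or started to drop, which may suggest stopping before the maximum iteration count.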