Workflow:OpenRLHF OpenRLHF Iterative DPO

Knowledge Sources	OpenRLHF Hugging Face Transformers vLLM
Domains	LLMs, RLHF, DPO, Online_Learning
Last Updated	2026-02-07 10:00 GMT

Overview

Online iterative DPO pipeline that generates on-policy data each iteration, scores and ranks responses to create preference pairs, and trains with DPO.

Description

This workflow extends standard offline DPO to an online iterative setting. In each iteration, the current policy generates multiple candidate responses per prompt, a reward model scores them, and the best and worst responses are paired to create fresh preference data. The policy is then trained using the DPO objective on these newly created preference pairs, with the previous iteration model serving as the reference. This online approach addresses the distribution shift problem of offline DPO by ensuring training data always comes from the current policy.

Usage

Use this workflow when you want the alignment benefits of DPO but need to address the mismatch between the preference data and the current policy's output distribution. Iterative DPO suits settings where a reward model is available and a simpler training loop than PPO is preferred, while still benefiting from on-policy data generation. It typically converges faster than offline DPO and produces better results when the initial preference dataset does not closely match the model's own distribution.

Execution Steps

Step 1: Prepare initial models and data

Load the SFT-trained policy model and the reward model. Prepare the prompt dataset for candidate generation. Initialize the reference model (same as the initial policy for the first iteration).

Key considerations:

The reference model is updated to the previous iteration's policy at each round
The reward model remains fixed across all iterations
Prompt diversity is critical for broad coverage

Step 2: Generate candidate responses

Using vLLM batch inference, generate multiple candidate responses per prompt (typically N=8) with high-temperature sampling. The candidates provide the raw material for constructing preference pairs.

Key considerations:

Higher N values tend to widen the reward gap between the best and worst candidate, but generation and scoring cost grows with N
A temperature of 1.0 is a typical choice for diverse candidates
Generate enough rollouts to cover the training batch requirements
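
The generation step can be sketched with vLLM's offline LLM.generate API. This is a minimal sketch under stated assumptions, not OpenRLHF's own batch-inference code: the function names generate_candidates and flatten_outputs, the record layout, and max_tokens=1024 are hypothetical choices made for illustration.

```python
def flatten_outputs(request_outputs):
    """Turn vLLM RequestOutput objects into flat {prompt, response} records."""
    records = []
    for request in request_outputs:
        for completion in request.outputs:  # one entry per sampled candidate
            records.append({"prompt": request.prompt, "response": completion.text})
    return records


def generate_candidates(model_path, prompts, n=8, temperature=1.0, max_tokens=1024):
    """Sample n candidates per prompt from the current policy checkpoint.

    Hypothetical helper: max_tokens=1024 is an assumed default, not a
    value taken from the workflow description.
    """
    # Imported lazily so flatten_outputs can be used without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model_path)
    params = SamplingParams(n=n, temperature=temperature, max_tokens=max_tokens)
    return flatten_outputs(llm.generate(prompts, params))
```

Keeping one flat record per candidate makes it straightforward to hand the whole batch to the reward model in Step 3 and regroup by prompt afterwards.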

Step 3: Score and create preference pairs

Run batch reward model inference on all candidates. Apply the iterative DPO post-processor to select the highest-scored response as "chosen" and the lowest-scored as "rejected" for each prompt, forming fresh preference pairs.

Key considerations:

The iter_dpo post-processor creates paired chosen/rejected data from scored candidates
Only the most and least preferred responses are selected per prompt
The quality gap between chosen and rejected affects training signal strength
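
The best/worst selection described above can be sketched as follows. This is an illustrative stand-in, not the actual iter_dpo post-processor: the function name build_preference_pairs, the field names, and the decision to skip prompts whose candidates all tie are assumptions.

```python
from collections import defaultdict


def build_preference_pairs(scored):
    """Pick the highest- and lowest-scored response for each prompt.

    `scored` is an iterable of dicts with the (assumed) keys
    "prompt", "response", and "reward".
    """
    by_prompt = defaultdict(list)
    for row in scored:
        by_prompt[row["prompt"]].append(row)

    pairs = []
    for prompt, rows in by_prompt.items():
        best = max(rows, key=lambda r: r["reward"])
        worst = min(rows, key=lambda r: r["reward"])
        if best["reward"] == worst["reward"]:
            # No preference signal; skipping ties is an assumption of this sketch.
            continue
        pairs.append({
            "prompt": prompt,
            "chosen": best["response"],
            "rejected": worst["response"],
            "margin": best["reward"] - worst["reward"],
        })
    return pairs
```

Recording the margin alongside each pair makes it easy to inspect how strong the training signal is, or to filter out pairs whose gap is too small to be informative.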

Step 4: Train with DPO objective

Train the policy model on the newly created preference pairs using the DPO objective. Use the previous iteration's model as the reference for the KL constraint. Apply standard DPO hyperparameters (beta, learning rate).

Key considerations:

The reference model should be the model from the previous iteration
Typical DPO hyperparameters apply (beta=0.1, lr=5e-7)
Training for 1 epoch per iteration helps limit overfitting on generated data
ZeRO-3 is often needed to fit both the policy and reference models in GPU memory
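
For reference, the standard per-pair DPO loss can be written in a few lines. This is a minimal pure-Python sketch of the published DPO objective, assuming the inputs are summed sequence log-probabilities under the policy and the reference model; the function name dpo_loss is hypothetical, and a real trainer would compute this over batched tensors.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss from summed sequence log-probabilities."""
    # How much more (in log space) the policy likes each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(x)) equals softplus(-x), written here in a numerically stable form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))
```

When the policy and the reference assign identical log-probabilities, the loss equals log 2; it falls as the policy raises the chosen response relative to the rejected one more than the reference does. Beta scales how strongly that difference is rewarded.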

Step 5: Iterate or terminate

Check whether the maximum number of iterations has been reached. If not, promote the newly trained model to be both the policy and the reference model for the next round, then return to Step 2.

Key considerations:

Typical iteration counts range from 3 to 5 for iterative DPO
Monitor reward scores on held-out prompts across iterations
The reference model update is critical: it prevents the policy from diverging too far from the previous step
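
Putting the five steps together, the outer loop can be sketched as below. Every callable argument (generate, score, build_pairs, train_dpo) is a hypothetical stand-in for the corresponding stage, and max_iterations=3 is simply the low end of the range mentioned above; this is not how OpenRLHF wires the stages together.

```python
def run_iterative_dpo(policy, prompts, generate, score, build_pairs, train_dpo,
                      max_iterations=3):
    """Outer loop of iterative DPO; all callables are hypothetical stand-ins."""
    reference = policy  # first iteration: the reference equals the initial policy
    history = []
    for iteration in range(max_iterations):
        candidates = generate(policy, prompts)        # Step 2: on-policy sampling
        scored = score(candidates)                    # Step 3: fixed reward model
        pairs = build_pairs(scored)                   # Step 3: chosen/rejected pairs
        policy = train_dpo(policy, reference, pairs)  # Step 4: DPO update
        reference = policy                            # Step 5: promote the new policy
        history.append({"iteration": iteration, "num_pairs": len(pairs)})
    return policy, history
```

Passing the stages in as callables keeps the loop itself trivial to test with stubs, and makes explicit that the reward model is never updated while the reference is replaced every round.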
