Principle:OpenRLHF OpenRLHF Iterative DPO

From Leeroopedia
Revision as of 17:58, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/OpenRLHF_OpenRLHF_Iterative_DPO.md)


Knowledge Sources
Domains: Alignment, Data_Processing
Last Updated: 2026-02-07 00:00 GMT

Overview

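Iterative DPO is an alignment technique that builds on-policy preference pairs: it samples several responses per prompt, scores them with a reward model, and uses the best and worst responses as the chosen and rejected examples for the next round of DPO training.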

Description

Iterative DPO extends standard DPO to work with on-policy data. For each prompt, multiple responses are generated from the current policy and scored by a reward model; the highest-scoring response becomes "chosen" and the lowest-scoring response becomes "rejected." These synthetic preference pairs are used for one round of DPO training, and the process then repeats with the updated policy.
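The round structure described above can be sketched as a small driver loop. This is a minimal illustration, not OpenRLHF's actual code: `iterative_dpo`, `collect_pairs`, `dpo_train`, and `num_rounds` are hypothetical names introduced here, and the two callables stand in for whatever generation/scoring and DPO training routines are actually used.

```python
from typing import Callable, List, Tuple, TypeVar

Policy = TypeVar("Policy")
Pair = Tuple[str, str, str]  # (prompt, chosen, rejected)


def iterative_dpo(
    policy: Policy,
    prompts: List[str],
    collect_pairs: Callable[[Policy, List[str]], List[Pair]],
    dpo_train: Callable[[Policy, List[Pair]], Policy],
    num_rounds: int,
) -> Policy:
    """Alternate between on-policy pair collection and DPO training.

    collect_pairs and dpo_train are hypothetical callables supplied by
    the caller; they are assumptions of this sketch, not a library API.
    """
    for _ in range(num_rounds):
        # Pairs are rebuilt from the *current* policy every round,
        # which is what keeps the preference data on-policy.
        pairs = collect_pairs(policy, prompts)
        if not pairs:
            # Nothing to learn from in this round; stop early.
            break
        policy = dpo_train(policy, pairs)
    return policy
```

Passing the two stages in as callables keeps the loop independent of any particular generation backend or trainer; the early exit on an empty pair list is a design choice of this sketch rather than something the source specifies.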

Usage

Use when DPO is preferred over PPO but off-policy static preference data is insufficient. Iterative DPO progressively improves the model using its own generations.

Theoretical Basis

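For each prompt $x$:

  1. Generate $N$ responses: $y_1, \dots, y_N \sim \pi_\theta(\cdot \mid x)$
  2. Score each: $r_i = R(x, y_i)$
  3. Select chosen: $y_w = y_{i_w}$, where $i_w = \arg\max_i r_i$
  4. Select rejected: $y_l = y_{i_l}$, where $i_l = \arg\min_i r_i$
  5. DPO train on the $(x, y_w, y_l)$ pairs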
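Steps 2 to 4 above can be sketched for a single prompt as follows. This is a minimal sketch under stated assumptions: `select_pair` and `reward_fn` are hypothetical names, the responses are assumed to have been sampled already (step 1), and dropping prompts whose best and worst rewards tie is an assumption of this example, not something the source prescribes.

```python
from typing import Callable, List, Optional, Tuple


def select_pair(
    prompt: str,
    responses: List[str],
    reward_fn: Callable[[str, str], float],
) -> Optional[Tuple[str, str, str]]:
    """Return (prompt, chosen, rejected), or None if no usable pair exists.

    reward_fn is a hypothetical stand-in for the reward model R(x, y).
    """
    if len(responses) < 2:
        return None

    # Step 2: score every sampled response.
    rewards = [reward_fn(prompt, y) for y in responses]

    # Steps 3 and 4: indices of the highest and lowest reward.
    indices = range(len(rewards))
    i_w = max(indices, key=rewards.__getitem__)
    i_l = min(indices, key=rewards.__getitem__)

    # Assumption of this sketch: a tie carries no preference signal,
    # so the prompt is skipped instead of yielding an arbitrary pair.
    if rewards[i_w] == rewards[i_l]:
        return None

    return prompt, responses[i_w], responses[i_l]
```

Applying this function to every prompt and discarding the `None` results yields the $(x, y_w, y_l)$ triples consumed by the DPO training step.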

Related Pages

Implemented By
