
Principle:OpenRLHF Preference Dataset Construction

From Leeroopedia


Knowledge Sources
Domains Data_Processing, NLP, Reward_Modeling
Last Updated 2026-02-07 00:00 GMT

Overview

A dataset preparation technique that tokenizes paired preference data (chosen vs rejected responses) for reward model training and direct preference optimization.

Description

Preference Dataset Construction processes human preference data where each example contains a prompt with a chosen (preferred) and rejected (dispreferred) response. The dataset tokenizes both responses, handles padding asymmetry between chosen and rejected sequences, and supports two modes: RM mode (left-padded for reward scoring) and DPO mode (right-padded for log-probability computation with prompt length tracking).
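The per-example processing can be sketched as follows. This is an illustrative toy (the whitespace tokenizer and `build_example` helper are hypothetical, not OpenRLHF's actual `RewardDataset` API), showing how each preference pair yields separately tokenized chosen and rejected sequences plus a tracked prompt length:

```python
# Toy sketch of per-example preference processing. The whitespace tokenizer
# and helper names are illustrative stand-ins, not OpenRLHF's real API.

def toy_tokenize(text):
    return text.split()  # stand-in for a real subword tokenizer

def build_example(prompt, chosen, rejected):
    prompt_ids = toy_tokenize(prompt)
    # Prompt is prepended to each response; the two full sequences usually
    # differ in length, which creates the padding asymmetry at batch time.
    chosen_ids = prompt_ids + toy_tokenize(chosen)
    rejected_ids = prompt_ids + toy_tokenize(rejected)
    # The prompt length is tracked so DPO can mask prompt tokens out of the
    # log-probability computation.
    return chosen_ids, rejected_ids, len(prompt_ids)

c, r, plen = build_example("Capital of France ?", "It is Paris .", "No idea .")
print(plen, len(c), len(r))  # → 4 8 7
```

Note that the chosen and rejected sequences stay separate rather than being merged, since the contrastive losses below score each one independently.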

Usage

Use this principle when preparing data for reward model training (is_dpo=False) or DPO/iterative DPO training (is_dpo=True). The same RewardDataset class serves both use cases with the is_dpo flag controlling padding direction and output format.
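The effect of the flag on batching can be sketched with a minimal padding helper (a hypothetical function, assuming integer token ids and a dedicated pad id; OpenRLHF's real collator also builds attention masks):

```python
def pad_batch(seqs, pad_id, is_dpo):
    # RM mode (is_dpo=False): left-pad, so the final response token sits at
    # the end of every row, where the reward head reads its scalar score.
    # DPO mode (is_dpo=True): right-pad, so token positions stay aligned with
    # the prompt for per-token log-probability computation.
    max_len = max(len(s) for s in seqs)
    padded = []
    for s in seqs:
        pad = [pad_id] * (max_len - len(s))
        padded.append(s + pad if is_dpo else pad + s)
    return padded

batch = [[5, 6, 7], [8, 9]]
print(pad_batch(batch, 0, is_dpo=False))  # → [[5, 6, 7], [0, 8, 9]]
print(pad_batch(batch, 0, is_dpo=True))   # → [[5, 6, 7], [8, 9, 0]]
```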

Theoretical Basis

For reward model training: the model learns a reward function r_θ such that

P(y_w ≻ y_l | x) = σ(r_θ(x, y_w) − r_θ(x, y_l))

where y_w is the chosen response, y_l is the rejected response, and σ is the sigmoid function.

For DPO: the implicit reward is derived from the log-probability ratio between the policy π_θ and the reference model π_ref:

r(x, y) = β log( π_θ(y|x) / π_ref(y|x) )

In both cases, the dataset must provide separately tokenized chosen and rejected sequences for contrastive comparison.
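The two formulas above can be checked numerically. The reward values and log-probabilities below are made-up illustrative numbers, not outputs of any real model:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Bradley-Terry preference probability from a reward margin
# (r values are illustrative, not from a trained model):
r_chosen, r_rejected = 1.5, 0.3
p_prefer = sigmoid(r_chosen - r_rejected)  # σ(1.2) ≈ 0.7685

# DPO implicit reward from the policy/reference log-probability ratio
# (log-probs and beta are illustrative):
beta = 0.1
logp_policy, logp_ref = -12.0, -14.0
implicit_reward = beta * (logp_policy - logp_ref)  # 0.1 * 2.0 = 0.2

print(round(p_prefer, 4), round(implicit_reward, 2))
```

Both quantities are margins between the chosen and rejected sides, which is why the dataset must keep the two tokenized sequences separate and paired.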

Related Pages

Implemented By
