Principle: OpenRLHF Preference Dataset Construction
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, NLP, Reward_Modeling |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A dataset preparation technique that tokenizes paired preference data (chosen vs rejected responses) for reward model training and direct preference optimization.
Description
Preference Dataset Construction processes human preference data where each example contains a prompt with a chosen (preferred) and rejected (dispreferred) response. The dataset tokenizes both responses, handles padding asymmetry between chosen and rejected sequences, and supports two modes: RM mode (left-padded for reward scoring) and DPO mode (right-padded for log-probability computation with prompt length tracking).
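The padding asymmetry described above can be sketched as follows. This is an illustrative toy, not the actual OpenRLHF RewardDataset: a whitespace "tokenizer" stands in for a real Hugging Face tokenizer, and the function names (`encode`, `pad`, `build_pair`) are hypothetical.

```python
PAD_ID = 0

def encode(text):
    # Toy tokenizer: one integer id per whitespace token (ids >= 1).
    return [len(tok) for tok in text.split()]

def pad(ids, length, side):
    """Pad token ids to `length`; side='left' for RM-style scoring,
    side='right' for DPO-style log-probability computation."""
    padding = [PAD_ID] * (length - len(ids))
    return padding + ids if side == "left" else ids + padding

def build_pair(prompt, chosen, rejected, side):
    # Tokenize prompt+chosen and prompt+rejected independently, then
    # pad both to the longer of the two so they can be batched together.
    c = encode(prompt + " " + chosen)
    r = encode(prompt + " " + rejected)
    length = max(len(c), len(r))
    return pad(c, length, side), pad(r, length, side)
```

Left padding keeps the final token (where a reward head typically reads its score) at a fixed position; right padding keeps token positions aligned with the original sequence, which simplifies per-token log-probability masking.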
Usage
Use this principle when preparing data for reward model training (is_dpo=False) or DPO/iterative DPO training (is_dpo=True). The same RewardDataset class serves both use cases with the is_dpo flag controlling padding direction and output format.
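A minimal sketch of how a single flag can switch the output format between the two modes. The function name and tuple layout are assumptions for illustration, not OpenRLHF's actual API; the key point is that DPO mode must additionally carry the prompt length so response log-probs can later be masked to exclude the prompt region.

```python
def preference_example(prompt_ids, chosen_ids, rejected_ids, is_dpo):
    # Each training example concatenates the prompt with each response.
    chosen = prompt_ids + chosen_ids
    rejected = prompt_ids + rejected_ids
    if is_dpo:
        # DPO needs the prompt length to restrict the log-probability
        # sum to response tokens only.
        return chosen, rejected, len(prompt_ids)
    # RM mode: the reward head scores the full padded sequences directly.
    return chosen, rejected
```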
Theoretical Basis
For Reward Model training: the model learns a reward function $r_\theta$ by minimizing the Bradley-Terry pairwise loss over chosen/rejected pairs:

$$\mathcal{L}_{\mathrm{RM}} = -\,\mathbb{E}_{(x,\, y_c,\, y_r)}\left[\log \sigma\big(r_\theta(x, y_c) - r_\theta(x, y_r)\big)\right]$$

For DPO: the implicit reward is derived from the log-probability ratio between the policy and a frozen reference model:

$$\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$$
In both cases, the dataset must provide separately tokenized chosen and rejected sequences for contrastive comparison.
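The two objectives can be checked numerically. This is a scalar sketch (single pair, precomputed sequence log-probs) rather than a batched tensor implementation; the function names are hypothetical.

```python
import math

def bt_loss(r_chosen, r_rejected):
    # Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).
    # Minimized when the chosen reward exceeds the rejected reward.
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def dpo_implicit_reward(logp_policy, logp_ref, beta=0.1):
    # DPO implicit reward: beta * (log pi_theta(y|x) - log pi_ref(y|x)),
    # i.e. the log-probability ratio scaled by the KL coefficient beta.
    return beta * (logp_policy - logp_ref)
```

Because both objectives compare a chosen score against a rejected score, the dataset's job is the same in either mode: deliver the two sequences tokenized separately but aligned for contrastive comparison.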