Principle: OpenRLHF Preference Dataset Construction
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, NLP, Reward_Modeling |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A dataset preparation technique that tokenizes paired preference data (chosen vs rejected responses) for reward model training and direct preference optimization.
Description
Preference Dataset Construction processes human preference data where each example contains a prompt with a chosen (preferred) and rejected (dispreferred) response. The dataset tokenizes both responses, handles padding asymmetry between chosen and rejected sequences, and supports two modes: RM mode (left-padded for reward scoring) and DPO mode (right-padded for log-probability computation with prompt length tracking).
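The padding asymmetry described above can be sketched as follows. This is an illustrative toy, not the actual OpenRLHF RewardDataset: a whitespace "tokenizer" stands in for a real Hugging Face tokenizer, and the function names (`encode`, `pad`, `build_pair`) are hypothetical.

```python
PAD_ID = 0

def encode(text):
    # Toy tokenizer: one integer id per whitespace token (ids >= 1).
    return [len(tok) for tok in text.split()]

def pad(ids, length, side):
    """Pad token ids to `length`; side='left' for RM-style scoring,
    side='right' for DPO-style log-probability computation."""
    padding = [PAD_ID] * (length - len(ids))
    return padding + ids if side == "left" else ids + padding

def build_pair(prompt, chosen, rejected, side):
    # Tokenize prompt+chosen and prompt+rejected independently, then
    # pad both to the longer of the two so they can be batched together.
    c = encode(prompt + " " + chosen)
    r = encode(prompt + " " + rejected)
    length = max(len(c), len(r))
    return pad(c, length, side), pad(r, length, side)
```

Left padding keeps the final token (where a reward head typically reads its score) at a fixed position; right padding keeps token positions aligned with the original sequence, which simplifies per-token log-probability masking.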
Usage
Use this principle when preparing data for reward model training (is_dpo=False) or DPO/iterative DPO training (is_dpo=True). The same RewardDataset class serves both use cases with the is_dpo flag controlling padding direction and output format.
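A minimal sketch of how a single flag can switch the output format between the two modes. The function name and tuple layout are assumptions for illustration, not OpenRLHF's actual API; the key point is that DPO mode must additionally carry the prompt length so response log-probs can later be masked to exclude the prompt region.

```python
def preference_example(prompt_ids, chosen_ids, rejected_ids, is_dpo):
    # Each training example concatenates the prompt with each response.
    chosen = prompt_ids + chosen_ids
    rejected = prompt_ids + rejected_ids
    if is_dpo:
        # DPO needs the prompt length to restrict the log-probability
        # sum to response tokens only.
        return chosen, rejected, len(prompt_ids)
    # RM mode: the reward head scores the full padded sequences directly.
    return chosen, rejected
```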
Theoretical Basis
For Reward Model training: the model learns a reward function $r_\theta$ by minimizing the Bradley-Terry pairwise loss over chosen/rejected pairs:

$$\mathcal{L}_{\mathrm{RM}} = -\,\mathbb{E}_{(x,\, y_c,\, y_r)}\left[\log \sigma\big(r_\theta(x, y_c) - r_\theta(x, y_r)\big)\right]$$

For DPO: the implicit reward is derived from the log-probability ratio between the policy and a frozen reference model:

$$\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$$
In both cases, the dataset must provide separately tokenized chosen and rejected sequences for contrastive comparison.
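The two objectives can be checked numerically. This is a scalar sketch (single pair, precomputed sequence log-probs) rather than a batched tensor implementation; the function names are hypothetical.

```python
import math

def bt_loss(r_chosen, r_rejected):
    # Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).
    # Minimized when the chosen reward exceeds the rejected reward.
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def dpo_implicit_reward(logp_policy, logp_ref, beta=0.1):
    # DPO implicit reward: beta * (log pi_theta(y|x) - log pi_ref(y|x)),
    # i.e. the log-probability ratio scaled by the KL coefficient beta.
    return beta * (logp_policy - logp_ref)
```

Because both objectives compare a chosen score against a rejected score, the dataset's job is the same in either mode: deliver the two sequences tokenized separately but aligned for contrastive comparison.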