
Implementation:OpenRLHF RewardDataset init

From Leeroopedia


Knowledge Sources
Domains Data_Processing, NLP, Reward_Modeling
Last Updated 2026-02-07 00:00 GMT

Overview

A concrete OpenRLHF component for constructing paired preference datasets for reward-model and DPO training.

Description

The RewardDataset class processes preference data with chosen/rejected response pairs. It tokenizes both responses, tracks prompt lengths for DPO mode, handles margin values for margin-based losses, and applies the appropriate padding side (left for reward modeling, right for DPO). The collate function pads sequences to a uniform length within each batch.
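As a rough illustration of the padding behavior described above, a minimal sketch of batch padding with a configurable side might look like the following (this is a hypothetical helper, not OpenRLHF's actual collate code):

```python
def pad_to_uniform(sequences, pad_token_id, side="left"):
    """Pad lists of token ids to the longest length in the batch,
    returning padded ids and matching attention masks.
    side="left" mirrors reward-model mode; side="right" mirrors DPO mode."""
    max_len = max(len(s) for s in sequences)
    ids, masks = [], []
    for s in sequences:
        pad = [pad_token_id] * (max_len - len(s))
        mask = [1] * len(s)
        if side == "left":
            ids.append(pad + s)
            masks.append([0] * len(pad) + mask)
        else:
            ids.append(s + pad)
            masks.append(mask + [0] * len(pad))
    return ids, masks

ids, masks = pad_to_uniform([[5, 6], [7, 8, 9]], pad_token_id=0, side="left")
# ids   → [[0, 5, 6], [7, 8, 9]]
# masks → [[0, 1, 1], [1, 1, 1]]
```

Left padding keeps the final (scored) token at a fixed position at the end of the sequence, which is convenient for reward heads that read the last hidden state.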

Usage

Instantiate after calling blending_datasets with a preference dataset. Use is_dpo=False for reward model training and is_dpo=True for DPO training.

Code Reference

Source Location

  • Repository: OpenRLHF
  • File: openrlhf/datasets/reward_dataset.py
  • Lines: L48-200 (class), L58-100 (__init__)

Signature

class RewardDataset(Dataset):
    def __init__(
        self,
        dataset,                # datasets.Dataset: raw preference data
        tokenizer: Callable,    # tokenizer for encoding
        max_length: int,        # maximum sequence length
        strategy,               # DeepspeedStrategy
        input_template=None,    # str: prompt formatting template
        is_dpo=False,           # bool: DPO mode (right padding + prompt lens)
        num_processors=8,       # int: parallel processing workers
    ) -> None:

Import

from openrlhf.datasets import RewardDataset

I/O Contract

Inputs

Name        Type              Required  Description
dataset     datasets.Dataset  Yes       Preference data with chosen/rejected columns
tokenizer   Callable          Yes       HuggingFace tokenizer
max_length  int               Yes       Maximum sequence length
is_dpo      bool              No        Enable DPO mode with right padding (default False)

Outputs

Name                 Type   Description
__getitem__ returns  Tuple  (chosen_ids, chosen_mask, reject_ids, reject_mask, extra)
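In reward-model mode the extra slot carries an optional margin value. As a hedged sketch of the pairwise loss such a margin would feed (the actual loss lives in OpenRLHF's trainer, not in this dataset class):

```python
import math

def pairwise_rm_loss(chosen_reward, reject_reward, margin=0.0):
    """-log sigmoid(r_chosen - r_reject - margin): the standard
    Bradley-Terry pairwise ranking loss with an optional margin term.
    A sketch only; OpenRLHF's trainer implements its own version."""
    diff = chosen_reward - reject_reward - margin
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

pairwise_rm_loss(1.0, 1.0)              # equal rewards: loss = ln 2 ≈ 0.693
pairwise_rm_loss(2.0, 0.0, margin=1.0)  # gap exceeds margin: smaller loss
```

A positive margin forces the model to separate chosen and rejected rewards by at least that amount before the loss flattens out.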

Usage Examples

Reward Model Training

from openrlhf.datasets import RewardDataset
from openrlhf.datasets.utils import blending_datasets

raw_data = blending_datasets(args.dataset, strategy=strategy)
train_dataset = RewardDataset(
    raw_data, tokenizer, args.max_len, strategy, is_dpo=False
)

DPO Training

train_dataset = RewardDataset(
    raw_data, tokenizer, args.max_len, strategy, is_dpo=True
)
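In DPO mode the tracked prompt lengths let the trainer compute log-probabilities over response tokens only. With right padding the prompt always occupies positions [0, prompt_len), so a fixed offset suffices; a minimal sketch of such a mask (a hypothetical helper, not from OpenRLHF):

```python
def response_mask(total_len, prompt_len, num_pad):
    """1 for response tokens, 0 for prompt and (right-side) padding.
    Right padding keeps the prompt anchored at position 0, which is
    why DPO mode pads on the right rather than the left."""
    response_len = total_len - prompt_len - num_pad
    return [0] * prompt_len + [1] * response_len + [0] * num_pad

response_mask(6, 2, 1)  # → [0, 0, 1, 1, 1, 0]
```

With left padding, the prompt's start position would shift with each example's pad count, making per-example bookkeeping more error-prone.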

