Implementation: OpenRLHF RewardDataset __init__
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, NLP, Reward_Modeling |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A concrete OpenRLHF tool for constructing paired preference datasets used in reward-model and DPO training.
Description
The RewardDataset class processes preference data with chosen/rejected response pairs. It tokenizes both responses, tracks prompt lengths for DPO mode, handles margin values for margin-based losses, and applies appropriate padding (left for RM, right for DPO). The collate function pads sequences to uniform length within each batch.
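The batch-padding behavior described above can be sketched in plain Python. This is a simplified illustration, not the library's actual collate function; `pad_sequences` is a hypothetical helper showing left padding (reward-model batches) versus right padding (DPO batches):

```python
def pad_sequences(seqs, pad_id, side="left"):
    """Pad variable-length token-id lists to the batch max length.

    Hypothetical sketch of the collate behavior: left padding for
    reward-model batches, right padding for DPO batches. Returns the
    padded ids and matching attention masks (1 = real token, 0 = pad).
    """
    max_len = max(len(s) for s in seqs)
    padded, masks = [], []
    for s in seqs:
        pad = [pad_id] * (max_len - len(s))
        if side == "left":
            padded.append(pad + s)
            masks.append([0] * len(pad) + [1] * len(s))
        else:
            padded.append(s + pad)
            masks.append([1] * len(s) + [0] * len(pad))
    return padded, masks
```

Left padding keeps the final (scored) token aligned at the end of each sequence, which suits reward models; right padding keeps prompt positions aligned at the front, which suits DPO's prompt masking.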
Usage
Instantiate it after loading a preference dataset with `blending_datasets`. Use `is_dpo=False` for reward-model training and `is_dpo=True` for DPO training.
Code Reference
Source Location
- Repository: OpenRLHF
- File: openrlhf/datasets/reward_dataset.py
- Lines: L48-200 (class), L58-100 (__init__)
Signature
```python
class RewardDataset(Dataset):
    def __init__(
        self,
        dataset,              # datasets.Dataset: raw preference data
        tokenizer: Callable,  # tokenizer for encoding
        max_length: int,      # maximum sequence length
        strategy,             # DeepspeedStrategy
        input_template=None,  # str: prompt formatting template
        is_dpo=False,         # bool: DPO mode (right padding + prompt lens)
        num_processors=8,     # int: parallel processing workers
    ) -> None:
```
Import
```python
from openrlhf.datasets import RewardDataset
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| dataset | datasets.Dataset | Yes | Preference data with chosen/rejected columns |
| tokenizer | Callable | Yes | HuggingFace tokenizer |
| max_length | int | Yes | Maximum sequence length |
| is_dpo | bool | No | Enable DPO mode with right padding (default False) |
Outputs
| Name | Type | Description |
|---|---|---|
| __getitem__ returns | Tuple | (chosen_ids, chosen_mask, reject_ids, reject_mask, extra) |
Usage Examples
Reward Model Training
```python
from openrlhf.datasets import RewardDataset
from openrlhf.datasets.utils import blending_datasets

raw_data = blending_datasets(args.dataset, strategy=strategy)
train_dataset = RewardDataset(
    raw_data, tokenizer, args.max_len, strategy, is_dpo=False
)
```
DPO Training
```python
train_dataset = RewardDataset(
    raw_data, tokenizer, args.max_len, strategy, is_dpo=True
)
```
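The prompt-length bookkeeping that DPO mode enables can be illustrated with a minimal sketch (`response_mask` is a hypothetical helper, not OpenRLHF code): with right padding, the first `prompt_len` positions of each sequence are prompt tokens, so the DPO loss can mask them out and score only the response.

```python
def response_mask(input_ids, prompt_len):
    """Hypothetical sketch of why DPO mode tracks prompt lengths:
    right padding keeps the prompt at the front of the sequence, so a
    recorded prompt length is enough to mask prompt tokens (0) and keep
    response tokens (1) when computing the DPO loss."""
    return [0] * prompt_len + [1] * (len(input_ids) - prompt_len)
```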