Implementation: OpenRLHF RewardDataset __init__
| Knowledge Sources | |
|---|---|
| Domains | Data_Processing, NLP, Reward_Modeling |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A concrete OpenRLHF tool for constructing paired preference datasets used in reward-model and DPO training.
Description
The RewardDataset class processes preference data with chosen/rejected response pairs. It tokenizes both responses, tracks prompt lengths for DPO mode, handles margin values for margin-based losses, and applies appropriate padding (left for RM, right for DPO). The collate function pads sequences to uniform length within each batch.
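The batch-padding behavior described above can be sketched in plain Python. This is a simplified illustration, not the library's actual collate function; `pad_sequences` is a hypothetical helper showing left padding (reward-model batches) versus right padding (DPO batches):

```python
def pad_sequences(seqs, pad_id, side="left"):
    """Pad variable-length token-id lists to the batch max length.

    Hypothetical sketch of the collate behavior: left padding for
    reward-model batches, right padding for DPO batches. Returns the
    padded ids and matching attention masks (1 = real token, 0 = pad).
    """
    max_len = max(len(s) for s in seqs)
    padded, masks = [], []
    for s in seqs:
        pad = [pad_id] * (max_len - len(s))
        if side == "left":
            padded.append(pad + s)
            masks.append([0] * len(pad) + [1] * len(s))
        else:
            padded.append(s + pad)
            masks.append([1] * len(s) + [0] * len(pad))
    return padded, masks
```

Left padding keeps the final (scored) token aligned at the end of each sequence, which suits reward models; right padding keeps prompt positions aligned at the front, which suits DPO's prompt masking.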
Usage
Instantiate it after loading a preference dataset with `blending_datasets`. Use `is_dpo=False` for reward-model training and `is_dpo=True` for DPO training.
Code Reference
Source Location
- Repository: OpenRLHF
- File: openrlhf/datasets/reward_dataset.py
- Lines: L48-200 (class), L58-100 (__init__)
Signature
```python
class RewardDataset(Dataset):
    def __init__(
        self,
        dataset,              # datasets.Dataset: raw preference data
        tokenizer: Callable,  # tokenizer for encoding
        max_length: int,      # maximum sequence length
        strategy,             # DeepspeedStrategy
        input_template=None,  # str: prompt formatting template
        is_dpo=False,         # bool: DPO mode (right padding + prompt lens)
        num_processors=8,     # int: parallel processing workers
    ) -> None:
```
Import
```python
from openrlhf.datasets import RewardDataset
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| dataset | datasets.Dataset | Yes | Preference data with chosen/rejected columns |
| tokenizer | Callable | Yes | HuggingFace tokenizer |
| max_length | int | Yes | Maximum sequence length |
| is_dpo | bool | No | Enable DPO mode with right padding (default False) |
Outputs
| Name | Type | Description |
|---|---|---|
| __getitem__ returns | Tuple | (chosen_ids, chosen_mask, reject_ids, reject_mask, extra) |
Usage Examples
Reward Model Training
```python
from openrlhf.datasets import RewardDataset
from openrlhf.datasets.utils import blending_datasets

raw_data = blending_datasets(args.dataset, strategy=strategy)
train_dataset = RewardDataset(
    raw_data, tokenizer, args.max_len, strategy, is_dpo=False
)
```
DPO Training
```python
train_dataset = RewardDataset(
    raw_data, tokenizer, args.max_len, strategy, is_dpo=True
)
```
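The prompt-length bookkeeping that DPO mode enables can be illustrated with a minimal sketch (`response_mask` is a hypothetical helper, not OpenRLHF code): with right padding, the first `prompt_len` positions of each sequence are prompt tokens, so the DPO loss can mask them out and score only the response.

```python
def response_mask(input_ids, prompt_len):
    """Hypothetical sketch of why DPO mode tracks prompt lengths:
    right padding keeps the prompt at the front of the sequence, so a
    recorded prompt length is enough to mask prompt tokens (0) and keep
    response tokens (1) when computing the DPO loss."""
    return [0] * prompt_len + [1] * (len(input_ids) - prompt_len)
```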