Principle: Volcengine Verl RLHF Data Preparation
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, RLHF, Alignment |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
The process of converting human preference datasets (with chosen/rejected response pairs) into verl's standardized parquet format for reward-model-based RL training.
Description
RLHF Data Preparation handles datasets that contain human preference annotations (chosen vs. rejected responses). Unlike rule-based reward datasets where ground truth is a single answer, RLHF datasets provide pairs of responses that implicitly encode human preferences.
The preprocessing extracts the prompt from preference pairs and configures the reward model style as "model" (indicating a learned reward model will score responses at training time rather than a deterministic function).
The HH-RLHF (Helpful and Harmless RLHF) dataset from Anthropic is the canonical example, but the same pattern applies to any preference dataset.
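In HH-RLHF, each row stores the full dialogue as a single string in both the `chosen` and `rejected` fields, so the shared prompt must be carved out of the dialogue text. A minimal sketch of that extraction, assuming the published `"\n\nHuman: ...\n\nAssistant: ..."` turn markers (the `extract_prompt` helper and the choice to keep the trailing `Assistant:` marker are illustrative conventions, not verl's exact implementation):

```python
def extract_prompt(chosen: str) -> str:
    """Return the dialogue up to (and including) the final 'Assistant:' marker.

    HH-RLHF rows look like "\n\nHuman: ...\n\nAssistant: ...", possibly with
    multiple turns; everything before the last assistant reply is the prompt
    shared by the chosen and rejected responses.
    """
    marker = "\n\nAssistant:"
    idx = chosen.rfind(marker)
    if idx == -1:
        raise ValueError("no Assistant turn found in dialogue")
    return chosen[: idx + len(marker)]

sample = "\n\nHuman: How do I bake bread?\n\nAssistant: Start with flour and yeast."
prompt = extract_prompt(sample)  # everything up to the final assistant turn
```

Because the prompt is identical in both fields, extracting it from `chosen` or `rejected` gives the same result.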
Usage
Use RLHF data preparation when:
- The training objective is alignment (helpfulness, harmlessness)
- A learned reward model will provide reward signals
- The source data contains chosen/rejected response pairs
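The distinction from rule-based preparation shows up in a single field. An illustrative contrast between the two reward configurations (field names follow the snippet later in this document; the exact schema may vary by verl version):

```python
# Rule-based task: a deterministic checker compares responses against a
# stored ground-truth answer at training time.
rule_based = {"style": "rule", "ground_truth": "42"}

# Preference/alignment task: a learned reward model scores responses at
# training time, so no ground-truth answer is stored.
model_based = {"style": "model", "ground_truth": ""}
```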
Theoretical Basis
RLHF data preparation extracts the training signal from preference pairs:
# Abstract RLHF data preparation
for row in preference_dataset:
    # The prompt is shared by both responses, so either field works.
    prompt = extract_prompt(row["chosen"])  # same as extract_prompt(row["rejected"])
    # A learned reward model will score responses at training time,
    # so no deterministic ground truth is stored.
    reward_config = {"style": "model", "ground_truth": ""}
    output_row = {
        "data_source": "hh_rlhf",
        "prompt": [{"role": "user", "content": prompt}],
        "ability": "alignment",
        "reward_model": reward_config,
    }
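The abstract loop above can be made concrete. A self-contained sketch under the assumed schema (in practice the input rows would come from the Anthropic HH-RLHF dataset and the output list would be written out with something like `pandas.DataFrame(rows).to_parquet(...)`; the prompt-extraction convention here is illustrative):

```python
def preprocess(preference_dataset):
    """Convert HH-RLHF-style preference rows into verl-style row dicts."""
    marker = "\n\nAssistant:"
    rows = []
    for row in preference_dataset:
        # The prompt is the dialogue up to the final assistant reply.
        prompt = row["chosen"][: row["chosen"].rfind(marker) + len(marker)]
        rows.append({
            "data_source": "hh_rlhf",
            "prompt": [{"role": "user", "content": prompt}],
            "ability": "alignment",
            # A learned reward model scores responses at training time.
            "reward_model": {"style": "model", "ground_truth": ""},
        })
    return rows

# Tiny in-memory stand-in for the real dataset.
demo = [{
    "chosen": "\n\nHuman: Hi!\n\nAssistant: Hello, how can I help?",
    "rejected": "\n\nHuman: Hi!\n\nAssistant: Go away.",
}]
out = preprocess(demo)
```

Note that the chosen/rejected responses themselves are dropped from the output: once a learned reward model provides the training signal, only the prompt is needed at RL time.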