Implementation:Volcengine Verl HH RLHF Data Preprocessing
| Field | Value |
|---|---|
| Knowledge Sources | API Doc (verl data preprocessing) |
| Domains | Data Preprocessing, RLHF Alignment, Reward Model Configuration |
| Last Updated | 2026-02-07 |
Overview
Description
This implementation preprocesses the Dahoas/full-hh-rlhf (Helpful and Harmless RLHF) dataset into a standardized Parquet format for reinforcement learning training with verl. The script supports three output modes: SFT (supervised fine-tuning), RM (reward model), and RL (reinforcement learning). The RL preprocessing function generate_rl_dataset() transforms the raw dataset into the verl-standard schema with chat-formatted prompts, a model-based reward style, and alignment as the ability tag.
The key distinction from GSM8K preprocessing is that the reward_model.style field is set to "model" rather than "rule": reward scoring during training is handled by a learned reward model instead of rule-based evaluation. The ability field is set to "alignment" to reflect the RLHF alignment objective.
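To make the contrast concrete, here is a minimal sketch of the reward_model entries produced by the two preprocessing styles. The field values are illustrative (the GSM8K ground truth shown is invented), not taken from either script's output:

```python
# Illustrative reward_model entries (values are examples, not real dataset rows).

# GSM8K (rule-based): the trainer scores rollouts by rule-based answer checking.
gsm8k_reward = {"style": "rule", "ground_truth": "72"}

# full-hh-rlhf (model-based): a learned reward model scores rollouts;
# the stored response is a reference answer, not a matching target.
hh_rlhf_reward = {"style": "model", "ground_truth": " Here are some tips..."}

# The schemas are identical except for the style tag.
assert gsm8k_reward.keys() == hh_rlhf_reward.keys()
assert gsm8k_reward["style"] != hh_rlhf_reward["style"]
```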
Usage
Execute the script directly from the command line:
python examples/data_preprocess/full_hh_rlhf.py --split rl --local_save_dir ~/data/full_hh_rlhf
Available split modes: sft, rm, rl
Code Reference
| Attribute | Detail |
|---|---|
| Source Location | `examples/data_preprocess/full_hh_rlhf.py`, Lines 93-131 |
| Signature | `def generate_rl_dataset(target_hdfs_path_dir, local_dir="~/data/full_hh_rlhf/rl", local_dataset_path=None)` |
| Import | Script executed directly: `python examples/data_preprocess/full_hh_rlhf.py --split rl` |
I/O Contract
Inputs
| Parameter | Type | Description |
|---|---|---|
| `target_hdfs_path_dir` | str or None | HDFS target path for a remote copy; None to skip |
| `local_dir` | str | Local directory for saving the output Parquet (default: `~/data/full_hh_rlhf/rl`) |
| `local_dataset_path` | str or None | Local path to a pre-downloaded copy of the dataset |
| HuggingFace dataset | `Dahoas/full-hh-rlhf` | Source dataset with prompt, chosen, rejected, and response columns |
Outputs
| Output | Type | Description |
|---|---|---|
| `train.parquet` | Parquet file | Training split in the verl-standard RL schema |
Output column schema:
| Column | Type | Description |
|---|---|---|
| `data_source` | str | Always `"Dahoas/full-hh-rlhf"` |
| `prompt` | list[dict] | Chat-formatted prompt: `[{"role": "user", "content": prompt_text}]` |
| `ability` | str | Always `"alignment"` |
| `reward_model` | dict | `{"style": "model", "ground_truth": response}` |
| `extra_info` | dict | Contains split and index fields |
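As a quick sanity check, a record can be validated against the schema above. The validate_record helper below is a hypothetical sketch for illustration, not part of verl, and the sample record it checks is invented:

```python
def validate_record(record: dict) -> None:
    """Sketch: check one record against the verl-standard RL schema."""
    assert record["data_source"] == "Dahoas/full-hh-rlhf"
    assert isinstance(record["prompt"], list)
    assert all(msg.keys() == {"role", "content"} for msg in record["prompt"])
    assert record["ability"] == "alignment"
    assert record["reward_model"]["style"] == "model"
    assert "ground_truth" in record["reward_model"]
    assert {"split", "index"} <= record["extra_info"].keys()

# Invented sample record matching the documented schema.
validate_record({
    "data_source": "Dahoas/full-hh-rlhf",
    "prompt": [{"role": "user", "content": "\n\nHuman: Hi\n\nAssistant:"}],
    "ability": "alignment",
    "reward_model": {"style": "model", "ground_truth": " Hello!"},
    "extra_info": {"split": "train", "index": 0},
})
```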
Usage Examples
Example 1: Generate RL dataset from HH-RLHF
from examples.data_preprocess.full_hh_rlhf import generate_rl_dataset
# Generate RL dataset locally
generate_rl_dataset(
    target_hdfs_path_dir=None,
    local_dir="~/data/full_hh_rlhf/rl",
    local_dataset_path=None,
)
Example 2: Transformed data record structure
# After processing, each record looks like:
record = {
    "data_source": "Dahoas/full-hh-rlhf",
    "prompt": [
        {"role": "user", "content": "\n\nHuman: How do I make a good cup of coffee?\n\nAssistant:"}
    ],
    "ability": "alignment",
    "reward_model": {
        "style": "model",
        "ground_truth": " Here are some tips for making great coffee...",
    },
    "extra_info": {"split": "train", "index": 0},
}
Example 3: Internal make_map_fn closure
def make_map_fn(split):
    def process_fn(example, idx):
        prompt = example.pop("prompt")
        response = example.pop("response")
        data = {
            "data_source": "Dahoas/full-hh-rlhf",
            "prompt": [{"role": "user", "content": prompt}],
            "ability": "alignment",
            "reward_model": {
                "style": "model",
                "ground_truth": response,  # should not be used directly
            },
            "extra_info": {"split": split, "index": idx},
        }
        return data
    return process_fn
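The closure can be exercised without downloading the dataset by applying it to plain dicts, mirroring what datasets.Dataset.map(process_fn, with_indices=True) does inside the script. The sample rows below are invented, and make_map_fn is re-declared so the sketch is self-contained:

```python
# Self-contained sketch: re-declares make_map_fn from the script and applies it
# to two invented rows, as datasets.Dataset.map(..., with_indices=True) would.
def make_map_fn(split):
    def process_fn(example, idx):
        prompt = example.pop("prompt")
        response = example.pop("response")
        return {
            "data_source": "Dahoas/full-hh-rlhf",
            "prompt": [{"role": "user", "content": prompt}],
            "ability": "alignment",
            "reward_model": {"style": "model", "ground_truth": response},
            "extra_info": {"split": split, "index": idx},
        }
    return process_fn

rows = [  # invented rows with the dataset's prompt/response columns
    {"prompt": "\n\nHuman: Hi\n\nAssistant:", "response": " Hello!"},
    {"prompt": "\n\nHuman: Any tips?\n\nAssistant:", "response": " Sure."},
]
process_fn = make_map_fn("train")
records = [process_fn(dict(row), idx) for idx, row in enumerate(rows)]
assert records[1]["extra_info"] == {"split": "train", "index": 1}
```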
Related Pages
- Principle:Volcengine_Verl_RLHF_Data_Preparation
- examples/data_preprocess/full_hh_rlhf.py -- Source script
- Implementation:Volcengine_Verl_GSM8K_Data_Preprocessing -- Comparable math dataset preprocessing
- Implementation:Volcengine_Verl_Datasets_Load_Dataset -- Dataset loading wrapper