
Implementation:Volcengine Verl HH RLHF Data Preprocessing

From Leeroopedia


Field Value
Knowledge Sources API Doc (verl data preprocessing)
Domains Data Preprocessing, RLHF Alignment, Reward Model Configuration
Last Updated 2026-02-07

Overview

Description

This implementation preprocesses the Dahoas/full-hh-rlhf (Helpful and Harmless RLHF) dataset into a standardized Parquet format for reinforcement learning training with verl. The script supports three output modes: SFT (supervised fine-tuning), RM (reward model), and RL (reinforcement learning). The RL preprocessing function generate_rl_dataset() transforms the raw dataset into the verl-standard schema with chat-formatted prompts, a model-based reward style, and alignment as the ability tag.

The key distinction from GSM8K preprocessing is that the reward_model.style field is set to "model" rather than "rule": reward scoring during training is handled by a learned reward model instead of rule-based evaluation. The ability field is set to "alignment" to reflect the RLHF alignment objective.
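To illustrate the distinction (a sketch, not code from the script; the rule-based values are hypothetical GSM8K-style examples), the two styles differ only in the style tag, which a trainer can branch on to pick the scoring path:

```python
# Illustrative only: reward_model fields for the two scoring styles.
rule_scored = {"style": "rule", "ground_truth": "72"}  # scored by exact-match rules
model_scored = {"style": "model", "ground_truth": " Here are some tips..."}  # scored by a reward model

# Hypothetical helper: decide whether a learned reward model is needed.
def needs_reward_model(reward_model: dict) -> bool:
    return reward_model["style"] == "model"

print(needs_reward_model(model_scored))  # True
print(needs_reward_model(rule_scored))   # False
```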

Usage

Execute the script directly from the command line:

python examples/data_preprocess/full_hh_rlhf.py --split rl --local_save_dir ~/data/full_hh_rlhf

Available split modes: sft, rm, rl

Code Reference

Attribute Detail
Source Location examples/data_preprocess/full_hh_rlhf.py, Lines 93-131
Signature def generate_rl_dataset(target_hdfs_path_dir, local_dir="~/data/full_hh_rlhf/rl", local_dataset_path=None)
Import Script executed directly: python examples/data_preprocess/full_hh_rlhf.py --split rl

I/O Contract

Inputs

Parameter Type Description
target_hdfs_path_dir str or None HDFS target path for remote copy; None to skip
local_dir str Local directory for saving output Parquet (default: ~/data/full_hh_rlhf/rl)
local_dataset_path str or None Local path to a pre-downloaded copy of the dataset
HuggingFace dataset Dahoas/full-hh-rlhf Source dataset with prompt, chosen, rejected, and response columns

Outputs

Output Type Description
train.parquet Parquet file Training split in the verl-standard RL schema

Output column schema:

Column Type Description
data_source str Always "Dahoas/full-hh-rlhf"
prompt list[dict] Chat-formatted prompt: [{"role": "user", "content": prompt_text}]
ability str Always "alignment"
reward_model dict {"style": "model", "ground_truth": response}
extra_info dict Contains split and index fields

Usage Examples

Example 1: Generate RL dataset from HH-RLHF

from examples.data_preprocess.full_hh_rlhf import generate_rl_dataset

# Generate RL dataset locally
generate_rl_dataset(
    target_hdfs_path_dir=None,
    local_dir="~/data/full_hh_rlhf/rl",
    local_dataset_path=None,
)

Example 2: Transformed data record structure

# After processing, each record looks like:
record = {
    "data_source": "Dahoas/full-hh-rlhf",
    "prompt": [
        {"role": "user", "content": "\n\nHuman: How do I make a good cup of coffee?\n\nAssistant:"}
    ],
    "ability": "alignment",
    "reward_model": {
        "style": "model",
        "ground_truth": " Here are some tips for making great coffee...",
    },
    "extra_info": {"split": "train", "index": 0},
}

Example 3: Internal make_map_fn closure

def make_map_fn(split):
    def process_fn(example, idx):
        prompt = example.pop("prompt")
        response = example.pop("response")
        data = {
            "data_source": "Dahoas/full-hh-rlhf",
            "prompt": [{"role": "user", "content": prompt}],
            "ability": "alignment",
            "reward_model": {
                "style": "model",
                "ground_truth": response,  # should not be used directly
            },
            "extra_info": {"split": split, "index": idx},
        }
        return data
    return process_fn
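The closure can be exercised directly on a plain dict to see the transformation it performs (in the verl script it would typically be applied via dataset.map with with_indices=True). A minimal self-contained sketch with hypothetical example text:

```python
def make_map_fn(split):
    # Copy of the closure shown above, repeated so this example is self-contained.
    def process_fn(example, idx):
        prompt = example.pop("prompt")
        response = example.pop("response")
        return {
            "data_source": "Dahoas/full-hh-rlhf",
            "prompt": [{"role": "user", "content": prompt}],
            "ability": "alignment",
            "reward_model": {"style": "model", "ground_truth": response},
            "extra_info": {"split": split, "index": idx},
        }
    return process_fn

# Apply the closure to one hypothetical raw example.
fn = make_map_fn("train")
example = {
    "prompt": "\n\nHuman: How do I make a good cup of coffee?\n\nAssistant:",
    "response": " Here are some tips for making great coffee...",
}
out = fn(dict(example), 0)
assert out["ability"] == "alignment"
assert out["extra_info"] == {"split": "train", "index": 0}
```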
