
Implementation:Volcengine Verl HH RLHF Data Preprocessing

From Leeroopedia


Field Value
Knowledge Sources API Doc (verl data preprocessing)
Domains Data Preprocessing, RLHF Alignment, Reward Model Configuration
Last Updated 2026-02-07

Overview

Description

This implementation preprocesses the Dahoas/full-hh-rlhf (Helpful and Harmless RLHF) dataset into a standardized Parquet format for reinforcement learning training with verl. The script supports three output modes: SFT (supervised fine-tuning), RM (reward model), and RL (reinforcement learning). The RL preprocessing function generate_rl_dataset() transforms the raw dataset into the verl-standard schema with chat-formatted prompts, a model-based reward style, and alignment as the ability tag.

The key distinction from GSM8K preprocessing is that the reward_model.style field is set to "model" rather than "rule": reward scoring during training is handled by a learned reward model instead of rule-based evaluation. The ability field is set to "alignment" to reflect the RLHF alignment objective.
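To illustrate the distinction (a sketch, not code from the script; the rule-based values are hypothetical GSM8K-style examples), the two styles differ only in the style tag, which a trainer can branch on to pick the scoring path:

```python
# Illustrative only: reward_model fields for the two scoring styles.
rule_scored = {"style": "rule", "ground_truth": "72"}  # scored by exact-match rules
model_scored = {"style": "model", "ground_truth": " Here are some tips..."}  # scored by a reward model

# Hypothetical helper: decide whether a learned reward model is needed.
def needs_reward_model(reward_model: dict) -> bool:
    return reward_model["style"] == "model"

print(needs_reward_model(model_scored))  # True
print(needs_reward_model(rule_scored))   # False
```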

Usage

Execute the script directly from the command line:

python examples/data_preprocess/full_hh_rlhf.py --split rl --local_save_dir ~/data/full_hh_rlhf

Available split modes: sft, rm, rl

Code Reference

Attribute Detail
Source Location examples/data_preprocess/full_hh_rlhf.py, Lines 93-131
Signature def generate_rl_dataset(target_hdfs_path_dir, local_dir="~/data/full_hh_rlhf/rl", local_dataset_path=None)
Import Script executed directly: python examples/data_preprocess/full_hh_rlhf.py --split rl

I/O Contract

Inputs

Parameter Type Description
target_hdfs_path_dir str or None HDFS target path for remote copy; None to skip
local_dir str Local directory for saving output Parquet (default: ~/data/full_hh_rlhf/rl)
local_dataset_path str or None Local path to a pre-downloaded copy of the dataset
HuggingFace dataset Dahoas/full-hh-rlhf Source dataset with prompt, chosen, rejected, and response columns

Outputs

Output Type Description
train.parquet Parquet file Training split in the verl-standard RL schema

Output column schema:

Column Type Description
data_source str Always "Dahoas/full-hh-rlhf"
prompt list[dict] Chat-formatted prompt: [{"role": "user", "content": prompt_text}]
ability str Always "alignment"
reward_model dict {"style": "model", "ground_truth": response}
extra_info dict Contains split and index fields

Usage Examples

Example 1: Generate RL dataset from HH-RLHF

from examples.data_preprocess.full_hh_rlhf import generate_rl_dataset

# Generate RL dataset locally
generate_rl_dataset(
    target_hdfs_path_dir=None,
    local_dir="~/data/full_hh_rlhf/rl",
    local_dataset_path=None,
)

Example 2: Transformed data record structure

# After processing, each record looks like:
record = {
    "data_source": "Dahoas/full-hh-rlhf",
    "prompt": [
        {"role": "user", "content": "\n\nHuman: How do I make a good cup of coffee?\n\nAssistant:"}
    ],
    "ability": "alignment",
    "reward_model": {
        "style": "model",
        "ground_truth": " Here are some tips for making great coffee...",
    },
    "extra_info": {"split": "train", "index": 0},
}

Example 3: Internal make_map_fn closure

def make_map_fn(split):
    def process_fn(example, idx):
        prompt = example.pop("prompt")
        response = example.pop("response")
        data = {
            "data_source": "Dahoas/full-hh-rlhf",
            "prompt": [{"role": "user", "content": prompt}],
            "ability": "alignment",
            "reward_model": {
                "style": "model",
                "ground_truth": response,  # should not be used directly
            },
            "extra_info": {"split": split, "index": idx},
        }
        return data
    return process_fn
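The closure can be exercised directly on a plain dict to see the transformation it performs (in the verl script it would typically be applied via dataset.map with with_indices=True). A minimal self-contained sketch with hypothetical example text:

```python
def make_map_fn(split):
    # Copy of the closure shown above, repeated so this example is self-contained.
    def process_fn(example, idx):
        prompt = example.pop("prompt")
        response = example.pop("response")
        return {
            "data_source": "Dahoas/full-hh-rlhf",
            "prompt": [{"role": "user", "content": prompt}],
            "ability": "alignment",
            "reward_model": {"style": "model", "ground_truth": response},
            "extra_info": {"split": split, "index": idx},
        }
    return process_fn

# Apply the closure to one hypothetical raw example.
fn = make_map_fn("train")
example = {
    "prompt": "\n\nHuman: How do I make a good cup of coffee?\n\nAssistant:",
    "response": " Here are some tips for making great coffee...",
}
out = fn(dict(example), 0)
assert out["ability"] == "alignment"
assert out["extra_info"] == {"split": "train", "index": 0}
```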
