Principle: Volcengine verl Data Preparation for RL
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, Reinforcement_Learning, NLP |
| Last Updated | 2026-02-07 14:00 GMT |
Overview
The process of converting raw datasets into a standardized parquet format with prompt templates, extracted ground truth, and reward configuration for reinforcement learning training.
Description
Data Preparation for RL transforms raw HuggingFace datasets into verl's standardized schema. Each row in the output parquet file contains:
- data_source: Identifier for the dataset (used to select the appropriate reward function)
- prompt: Chat-formatted messages (OpenAI format: list of role/content dicts)
- ability: Task category tag (e.g., "math", "alignment")
- reward_model: Configuration dict specifying reward computation style and ground truth
- extra_info: Additional metadata (e.g., tool kwargs for multi-turn)
This standardization allows the same training pipeline to work across diverse tasks by decoupling data format from training logic.
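As a concrete illustration, a single output row might look like the following sketch (the dataset name, question text, and ground-truth value are hypothetical, chosen only to show the shape of the schema):

```python
# Hypothetical example of one row in verl's standardized schema.
row = {
    # Identifier used to select the matching reward function at training time.
    "data_source": "hypothetical/math-dataset",
    # OpenAI-style chat messages: a list of role/content dicts.
    "prompt": [
        {"role": "user", "content": "Natalia has 48 clips and sells half. How many remain?"},
    ],
    # Task category tag.
    "ability": "math",
    # Rule-based reward: the verifier compares model output against ground_truth.
    "reward_model": {"style": "rule", "ground_truth": "24"},
    # Free-form metadata; empty here, but may hold tool kwargs for multi-turn.
    "extra_info": {},
}
```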
Usage
Data preparation is the first step before any RL training run. Each dataset type requires its own preprocessing script that handles:
- Extracting questions/prompts and formatting them as chat messages
- Parsing answers/solutions to extract verifiable ground truth
- Configuring the reward mechanism (rule-based vs. model-based)
- Splitting into train/test sets and exporting to parquet
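The steps above can be sketched as a per-row transform. This is a minimal illustration, not verl's actual preprocessing code: the column names `question`/`answer`, the `#### <answer>` solution format, and the dataset name are all assumptions.

```python
import re

def make_map_fn(split, data_source="hypothetical/math-dataset"):
    """Build a per-row transform producing verl's standardized schema.

    Assumes rows have 'question' and 'answer' columns, and that solutions
    end with a '#### <answer>' marker (a common math-dataset convention).
    """
    def process(row, idx):
        # Parse a verifiable ground truth out of the solution text.
        match = re.search(r"####\s*([\-0-9.,]+)", row["answer"])
        ground_truth = match.group(1).replace(",", "") if match else row["answer"].strip()
        return {
            "data_source": data_source,
            # Format the raw question as OpenAI-style chat messages.
            "prompt": [{"role": "user", "content": row["question"]}],
            "ability": "math",
            # Rule-based reward: verified against the extracted ground truth.
            "reward_model": {"style": "rule", "ground_truth": ground_truth},
            "extra_info": {"split": split, "index": idx},
        }
    return process

# With HuggingFace datasets, this would plug in roughly as:
#   dataset = load_dataset("hypothetical/math-dataset", split="train")
#   dataset = dataset.map(make_map_fn("train"), with_indices=True)
#   dataset.to_parquet("train.parquet")
```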
Theoretical Basis
The data preparation pipeline follows a functional transformation pattern:
```python
# Abstract data preparation pipeline (HuggingFace datasets API)
raw_dataset = load_dataset(source)
processed = raw_dataset.map(
    lambda row: {
        "data_source": dataset_name,
        "prompt": format_as_chat(row["question"]),
        "ability": task_category,
        "reward_model": {"style": "rule", "ground_truth": extract_answer(row)},
        "extra_info": {},
    }
)
processed.to_parquet(output_path)
```