Implementation:Volcengine Verl GSM8K Data Preprocessing
| Field | Value |
|---|---|
| Knowledge Sources | API Doc (verl data preprocessing) |
| Domains | Data Preprocessing, Math Reasoning, Reward Model Configuration |
| Last Updated | 2026-02-07 |
Overview
Description
This implementation preprocesses the GSM8K (Grade School Math 8K) dataset into a standardized Parquet format suitable for reinforcement learning training with verl. The script loads the openai/gsm8k dataset from HuggingFace, extracts ground-truth numeric solutions using a regex pattern, constructs chat-formatted prompts with an instruction-following suffix, and writes the results to train.parquet and test.parquet.
The key function extract_solution(solution_str) uses the regex pattern r"#### (\-?[0-9\.\,]+)" to locate the final numeric answer in each GSM8K solution string. The make_map_fn(split) closure returns a process_fn(example, idx) that transforms each raw example into the verl-standard schema with fields for data source, chat-style prompt, ability tag, reward model configuration, and extra metadata.
Usage
Execute the script directly from the command line:
python examples/data_preprocess/gsm8k.py --local_save_dir ~/data/gsm8k
Optional arguments:
- --local_dataset_path -- Path to a local copy of the GSM8K dataset
- --hdfs_dir -- HDFS path for remote storage copy
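The three flags above could be wired up roughly as follows. This is a minimal sketch rather than the script's actual parser: the flag names and the default for --local_save_dir come from this page, while the help strings and the build_parser helper name are assumptions made for illustration.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    # build_parser is a hypothetical helper; only the flag names and the
    # --local_save_dir default are taken from this page.
    parser = argparse.ArgumentParser(description="Preprocess GSM8K to Parquet")
    parser.add_argument("--local_save_dir", default="~/data/gsm8k",
                        help="Directory for train.parquet and test.parquet")
    parser.add_argument("--local_dataset_path", default=None,
                        help="Optional local copy of the GSM8K dataset")
    parser.add_argument("--hdfs_dir", default=None,
                        help="Optional HDFS directory for a remote copy")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["--local_save_dir", "~/data/gsm8k"])

# Expand "~" explicitly rather than relying on downstream APIs to do it.
save_dir = os.path.expanduser(args.local_save_dir)
print(save_dir)
```

Both optional flags default to None here, so later steps can skip the local-dataset and HDFS branches with a simple `is not None` check.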
Code Reference
| Attribute | Detail |
|---|---|
| Source Location | examples/data_preprocess/gsm8k.py, Lines 27-105 |
| Signature (extract) | def extract_solution(solution_str) -> str |
| Signature (map fn) | def make_map_fn(split) -> Callable[[dict, int], dict] |
| Import | Script executed directly: python examples/data_preprocess/gsm8k.py |
I/O Contract
Inputs
| Parameter | Type | Description |
|---|---|---|
| --local_save_dir | str | Directory where output Parquet files are saved (default: ~/data/gsm8k) |
| --local_dataset_path | str (optional) | Local path to a pre-downloaded GSM8K dataset |
| --hdfs_dir | str (optional) | HDFS directory for remote copy of output files |
| HuggingFace dataset | openai/gsm8k | Source dataset with question and answer columns |
Outputs
| Output | Type | Description |
|---|---|---|
| train.parquet | Parquet file | Training split with standardized columns |
| test.parquet | Parquet file | Test split with standardized columns |
Output column schema:
| Column | Type | Description |
|---|---|---|
| data_source | str | Always "openai/gsm8k" |
| prompt | list[dict] | Chat-formatted prompt: [{"role": "user", "content": question + instruction}] |
| ability | str | Always "math" |
| reward_model | dict | {"style": "rule", "ground_truth": solution} |
| extra_info | dict | Contains split, index, answer (raw), question (raw) |
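The schema above can be checked mechanically before training. The following is a minimal sketch, not part of verl: validate_record is a hypothetical helper, and the sample record is a made-up toy example that follows the column table on this page.

```python
def validate_record(record: dict) -> list:
    """Return a list of schema problems; an empty list means the record conforms."""
    problems = []
    # Top-level columns and their expected Python types, per the schema table.
    expected_types = {
        "data_source": str,
        "prompt": list,
        "ability": str,
        "reward_model": dict,
        "extra_info": dict,
    }
    for column, expected in expected_types.items():
        if column not in record:
            problems.append(f"missing column: {column}")
        elif not isinstance(record[column], expected):
            problems.append(f"{column} should be {expected.__name__}")
    if problems:
        return problems  # Nested checks assume the top level is sound.
    for message in record["prompt"]:
        if not isinstance(message, dict) or not {"role", "content"} <= set(message):
            problems.append("prompt entries need 'role' and 'content' keys")
    for key in ("style", "ground_truth"):
        if key not in record["reward_model"]:
            problems.append(f"reward_model missing key: {key}")
    for key in ("split", "index", "answer", "question"):
        if key not in record["extra_info"]:
            problems.append(f"extra_info missing key: {key}")
    return problems


# Toy record for illustration only (not taken from the dataset).
sample = {
    "data_source": "openai/gsm8k",
    "prompt": [{"role": "user", "content": "What is 3 + 4?"}],
    "ability": "math",
    "reward_model": {"style": "rule", "ground_truth": "7"},
    "extra_info": {"split": "train", "index": 0,
                   "answer": "3 + 4 = 7\n#### 7", "question": "What is 3 + 4?"},
}
print(validate_record(sample))
```

Returning a list of problems instead of raising on the first one makes it easier to report every issue in a malformed record at once.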
Usage Examples
Example 1: Extract numeric solution from GSM8K answer string
import re


def extract_solution(solution_str):
    # Find the final answer marker, e.g. "#### 1,250" or "#### -3.5".
    solution = re.search(r"#### (\-?[0-9\.\,]+)", solution_str)
    assert solution is not None
    final_solution = solution.group(0)
    # Drop the "#### " prefix and strip thousands separators.
    final_solution = final_solution.split("#### ")[1].replace(",", "")
    return final_solution


answer = "The total is 3 + 4 = 7\n#### 7"
print(extract_solution(answer))  # Output: 7
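The regex also accepts a leading minus sign, decimal points, and thousands separators. The sketch below exercises those cases with made-up answer strings. It uses match.group(1) instead of the group(0)-then-split form; that is an equivalent variant chosen for brevity, not the script's own wording.

```python
import re


def extract_solution(solution_str: str) -> str:
    # Same pattern as in Example 1; group(1) is the captured number.
    match = re.search(r"#### (\-?[0-9\.\,]+)", solution_str)
    assert match is not None, "no '#### <number>' marker found"
    # Remove thousands separators so "1,250" becomes "1250".
    return match.group(1).replace(",", "")


# Made-up answer strings covering the character classes in the pattern.
print(extract_solution("She earns 1,250 dollars.\n#### 1,250"))
print(extract_solution("The change is negative.\n#### -3.5"))

# A string without the marker fails the assertion instead of returning None.
try:
    extract_solution("No final answer marker here")
except AssertionError as err:
    print("rejected:", err)
```

Note that the result is always a string, so "-3.5" and "1250" are stored as text in reward_model.ground_truth rather than as numbers.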
Example 2: Transformed data record structure
# After processing, each record looks like:
record = {
    "data_source": "openai/gsm8k",
    "prompt": [
        {
            "role": "user",
            "content": 'Janet has 3 apples... Let\'s think step by step and output the final answer after "####".',
        }
    ],
    "ability": "math",
    "reward_model": {"style": "rule", "ground_truth": "7"},
    "extra_info": {
        "split": "train",
        "index": 0,
        "answer": "The total is 3 + 4 = 7\n#### 7",
        "question": "Janet has 3 apples...",
    },
}
Example 3: Full preprocessing pipeline
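import os

import datasets
# Imported for the optional HDFS copy step (--hdfs_dir), which this excerpt omits.
from verl.utils.hdfs_io import copy, makedirs

# extract_solution is the function defined in Example 1.
dataset = datasets.load_dataset("openai/gsm8k", "main")
train_dataset = dataset["train"]

instruction_following = 'Let\'s think step by step and output the final answer after "####".'

def make_map_fn(split):
    def process_fn(example, idx):
        # Keep the raw question for extra_info; only the prompt gets the suffix.
        question_raw = example.pop("question")
        question = question_raw + " " + instruction_following
        answer_raw = example.pop("answer")
        solution = extract_solution(answer_raw)
        return {
            "data_source": "openai/gsm8k",
            "prompt": [{"role": "user", "content": question}],
            "ability": "math",
            "reward_model": {"style": "rule", "ground_truth": solution},
            "extra_info": {"split": split, "index": idx, "answer": answer_raw, "question": question_raw},
        }
    return process_fn

train_dataset = train_dataset.map(function=make_map_fn("train"), with_indices=True)

# Expand "~" explicitly rather than relying on the writer to do it.
local_save_dir = os.path.expanduser("~/data/gsm8k")
os.makedirs(local_save_dir, exist_ok=True)
train_dataset.to_parquet(os.path.join(local_save_dir, "train.parquet"))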
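To sanity-check the mapping closure without downloading the dataset, process_fn can be applied to a plain dict. This is a self-contained sketch under two assumptions: the sample question is a made-up toy example, and extra_info stores the raw question, as the output column schema on this page states.

```python
import re

INSTRUCTION = 'Let\'s think step by step and output the final answer after "####".'


def extract_solution(solution_str):
    match = re.search(r"#### (\-?[0-9\.\,]+)", solution_str)
    assert match is not None
    return match.group(1).replace(",", "")


def make_map_fn(split):
    # The closure captures `split`, so one factory serves both train and test.
    def process_fn(example, idx):
        question_raw = example.pop("question")
        answer_raw = example.pop("answer")
        return {
            "data_source": "openai/gsm8k",
            "prompt": [{"role": "user", "content": question_raw + " " + INSTRUCTION}],
            "ability": "math",
            "reward_model": {"style": "rule", "ground_truth": extract_solution(answer_raw)},
            "extra_info": {"split": split, "index": idx,
                           "answer": answer_raw, "question": question_raw},
        }
    return process_fn


# Toy example for illustration only (not taken from the dataset).
sample = {
    "question": "Janet has 3 apples and buys 4 more. How many apples does she have?",
    "answer": "The total is 3 + 4 = 7\n#### 7",
}
# Pass a copy because process_fn pops keys from its input.
record = make_map_fn("train")(dict(sample), 0)
print(record["reward_model"])
print(record["extra_info"]["split"], record["extra_info"]["index"])
```

Running the function on a single in-memory example is a quick way to confirm the prompt suffix and ground-truth extraction before mapping over the full split.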
Related Pages
- Principle:Volcengine_Verl_Data_Preparation_For_RL
- examples/data_preprocess/gsm8k.py -- Source script
- Implementation:Volcengine_Verl_Dataset_To_Parquet -- Parquet export wrapper
- Implementation:Volcengine_Verl_Datasets_Load_Dataset -- Dataset loading wrapper