Implementation:Volcengine Verl GSM8K Data Preprocessing

Field	Value
Knowledge Sources	API Doc (verl data preprocessing)
Domains	Data Preprocessing, Math Reasoning, Reward Model Configuration
Last Updated	2026-02-07

Overview

Description

This implementation preprocesses the GSM8K (Grade School Math 8K) dataset into a standardized Parquet format suitable for reinforcement learning training with verl. The script loads the openai/gsm8k dataset from HuggingFace, extracts ground-truth numeric solutions using a regex pattern, constructs chat-formatted prompts with an instruction-following suffix, and writes the results to train.parquet and test.parquet.

The key function extract_solution(solution_str) uses the regex pattern r"#### (\-?[0-9\.\,]+)" to locate the final numeric answer in each GSM8K solution string. The make_map_fn(split) closure returns a process_fn(example, idx) that transforms each raw example into the verl-standard schema with fields for data source, chat-style prompt, ability tag, reward model configuration, and extra metadata.

Usage

Execute the script directly from the command line:

python examples/data_preprocess/gsm8k.py --local_save_dir ~/data/gsm8k

Optional arguments:

--local_dataset_path -- Path to a local copy of the GSM8K dataset
--hdfs_dir -- HDFS path for remote storage copy

Code Reference

Attribute	Detail
Source Location	`examples/data_preprocess/gsm8k.py`, Lines 27-105
Signature (extract)	`def extract_solution(solution_str) -> str`
Signature (map fn)	`def make_map_fn(split) -> Callable[[dict, int], dict]`
Import	Script executed directly: `python examples/data_preprocess/gsm8k.py`

I/O Contract

Inputs

Parameter	Type	Description
`--local_save_dir`	`str`	Directory where output Parquet files are saved (default: `~/data/gsm8k`)
`--local_dataset_path`	`str` (optional)	Local path to a pre-downloaded GSM8K dataset
`--hdfs_dir`	`str` (optional)	HDFS directory for remote copy of output files
HuggingFace dataset	`openai/gsm8k`	Source dataset with `question` and `answer` columns

Outputs

Output	Type	Description
`train.parquet`	Parquet file	Training split with standardized columns
`test.parquet`	Parquet file	Test split with standardized columns

Output column schema:

Column	Type	Description
`data_source`	`str`	Always `"openai/gsm8k"`
`prompt`	`list[dict]`	Chat-formatted prompt: `[{"role": "user", "content": question + instruction}]`
`ability`	`str`	Always `"math"`
`reward_model`	`dict`	`{"style": "rule", "ground_truth": solution}`
`extra_info`	`dict`	Contains `split`, `index`, `answer` (raw), `question` (raw)
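
The schema above can be checked on an in-memory record with a small helper. The check_record function below is hypothetical (it is not part of verl) and only sketches the column names and types listed in the table; records read back from Parquet through other libraries may use different container types, so treat it as a check for plain Python dicts.

```python
def check_record(record):
    """Return a list of schema problems; an empty list means the record looks valid."""
    problems = []
    expected = {
        "data_source": str,
        "prompt": list,
        "ability": str,
        "reward_model": dict,
        "extra_info": dict,
    }
    # Top-level columns and their container types.
    for key, typ in expected.items():
        if key not in record:
            problems.append(f"missing column: {key}")
        elif not isinstance(record[key], typ):
            problems.append(f"{key} should be {typ.__name__}")
    if problems:
        return problems
    # Nested keys described in the table.
    for message in record["prompt"]:
        if not isinstance(message, dict) or not {"role", "content"} <= set(message):
            problems.append("prompt entries need 'role' and 'content'")
    if not {"style", "ground_truth"} <= set(record["reward_model"]):
        problems.append("reward_model needs 'style' and 'ground_truth'")
    if not {"split", "index", "answer", "question"} <= set(record["extra_info"]):
        problems.append("extra_info needs 'split', 'index', 'answer', 'question'")
    return problems


sample = {
    "data_source": "openai/gsm8k",
    "prompt": [{"role": "user", "content": "2 + 2? Let's think step by step."}],
    "ability": "math",
    "reward_model": {"style": "rule", "ground_truth": "4"},
    "extra_info": {"split": "train", "index": 0, "answer": "#### 4", "question": "2 + 2?"},
}
print(check_record(sample))
```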

Usage Examples

Example 1: Extract numeric solution from GSM8K answer string

import re

def extract_solution(solution_str):
    solution = re.search(r"#### (\-?[0-9\.\,]+)", solution_str)
    assert solution is not None
    final_solution = solution.group(0)
    final_solution = final_solution.split("#### ")[1].replace(",", "")
    return final_solution

answer = "The total is 3 + 4 = 7\n#### 7"
print(extract_solution(answer))  # prints 7 (the return value is the string "7")
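
A few more inputs show how the pattern treats thousands separators, negative numbers, and decimals. The function is repeated from Example 1 so the block runs on its own; the answer strings are made up for illustration and are not taken from the dataset.

```python
import re


def extract_solution(solution_str):
    # Same logic as Example 1: find "#### <number>" and strip thousands separators.
    solution = re.search(r"#### (\-?[0-9\.\,]+)", solution_str)
    assert solution is not None
    final_solution = solution.group(0)
    final_solution = final_solution.split("#### ")[1].replace(",", "")
    return final_solution


# Made-up answer strings covering the character classes in the pattern.
print(extract_solution("5 * 250 = 1250\n#### 1,250"))  # comma removed
print(extract_solution("2 - 7 = -5\n#### -5"))         # leading minus kept
print(extract_solution("1 / 2 = 0.5\n#### 0.5"))       # decimal point kept

# A string without the "#### " marker trips the assertion.
try:
    extract_solution("no final answer here")
except AssertionError:
    print("no '#### ' marker found")
```

Because the character class also admits "." and ",", punctuation that directly follows the number is captured too: "#### 7." yields "7." rather than "7".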

Example 2: Transformed data record structure

# After processing, each record looks like:
record = {
    "data_source": "openai/gsm8k",
    "prompt": [
        {
            "role": "user",
            "content": 'Janet has 3 apples... Let\'s think step by step and output the final answer after "####".',
        }
    ],
    "ability": "math",
    "reward_model": {"style": "rule", "ground_truth": "7"},
    "extra_info": {
        "split": "train",
        "index": 0,
        "answer": "The total is 3 + 4 = 7\n#### 7",
        "question": "Janet has 3 apples...",
    },
}

Example 3: Full preprocessing pipeline

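import os
import re

import datasets


def extract_solution(solution_str):
    # Same helper as in Example 1: pull the number that follows "#### ".
    solution = re.search(r"#### (\-?[0-9\.\,]+)", solution_str)
    assert solution is not None
    return solution.group(0).split("#### ")[1].replace(",", "")


def make_map_fn(split):
    def process_fn(example, idx):
        question_raw = example.pop("question")
        # The prompt is the raw question plus the instruction suffix.
        question = question_raw + ' Let\'s think step by step and output the final answer after "####".'
        answer_raw = example.pop("answer")
        solution = extract_solution(answer_raw)
        return {
            "data_source": "openai/gsm8k",
            "prompt": [{"role": "user", "content": question}],
            "ability": "math",
            "reward_model": {"style": "rule", "ground_truth": solution},
            # extra_info keeps the raw question and answer, as in the output schema.
            "extra_info": {"split": split, "index": idx, "answer": answer_raw, "question": question_raw},
        }
    return process_fn


dataset = datasets.load_dataset("openai/gsm8k", "main")

# Expand "~" explicitly and make sure the output directory exists.
save_dir = os.path.expanduser("~/data/gsm8k")
os.makedirs(save_dir, exist_ok=True)

# Process both splits so that train.parquet and test.parquet are written.
for split in ("train", "test"):
    processed = dataset[split].map(function=make_map_fn(split), with_indices=True)
    processed.to_parquet(os.path.join(save_dir, f"{split}.parquet"))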
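
To try the mapping logic without downloading the dataset, the map function can be applied to a couple of hand-written rows. This is a sketch under stated assumptions: the rows are invented (not GSM8K data), enumerate stands in for map(..., with_indices=True), and extra_info stores the raw question as the output schema describes.

```python
import re

INSTRUCTION = ' Let\'s think step by step and output the final answer after "####".'


def extract_solution(solution_str):
    solution = re.search(r"#### (\-?[0-9\.\,]+)", solution_str)
    assert solution is not None
    return solution.group(0).split("#### ")[1].replace(",", "")


def make_map_fn(split):
    def process_fn(example, idx):
        question_raw = example.pop("question")
        answer_raw = example.pop("answer")
        return {
            "data_source": "openai/gsm8k",
            "prompt": [{"role": "user", "content": question_raw + INSTRUCTION}],
            "ability": "math",
            "reward_model": {"style": "rule", "ground_truth": extract_solution(answer_raw)},
            "extra_info": {"split": split, "index": idx, "answer": answer_raw, "question": question_raw},
        }
    return process_fn


# Invented rows standing in for the real dataset.
rows = [
    {"question": "Tom has 3 pens and buys 4 more. How many pens does he have?",
     "answer": "3 + 4 = 7\n#### 7"},
    {"question": "A shop sells 1,000 nails a day. How many does it sell in 2 days?",
     "answer": "1000 * 2 = 2000\n#### 2,000"},
]

process = make_map_fn("train")
# process_fn pops keys from its input, so pass a copy of each row.
records = [process(dict(row), idx) for idx, row in enumerate(rows)]
for record in records:
    print(record["extra_info"]["index"], record["reward_model"]["ground_truth"])
```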

Related Pages

Principle:Volcengine_Verl_Data_Preparation_For_RL
examples/data_preprocess/gsm8k.py -- Source script
Implementation:Volcengine_Verl_Dataset_To_Parquet -- Parquet export wrapper
Implementation:Volcengine_Verl_Datasets_Load_Dataset -- Dataset loading wrapper
