
Implementation:Volcengine Verl GSM8K Data Preprocessing

From Leeroopedia


Field Value
Knowledge Sources API Doc (verl data preprocessing)
Domains Data Preprocessing, Math Reasoning, Reward Model Configuration
Last Updated 2026-02-07

Overview

Description

This implementation preprocesses the GSM8K (Grade School Math 8K) dataset into a standardized Parquet format suitable for reinforcement learning training with verl. The script loads the openai/gsm8k dataset from HuggingFace, extracts ground-truth numeric solutions using a regex pattern, constructs chat-formatted prompts with an instruction-following suffix, and writes the results to train.parquet and test.parquet.

The key function extract_solution(solution_str) uses the regex pattern r"#### (\-?[0-9\.\,]+)" to locate the final numeric answer in each GSM8K solution string. The make_map_fn(split) closure returns a process_fn(example, idx) that transforms each raw example into the verl-standard schema with fields for data source, chat-style prompt, ability tag, reward model configuration, and extra metadata.

Usage

Execute the script directly from the command line:

python examples/data_preprocess/gsm8k.py --local_save_dir ~/data/gsm8k

Optional arguments:

  • --local_dataset_path -- Path to a local copy of the GSM8K dataset
  • --hdfs_dir -- HDFS path for remote storage copy

Code Reference

Attribute Detail
Source Location examples/data_preprocess/gsm8k.py, Lines 27-105
Signature (extract) def extract_solution(solution_str) -> str
Signature (map fn) def make_map_fn(split) -> Callable[[dict, int], dict]
Import Script executed directly: python examples/data_preprocess/gsm8k.py

I/O Contract

Inputs

Parameter Type Description
--local_save_dir str Directory where output Parquet files are saved (default: ~/data/gsm8k)
--local_dataset_path str (optional) Local path to a pre-downloaded GSM8K dataset
--hdfs_dir str (optional) HDFS directory for remote copy of output files
HuggingFace dataset openai/gsm8k Source dataset with question and answer columns
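The command-line interface in the table above can be sketched with `argparse`. This is an illustrative reconstruction, not the script's actual source; the flag names and the `--local_save_dir` default match the table, while everything else is an assumption.

```python
import argparse


def parse_args(argv=None):
    # Hypothetical sketch of the script's CLI, mirroring the Inputs table.
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_save_dir", default="~/data/gsm8k",
                        help="Directory where train.parquet / test.parquet are written")
    parser.add_argument("--local_dataset_path", default=None,
                        help="Optional local copy of the openai/gsm8k dataset")
    parser.add_argument("--hdfs_dir", default=None,
                        help="Optional HDFS directory for a remote copy of the outputs")
    return parser.parse_args(argv)


args = parse_args([])  # no flags given: fall back to defaults
print(args.local_save_dir)  # "~/data/gsm8k"
```

Passing an explicit list to `parse_args` (instead of reading `sys.argv`) makes the sketch easy to exercise in tests.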

Outputs

Output Type Description
train.parquet Parquet file Training split with standardized columns
test.parquet Parquet file Test split with standardized columns

Output column schema:

Column Type Description
data_source str Always "openai/gsm8k"
prompt list[dict] Chat-formatted prompt: [{"role": "user", "content": question + instruction}]
ability str Always "math"
reward_model dict {"style": "rule", "ground_truth": solution}
extra_info dict Contains split, index, answer (raw), question (raw)
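The column schema above can be checked mechanically. The validator below is an illustrative helper written for this page, not part of verl:

```python
def validate_record(record):
    # Check one processed record against the output column schema above.
    assert record["data_source"] == "openai/gsm8k"
    assert isinstance(record["prompt"], list)
    assert record["prompt"][0]["role"] == "user"
    assert record["ability"] == "math"
    rm = record["reward_model"]
    assert rm["style"] == "rule" and isinstance(rm["ground_truth"], str)
    # extra_info must carry at least these four keys.
    assert {"split", "index", "answer", "question"} <= set(record["extra_info"])
    return True


sample = {
    "data_source": "openai/gsm8k",
    "prompt": [{"role": "user", "content": "2 + 2 = ?"}],
    "ability": "math",
    "reward_model": {"style": "rule", "ground_truth": "4"},
    "extra_info": {"split": "train", "index": 0, "answer": "#### 4", "question": "2 + 2 = ?"},
}
print(validate_record(sample))  # True
```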

Usage Examples

Example 1: Extract numeric solution from GSM8K answer string

import re

def extract_solution(solution_str):
    # GSM8K solutions end with "#### <number>"; capture the numeric tail.
    solution = re.search(r"#### (\-?[0-9\.\,]+)", solution_str)
    assert solution is not None
    final_solution = solution.group(0)
    # Drop the "#### " prefix and strip thousands separators.
    final_solution = final_solution.split("#### ")[1].replace(",", "")
    return final_solution

answer = "The total is 3 + 4 = 7\n#### 7"
print(extract_solution(answer))  # Output: "7"
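The pattern also tolerates negative answers and thousands separators, which the comma-stripping step normalizes. A quick check (the function is repeated from Example 1 so this block is self-contained):

```python
import re


def extract_solution(solution_str):
    # Same logic as Example 1: take the number after "#### ", drop commas.
    solution = re.search(r"#### (\-?[0-9\.\,]+)", solution_str)
    assert solution is not None
    return solution.group(0).split("#### ")[1].replace(",", "")


print(extract_solution("Net change: -1,250\n#### -1,250"))  # "-1250"
print(extract_solution("Total cost is 3.50\n#### 3.50"))    # "3.50"
```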

Example 2: Transformed data record structure

# After processing, each record looks like:
record = {
    "data_source": "openai/gsm8k",
    "prompt": [
        {
            "role": "user",
            "content": 'Janet has 3 apples... Let\'s think step by step and output the final answer after "####".',
        }
    ],
    "ability": "math",
    "reward_model": {"style": "rule", "ground_truth": "7"},
    "extra_info": {
        "split": "train",
        "index": 0,
        "answer": "The total is 3 + 4 = 7\n#### 7",
        "question": "Janet has 3 apples...",
    },
}

Example 3: Full preprocessing pipeline

import os
import re

import datasets
from verl.utils.hdfs_io import copy, makedirs  # used only when --hdfs_dir is set

def extract_solution(solution_str):
    # See Example 1: pull the final numeric answer after "#### ".
    solution = re.search(r"#### (\-?[0-9\.\,]+)", solution_str)
    assert solution is not None
    return solution.group(0).split("#### ")[1].replace(",", "")

dataset = datasets.load_dataset("openai/gsm8k", "main")
train_dataset = dataset["train"]

def make_map_fn(split):
    def process_fn(example, idx):
        question = example.pop("question") + ' Let\'s think step by step and output the final answer after "####".'
        answer_raw = example.pop("answer")
        solution = extract_solution(answer_raw)
        return {
            "data_source": "openai/gsm8k",
            "prompt": [{"role": "user", "content": question}],
            "ability": "math",
            "reward_model": {"style": "rule", "ground_truth": solution},
            "extra_info": {"split": split, "index": idx, "answer": answer_raw, "question": question},
        }
    return process_fn

train_dataset = train_dataset.map(function=make_map_fn("train"), with_indices=True)
train_dataset.to_parquet(os.path.expanduser("~/data/gsm8k/train.parquet"))
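The map function can also be exercised offline on a hand-written example, with no dataset download. The sample question below is made up for illustration:

```python
import re


def extract_solution(solution_str):
    solution = re.search(r"#### (\-?[0-9\.\,]+)", solution_str)
    assert solution is not None
    return solution.group(0).split("#### ")[1].replace(",", "")


def make_map_fn(split):
    def process_fn(example, idx):
        # Append the instruction-following suffix to the raw question.
        question = example.pop("question") + ' Let\'s think step by step and output the final answer after "####".'
        answer_raw = example.pop("answer")
        return {
            "data_source": "openai/gsm8k",
            "prompt": [{"role": "user", "content": question}],
            "ability": "math",
            "reward_model": {"style": "rule", "ground_truth": extract_solution(answer_raw)},
            "extra_info": {"split": split, "index": idx, "answer": answer_raw, "question": question},
        }
    return process_fn


example = {"question": "Janet has 3 apples and buys 4 more. How many now?",
           "answer": "3 + 4 = 7\n#### 7"}
record = make_map_fn("train")(example, 0)
print(record["reward_model"]["ground_truth"])  # "7"
```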
