Implementation:Volcengine Verl Geo3K Data Preprocessing

Field	Value
Knowledge Sources	verl source code, Geometry3K data preprocessing example
Domains	Multimodal Data Preparation, VLM Training, Geometry
Last Updated	2026-02-07

Overview

Description

The Geometry3K data preprocessing script transforms the hiyouga/geometry3k HuggingFace dataset into verl's parquet training format with multimodal support. The key function make_map_fn(split) returns a mapping function that processes each example by:

Extracting the geometry problem text and appending a chain-of-thought instruction suffix that requests \boxed{} formatted answers.
Extracting the ground-truth answer string.
Extracting the associated geometry diagram images as PIL Image objects.
Constructing the standardized verl data row with "prompt" (chat message list), "images" (list of PIL images), "reward_model" (rule-based with ground truth), and "extra_info" metadata.

The critical VLM-specific aspect is the "images" column, which contains PIL Image objects that are later consumed by the RLHFDataset and processed through the VLM processor (e.g., Qwen2VLProcessor) to produce pixel values and image grid tensors.

The instruction-following suffix uses a think-then-answer format: "You FIRST think about the reasoning process as an internal monologue and then provide the final answer. The reasoning process MUST BE enclosed within <think> </think> tags. The final answer MUST BE put in \boxed{}."
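To make the requested output format concrete, here is a minimal sketch of checking whether a model response follows it and of pulling out the boxed answer. The helper names extract_boxed and follows_format are hypothetical and this is not verl's reward code; the sketch only assumes the <think> tag and \boxed{} conventions quoted above.

```python
import re
from typing import Optional

# Hypothetical helpers for illustration only; not part of verl.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)
BOXED_OPEN = "\\boxed{"


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None."""
    start = text.rfind(BOXED_OPEN)
    if start == -1:
        return None
    depth = 1  # we start just inside the opening brace of \boxed{
    chars = []
    for ch in text[start + len(BOXED_OPEN):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # the braces never balanced


def follows_format(response: str) -> bool:
    """True if the response has a <think> block and a boxed final answer."""
    return bool(THINK_BLOCK.search(response)) and extract_boxed(response) is not None


response = "<think>The angles sum to 180.</think> The answer is \\boxed{\\frac{1}{2}}."
print(follows_format(response))  # True
print(extract_boxed(response))   # \frac{1}{2}
```

Brace counting is used instead of a single regular expression so that nested LaTeX such as \frac{1}{2} inside the box is returned whole.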

Usage

Run the preprocessing script from the command line to generate train.parquet and test.parquet in the chosen output directory. Both files include the "images" column, which verl's dataset classes can load.

Code Reference

Field	Value
Source Location	`examples/data_preprocess/geo3k.py`, Lines 37-102
Key Function	`make_map_fn(split) -> Callable` (returns a `process_fn(example, idx)` closure)
Dataset	`hiyouga/geometry3k` from HuggingFace
Output Format	Parquet with columns: `data_source`, `prompt`, `images`, `ability`, `reward_model`, `extra_info`

I/O Contract

Inputs

Parameter	Type	Description
`example["problem"]`	`str`	The geometry problem text from the dataset.
`example["answer"]`	`str`	The ground-truth answer string.
`example["images"]`	`list[PIL.Image.Image]`	The geometry diagram images associated with the problem.
`split`	`str`	Dataset split identifier (`"train"` or `"test"`).
`idx`	`int`	Index of the example within the split.

Outputs

Field	Type	Description
`data_source`	`str`	Dataset identifier: `"hiyouga/geometry3k"`.
`prompt`	`list[dict]`	Single-turn chat message: `[{"role": "user", "content": problem + instruction}]`.
`images`	`list[PIL.Image.Image]`	Geometry diagram images as PIL Image objects.
`ability`	`str`	Task ability type: `"math"`.
`reward_model`	`dict`	`{"style": "rule", "ground_truth": answer}`.
`extra_info`	`dict`	Metadata: `{"split": ..., "index": ..., "answer": ..., "question": ...}`.
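
The output contract above can be checked with a small validator. This is a minimal sketch, assuming an in-memory row as returned by process_fn (types may differ once a row has been written to and read back from parquet). The names EXPECTED_TYPES and check_row are hypothetical, and the sample row uses a plain object as a stand-in for a PIL image.

```python
# Hypothetical schema check for illustration only; not part of verl.
EXPECTED_TYPES = {
    "data_source": str,
    "prompt": list,
    "images": list,
    "ability": str,
    "reward_model": dict,
    "extra_info": dict,
}
EXTRA_INFO_KEYS = {"split", "index", "answer", "question"}


def check_row(row):
    """Return a list of problems found in row; an empty list means it looks valid."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in row:
            problems.append(f"missing field: {field}")
        elif not isinstance(row[field], expected):
            problems.append(
                f"{field}: expected {expected.__name__}, got {type(row[field]).__name__}"
            )
    if problems:
        return problems  # skip nested checks when top-level fields are wrong

    messages = row["prompt"]
    if (
        len(messages) != 1
        or not isinstance(messages[0], dict)
        or messages[0].get("role") != "user"
    ):
        problems.append("prompt: expected a single user message")
    if row["reward_model"].get("style") != "rule":
        problems.append("reward_model: expected style 'rule'")
    if "ground_truth" not in row["reward_model"]:
        problems.append("reward_model: missing ground_truth")
    missing = EXTRA_INFO_KEYS - row["extra_info"].keys()
    if missing:
        problems.append(f"extra_info: missing keys {sorted(missing)}")
    return problems


sample_row = {
    "data_source": "hiyouga/geometry3k",
    "prompt": [{"role": "user", "content": "Find x. <instruction suffix>"}],
    "images": [object()],  # stand-in for a PIL image
    "ability": "math",
    "reward_model": {"style": "rule", "ground_truth": "3"},
    "extra_info": {"split": "train", "index": 0, "answer": "3", "question": "Find x."},
}
print(check_row(sample_row))  # []
```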

Usage Examples

Running the preprocessing script:

# Command line usage:
python examples/data_preprocess/geo3k.py --local_save_dir ~/data/geo3k
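
As a rough illustration of how a script could accept the output directory, the sketch below parses a --local_save_dir flag with argparse. Only the flag name is taken from the command above; the parse_args helper and the default value are assumptions, and the real script may define further options.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the output directory flag (hypothetical helper, not verl's code)."""
    parser = argparse.ArgumentParser(
        description="Preprocess Geometry3K into parquet files."
    )
    parser.add_argument(
        "--local_save_dir",
        default="~/data/geo3k",  # assumed default, mirroring the example command
        help="Directory where train.parquet and test.parquet are written.",
    )
    args = parser.parse_args(argv)
    # Expand "~" so the path can be passed directly to file-writing calls.
    args.local_save_dir = os.path.expanduser(args.local_save_dir)
    return args


args = parse_args(["--local_save_dir", "~/data/geo3k"])
print(os.path.join(args.local_save_dir, "train.parquet"))
```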

The make_map_fn function implementation:

# From examples/data_preprocess/geo3k.py, Lines 58-85

instruction_following = (
    r"You FIRST think about the reasoning process as an internal monologue "
    r"and then provide the final answer. "
    r"The reasoning process MUST BE enclosed within <think> </think> tags. "
    r"The final answer MUST BE put in \boxed{}."
)

def make_map_fn(split):
    def process_fn(example, idx):
        problem = example.pop("problem")
        prompt = problem + " " + instruction_following
        answer = example.pop("answer")
        images = example.pop("images")

        data = {
            "data_source": "hiyouga/geometry3k",
            "prompt": [
                {
                    "role": "user",
                    "content": prompt,
                }
            ],
            "images": images,           # PIL Image objects for VLM processing
            "ability": "math",
            "reward_model": {"style": "rule", "ground_truth": answer},
            "extra_info": {
                "split": split,
                "index": idx,
                "answer": answer,
                "question": problem,
            },
        }
        return data

    return process_fn

# Apply the mapping to the train and test splits (loaded in the full pipeline below)
train_dataset = train_dataset.map(
    function=make_map_fn("train"), with_indices=True, num_proc=8
)
test_dataset = test_dataset.map(
    function=make_map_fn("test"), with_indices=True, num_proc=8
)

Full preprocessing pipeline:

import os
import datasets

# Load the dataset from the HuggingFace Hub
data_source = "hiyouga/geometry3k"
dataset = datasets.load_dataset(data_source)

train_dataset = dataset["train"]
test_dataset = dataset["test"]

# make_map_fn is the function defined in the previous snippet
train_dataset = train_dataset.map(function=make_map_fn("train"), with_indices=True, num_proc=8)
test_dataset = test_dataset.map(function=make_map_fn("test"), with_indices=True, num_proc=8)

# Write one parquet file per split to the output directory
local_save_dir = os.path.expanduser("~/data/geo3k")
train_dataset.to_parquet(os.path.join(local_save_dir, "train.parquet"))
test_dataset.to_parquet(os.path.join(local_save_dir, "test.parquet"))

Related Pages

Principle:Volcengine_Verl_Multimodal_Data_Preparation
