Implementation:Volcengine Verl Geo3K Data Preprocessing
| Field | Value |
|---|---|
| Knowledge Sources | verl source code, Geometry3K data preprocessing example |
| Domains | Multimodal Data Preparation, VLM Training, Geometry |
| Last Updated | 2026-02-07 |
Overview
Description
The Geometry3K data preprocessing script transforms the hiyouga/geometry3k HuggingFace dataset into verl's parquet training format with multimodal support. The key function make_map_fn(split) returns a mapping function that processes each example by:
- Extracting the geometry problem text and appending a chain-of-thought instruction suffix that requests `\boxed{}`-formatted answers.
- Extracting the ground-truth answer string.
- Extracting the associated geometry diagram images as PIL Image objects.
- Constructing the standardized verl data row with "prompt" (chat message list), "images" (list of PIL images), "reward_model" (rule-based with ground truth), and "extra_info" metadata.
The critical VLM-specific aspect is the "images" column, which contains PIL Image objects that are later consumed by the RLHFDataset and processed through the VLM processor (e.g., Qwen2VLProcessor) to produce pixel values and image grid tensors.
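The steps above pair a `\boxed{}` instruction suffix with a rule-based reward that stores the ground truth, which suggests the final answer is meant to be parsed out of the model response. The sketch below shows one way such parsing could work. It is not verl's reward function, which is not shown in this page; `extract_boxed` and `follows_format` are hypothetical helpers introduced only for illustration.

```python
import re
from typing import Optional


def extract_boxed(response: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in response, or None."""
    marker = r"\boxed{"
    start = response.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1  # we are inside the opening brace of \boxed{
    chars = []
    while i < len(response):
        ch = response[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
        i += 1
    return None  # braces never balanced


def follows_format(response: str) -> bool:
    """Hypothetical check: a <think>...</think> block followed by a boxed answer."""
    pattern = r"\s*<think>.*?</think>.*\\boxed\{.*\}.*"
    return re.fullmatch(pattern, response, re.DOTALL) is not None


response = "<think>The angles sum to 180.</think> The answer is \\boxed{\\frac{1}{2}}."
print(extract_boxed(response))
print(follows_format(response))
```

Counting brace depth rather than using a single regular expression keeps nested LaTeX such as `\frac{1}{2}` intact inside the box.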
The instruction-following prompt uses a think-then-answer format:
"You FIRST think about the reasoning process as an internal monologue and then provide the final answer. The reasoning process MUST BE enclosed within <think> </think> tags. The final answer MUST BE put in \boxed{}."
Usage
Run the preprocessing script from the command line to generate train and test parquet files. The output parquet files include image columns that can be loaded by verl's dataset classes.
Code Reference
| Field | Value |
|---|---|
| Source Location | examples/data_preprocess/geo3k.py, Lines 37-102 |
| Key Function | make_map_fn(split) -> Callable (returns a process_fn(example, idx) closure) |
| Dataset | hiyouga/geometry3k from HuggingFace |
| Output Format | Parquet with columns: data_source, prompt, images, ability, reward_model, extra_info |
I/O Contract
Inputs
| Parameter | Type | Description |
|---|---|---|
| example["problem"] | str | The geometry problem text from the dataset. |
| example["answer"] | str | The ground-truth answer string. |
| example["images"] | list[PIL.Image] | The geometry diagram images associated with the problem. |
| split | str | Dataset split identifier ("train" or "test"). |
| idx | int | Index of the example within the split. |
Outputs
| Field | Type | Description |
|---|---|---|
| data_source | str | Dataset identifier: "hiyouga/geometry3k". |
| prompt | list[dict] | Single-turn chat message: [{"role": "user", "content": problem + instruction}]. |
| images | list[PIL.Image] | Geometry diagram images as PIL Image objects. |
| ability | str | Task ability type: "math". |
| reward_model | dict | {"style": "rule", "ground_truth": answer}. |
| extra_info | dict | Metadata: {"split": ..., "index": ..., "answer": ..., "question": ...}. |
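The output contract can be checked mechanically before writing parquet. The sketch below is a hypothetical `validate_row` helper (not part of verl) that compares an in-memory row, as returned by process_fn, against the fields and types listed in the table. It only checks container types, so it does not verify that the items in "images" are PIL images, and rows read back from parquet may use different container types.

```python
from typing import Any, List, Mapping

# Field names and container types taken from the Outputs table.
EXPECTED_TYPES = {
    "data_source": str,
    "prompt": list,
    "images": list,
    "ability": str,
    "reward_model": dict,
    "extra_info": dict,
}

# Keys each nested dict is expected to carry.
REQUIRED_SUBKEYS = {
    "reward_model": ("style", "ground_truth"),
    "extra_info": ("split", "index", "answer", "question"),
}


def validate_row(row: Mapping[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the row passes."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in row:
            problems.append(f"missing field: {field}")
        elif not isinstance(row[field], expected):
            actual = type(row[field]).__name__
            problems.append(f"{field}: expected {expected.__name__}, got {actual}")
    for field, subkeys in REQUIRED_SUBKEYS.items():
        value = row.get(field)
        if isinstance(value, dict):
            for key in subkeys:
                if key not in value:
                    problems.append(f"{field}: missing key {key}")
    prompt = row.get("prompt")
    if isinstance(prompt, list):
        for i, message in enumerate(prompt):
            if not (isinstance(message, dict) and {"role", "content"} <= message.keys()):
                problems.append(f"prompt[{i}]: expected a dict with role and content")
    return problems
```

Returning a list of problems instead of raising on the first one makes it easier to report every schema violation for a row at once.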
Usage Examples
Running the preprocessing script:
```bash
# Command line usage:
python examples/data_preprocess/geo3k.py --local_save_dir ~/data/geo3k
```
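For readers unfamiliar with how a flag such as `--local_save_dir` reaches the script, the following is a minimal sketch of the argument handling. It is an assumption about the general shape, not a copy of geo3k.py: the `parse_args` helper and the default value are illustrative, and the real script may define additional options.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the output directory flag (hypothetical helper, illustrative default)."""
    parser = argparse.ArgumentParser(description="Preprocess Geometry3K into parquet (sketch).")
    parser.add_argument(
        "--local_save_dir",
        default="~/data/geo3k",  # assumed default, chosen to match the example command
        help="Directory that will receive train.parquet and test.parquet.",
    )
    args = parser.parse_args(argv)
    # Expand "~" explicitly, since the shell does not expand it inside quoted arguments.
    args.local_save_dir = os.path.expanduser(args.local_save_dir)
    return args


if __name__ == "__main__":
    print(parse_args().local_save_dir)
```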
The make_map_fn function implementation:
```python
# From examples/data_preprocess/geo3k.py, Lines 58-85
instruction_following = (
    r"You FIRST think about the reasoning process as an internal monologue "
    r"and then provide the final answer. "
    r"The reasoning process MUST BE enclosed within <think> </think> tags. "
    r"The final answer MUST BE put in \boxed{}."
)

def make_map_fn(split):
    def process_fn(example, idx):
        problem = example.pop("problem")
        prompt = problem + " " + instruction_following
        answer = example.pop("answer")
        images = example.pop("images")
        data = {
            "data_source": "hiyouga/geometry3k",
            "prompt": [
                {
                    "role": "user",
                    "content": prompt,
                }
            ],
            "images": images,  # PIL Image objects for VLM processing
            "ability": "math",
            "reward_model": {"style": "rule", "ground_truth": answer},
            "extra_info": {
                "split": split,
                "index": idx,
                "answer": answer,
                "question": problem,
            },
        }
        return data
    return process_fn

# Apply the mapping to train and test splits
train_dataset = train_dataset.map(
    function=make_map_fn("train"), with_indices=True, num_proc=8
)
test_dataset = test_dataset.map(
    function=make_map_fn("test"), with_indices=True, num_proc=8
)
```
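The process_fn closure can be exercised on a single hand-written example without downloading the dataset. The sketch below repeats a condensed copy of make_map_fn so that it runs on its own; the example values are made up, and a plain string stands in for the PIL image object.

```python
# Same suffix as in the script; raw strings keep the backslash in \boxed{} literal.
instruction_following = (
    r"You FIRST think about the reasoning process as an internal monologue "
    r"and then provide the final answer. "
    r"The reasoning process MUST BE enclosed within <think> </think> tags. "
    r"The final answer MUST BE put in \boxed{}."
)


def make_map_fn(split):
    # `split` is captured by the closure and written into extra_info.
    def process_fn(example, idx):
        problem = example.pop("problem")
        answer = example.pop("answer")
        images = example.pop("images")
        return {
            "data_source": "hiyouga/geometry3k",
            "prompt": [{"role": "user", "content": problem + " " + instruction_following}],
            "images": images,
            "ability": "math",
            "reward_model": {"style": "rule", "ground_truth": answer},
            "extra_info": {"split": split, "index": idx, "answer": answer, "question": problem},
        }
    return process_fn


# Made-up example; the string stands in for a PIL image object.
fake_example = {"problem": "Find x.", "answer": "3", "images": ["<image placeholder>"]}
row = make_map_fn("test")(fake_example, 7)

print(row["prompt"][0]["content"])
print(row["extra_info"])
print(fake_example)  # the three input keys were popped, so this dict is now empty
```

Note that process_fn mutates its input: the three `pop` calls remove "problem", "answer", and "images" from the example passed in, so callers that need the original dict should pass a copy.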
Full preprocessing pipeline:
```python
import os

import datasets

# make_map_fn is the function shown in the previous listing.
data_source = "hiyouga/geometry3k"
dataset = datasets.load_dataset(data_source)
train_dataset = dataset["train"]
test_dataset = dataset["test"]

# Convert every example to the verl row format.
train_dataset = train_dataset.map(function=make_map_fn("train"), with_indices=True, num_proc=8)
test_dataset = test_dataset.map(function=make_map_fn("test"), with_indices=True, num_proc=8)

# Write one parquet file per split.
local_save_dir = os.path.expanduser("~/data/geo3k")
train_dataset.to_parquet(os.path.join(local_save_dir, "train.parquet"))
test_dataset.to_parquet(os.path.join(local_save_dir, "test.parquet"))
```