Implementation:Volcengine Verl Geo3K Data Preprocessing

From Leeroopedia
Revision as of 17:07, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Volcengine_Verl_Geo3K_Data_Preprocessing.md)


Field | Value
Knowledge Sources | verl source code, Geometry3K data preprocessing example
Domains | Multimodal Data Preparation, VLM Training, Geometry
Last Updated | 2026-02-07

Overview

Description

The Geometry3K data preprocessing script transforms the hiyouga/geometry3k HuggingFace dataset into verl's parquet training format with multimodal support. The key function make_map_fn(split) returns a mapping function that processes each example by:

  1. Extracting the geometry problem text and appending a chain-of-thought instruction suffix that requests \boxed{} formatted answers.
  2. Extracting the ground-truth answer string.
  3. Extracting the associated geometry diagram images as PIL Image objects.
  4. Constructing the standardized verl data row with "prompt" (chat message list), "images" (list of PIL images), "reward_model" (rule-based with ground truth), and "extra_info" metadata.

The critical VLM-specific aspect is the "images" column, which contains PIL Image objects that are later consumed by the RLHFDataset and processed through the VLM processor (e.g., Qwen2VLProcessor) to produce pixel values and image grid tensors.

The instruction-following suffix appended to each problem uses a think-then-answer format: "You FIRST think about the reasoning process as an internal monologue and then provide the final answer. The reasoning process MUST BE enclosed within <think> </think> tags. The final answer MUST BE put in \boxed{}."
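Because the reward is rule-based and the prompt asks for a \boxed{} answer, a scorer has to recover that answer from the model response. The sketch below shows one way this could be done with brace matching so that nested LaTeX such as \frac{1}{2} survives. The names extract_boxed and follows_format are hypothetical and are not taken from verl; the reward function verl actually uses for this dataset is not shown in this page and may differ.

```python
import re
from typing import Optional

BOXED_PREFIX = r"\boxed{"

# Hypothetical format check: reasoning in <think> tags first, then a boxed answer.
THINK_THEN_BOXED = re.compile(r"<think>.*?</think>.*\\boxed\{.*\}.*", re.DOTALL)


def extract_boxed(text: str) -> Optional[str]:
    # Return the contents of the last \boxed{...} in text, or None if absent
    # or if its braces are unbalanced.
    start = text.rfind(BOXED_PREFIX)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(BOXED_PREFIX):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None


def follows_format(response: str) -> bool:
    # True when the response starts with a <think> block and later boxes an answer.
    return THINK_THEN_BOXED.fullmatch(response.strip()) is not None


response = r"<think>Half of base times height.</think> The answer is \boxed{\frac{1}{2}}."
answer = extract_boxed(response)
```

A scorer built on this would typically compare the extracted string against reward_model["ground_truth"], usually after some normalization that is outside the scope of this sketch.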

Usage

Run the preprocessing script from the command line to generate train and test parquet files. The output parquet files include image columns that can be loaded by verl's dataset classes.

Code Reference

Field | Value
Source Location | examples/data_preprocess/geo3k.py, Lines 37-102
Key Function | make_map_fn(split) -> Callable (returns a process_fn(example, idx) closure)
Dataset | hiyouga/geometry3k from HuggingFace
Output Format | Parquet with columns: data_source, prompt, images, ability, reward_model, extra_info

I/O Contract

Inputs

Parameter | Type | Description
example["problem"] | str | The geometry problem text from the dataset.
example["answer"] | str | The ground-truth answer string.
example["images"] | list[PIL.Image] | The geometry diagram images associated with the problem.
split | str | Dataset split identifier ("train" or "test").
idx | int | Index of the example within the split.

Outputs

Field | Type | Description
data_source | str | Dataset identifier: "hiyouga/geometry3k".
prompt | list[dict] | Single-turn chat message: [{"role": "user", "content": problem + instruction}].
images | list[PIL.Image] | Geometry diagram images as PIL Image objects.
ability | str | Task ability type: "math".
reward_model | dict | {"style": "rule", "ground_truth": answer}.
extra_info | dict | Metadata: {"split": ..., "index": ..., "answer": ..., "question": ...}.
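The output contract above can be checked mechanically before writing parquet files. The following is a minimal sketch of such a check, assuming rows are the plain dicts returned by process_fn; REQUIRED_FIELDS and validate_row are hypothetical helpers written for illustration and are not part of verl. A placeholder object stands in for a PIL image so the example has no third-party dependencies.

```python
# Top-level fields and their container types, as listed in the Outputs table.
REQUIRED_FIELDS = {
    "data_source": str,
    "prompt": list,
    "images": list,
    "ability": str,
    "reward_model": dict,
    "extra_info": dict,
}


def validate_row(row):
    # Return a list of human-readable problems; an empty list means the row conforms.
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in row:
            problems.append(f"missing field: {name}")
        elif not isinstance(row[name], expected):
            problems.append(f"{name}: expected {expected.__name__}")
    if problems:
        return problems

    prompt = row["prompt"]
    if len(prompt) != 1 or not isinstance(prompt[0], dict) or prompt[0].get("role") != "user":
        problems.append("prompt: expected exactly one user message")
    for key in ("style", "ground_truth"):
        if key not in row["reward_model"]:
            problems.append(f"reward_model: missing key {key}")
    for key in ("split", "index", "answer", "question"):
        if key not in row["extra_info"]:
            problems.append(f"extra_info: missing key {key}")
    return problems


example_row = {
    "data_source": "hiyouga/geometry3k",
    "prompt": [{"role": "user", "content": "Find x. (instruction suffix here)"}],
    "images": [object()],  # placeholder standing in for a PIL image
    "ability": "math",
    "reward_model": {"style": "rule", "ground_truth": "6"},
    "extra_info": {"split": "train", "index": 0, "answer": "6", "question": "Find x."},
}
problems = validate_row(example_row)
```

This check applies to rows as produced by the mapping function. After a round trip through parquet the images column may be represented differently, so the list check on images should be treated as an assumption that holds only before serialization.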

Usage Examples

Running the preprocessing script:

# Command line usage:
# python examples/data_preprocess/geo3k.py --local_save_dir ~/data/geo3k
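For readers adapting the script, the command above implies an argument parser that accepts --local_save_dir. A minimal sketch of how such a flag could be parsed is shown here. The parse_args helper is hypothetical, the default value is an assumption borrowed from the path used elsewhere on this page, and the real script may define additional options.

```python
import argparse
import os


def parse_args(argv=None):
    # Hypothetical parser; only --local_save_dir is taken from the command shown above.
    parser = argparse.ArgumentParser(description="Preprocess Geometry3K into parquet files.")
    parser.add_argument(
        "--local_save_dir",
        default="~/data/geo3k",  # assumed default, not confirmed by the source
        help="Directory where train.parquet and test.parquet are written.",
    )
    args = parser.parse_args(argv)
    # Expand "~" so later os.path.join calls produce an absolute path.
    args.local_save_dir = os.path.expanduser(args.local_save_dir)
    return args


args = parse_args(["--local_save_dir", "/tmp/geo3k"])
```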

The make_map_fn function implementation:

# From examples/data_preprocess/geo3k.py, Lines 58-85

instruction_following = (
    r"You FIRST think about the reasoning process as an internal monologue "
    r"and then provide the final answer. "
    r"The reasoning process MUST BE enclosed within <think> </think> tags. "
    r"The final answer MUST BE put in \boxed{}."
)

def make_map_fn(split):
    def process_fn(example, idx):
        problem = example.pop("problem")
        prompt = problem + " " + instruction_following
        answer = example.pop("answer")
        images = example.pop("images")

        data = {
            "data_source": "hiyouga/geometry3k",
            "prompt": [
                {
                    "role": "user",
                    "content": prompt,
                }
            ],
            "images": images,           # PIL Image objects for VLM processing
            "ability": "math",
            "reward_model": {"style": "rule", "ground_truth": answer},
            "extra_info": {
                "split": split,
                "index": idx,
                "answer": answer,
                "question": problem,
            },
        }
        return data

    return process_fn

# Apply the mapping to the train and test splits.
# train_dataset and test_dataset are loaded with datasets.load_dataset,
# as shown in the full pipeline below.
train_dataset = train_dataset.map(
    function=make_map_fn("train"), with_indices=True, num_proc=8
)
test_dataset = test_dataset.map(
    function=make_map_fn("test"), with_indices=True, num_proc=8
)

Full preprocessing pipeline:

import os
import datasets

data_source = "hiyouga/geometry3k"
dataset = datasets.load_dataset(data_source)

train_dataset = dataset["train"]
test_dataset = dataset["test"]

# make_map_fn is the function defined in the previous listing.
train_dataset = train_dataset.map(function=make_map_fn("train"), with_indices=True, num_proc=8)
test_dataset = test_dataset.map(function=make_map_fn("test"), with_indices=True, num_proc=8)

local_save_dir = os.path.expanduser("~/data/geo3k")
os.makedirs(local_save_dir, exist_ok=True)  # make sure the output directory exists before writing
train_dataset.to_parquet(os.path.join(local_save_dir, "train.parquet"))
test_dataset.to_parquet(os.path.join(local_save_dir, "test.parquet"))
