Implementation:Open compass VLMEvalKit WeMath Utils
| Field | Value |
|---|---|
| source | VLMEvalKit |
| domain | Vision, Evaluation, Mathematics, Multi-step Reasoning |
Overview
Provides evaluation utilities for the WeMath benchmark, implementing four-dimensional metrics for multi-step mathematical reasoning assessment.
Description
This module implements evaluate_evaluate_steps for evaluating individual knowledge concept steps and evaluate_process_steps_data for merging multi-step evaluation results. The load_and_process_data function loads a prediction file, extracts each answer from the model response by taking the first letter after "Answer", and computes a per-step correctness score (joker). It accepts both pre-scored data (with a 'hit' column) and raw predictions that still need answer extraction. The four-dimensional evaluation framework then assesses knowledge concept mastery across the reasoning steps by merging the step-wise results into one consolidated evaluation.
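The answer-extraction and joker-scoring step described above can be sketched roughly as follows. This is a minimal, dependency-free illustration and not the module's actual code: the helper names extract_choice and add_joker, the "prediction" and "answer" row keys, and the A-H option set are all assumptions made for the example.

```python
import re
from typing import Dict, List, Optional

# Assumption: valid option letters are A-H; adjust to the benchmark's real option set.
VALID_OPTIONS = set("ABCDEFGH")

# Matches the first letter that follows the word "Answer", skipping any
# punctuation or whitespace in between (e.g. "Answer: B", "Answer <B>").
_AFTER_ANSWER = re.compile(r"Answer[^A-Za-z]*([A-Za-z])")


def extract_choice(response: str) -> Optional[str]:
    """Return the option letter after the first "Answer" marker, or None."""
    match = _AFTER_ANSWER.search(response)
    if match is None:
        return None
    letter = match.group(1)
    return letter if letter in VALID_OPTIONS else None


def add_joker(rows: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Attach a boolean 'joker' flag to each row (hypothetical row layout)."""
    scored = []
    for row in rows:
        choice = extract_choice(row["prediction"])
        # A row is correct only when a letter was found and it equals the gold answer.
        scored.append({**row, "extracted": choice, "joker": choice == row["answer"]})
    return scored
```

In this sketch a response with no recognizable option letter yields None and therefore a joker of False, which keeps unparseable outputs from being counted as correct.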
Usage
Called internally by the WeMath dataset class during multi-step mathematical reasoning evaluation.
Code Reference
- Source: vlmeval/dataset/utils/wemath.py, Lines: L1-898
- Import: from vlmeval.dataset.utils.wemath import load_and_process_data, evaluate_evaluate_steps
Key Functions:
def evaluate_evaluate_steps(json, steps): ...
def load_and_process_data(filepath): ...
def evaluate_process_steps_data(df, steps): ...
I/O Contract
| Direction | Description |
|---|---|
| Inputs | Scored data file path or DataFrame with prediction/answer columns; number of reasoning steps to evaluate |
| Outputs | DataFrame with per-step joker (correctness) scores; merged multi-step evaluation DataFrame with knowledge concept mappings |
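To make the output side of this contract more concrete, here is a rough sketch of how per-step joker rows might be merged into one record per problem. It uses plain dictionaries rather than a DataFrame to stay self-contained. The function name merge_steps and the row keys "id", "step", and "joker" are hypothetical, so the real column names in the module may differ.

```python
from collections import defaultdict
from typing import Dict, List, Union

StepKey = Union[int, str]  # 1..steps for sub-problems, "multi" for the full problem


def merge_steps(rows: List[Dict[str, object]], steps: int) -> List[Dict[str, object]]:
    """Collapse per-step rows into one record per problem.

    Assumed (hypothetical) row shape:
        {"id": str, "step": 1..steps or "multi", "joker": bool}
    """
    grouped: Dict[str, Dict[StepKey, bool]] = defaultdict(dict)
    for row in rows:
        grouped[row["id"]][row["step"]] = row["joker"]

    expected: List[StepKey] = list(range(1, steps + 1)) + ["multi"]
    merged = []
    for problem_id, by_step in grouped.items():
        # Skip problems that are missing any sub-step or the multi-step result.
        if not all(key in by_step for key in expected):
            continue
        record: Dict[str, object] = {"id": problem_id}
        for k in range(1, steps + 1):
            record[f"joker_{k}"] = by_step[k]
        record["joker_multi"] = by_step["multi"]
        merged.append(record)
    return merged
```

Dropping incomplete problems is a design choice of this sketch. An inner join across per-step tables would have the same effect in a DataFrame-based version.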
Usage Examples
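# Internal usage example
from vlmeval.dataset.utils.wemath import load_and_process_data, evaluate_process_steps_data
# Load the prediction file and compute per-step joker (correctness) scores
df = load_and_process_data("predictions.xlsx")
# Merge the step-wise results, here with steps=3
merged = evaluate_process_steps_data(df, steps=3)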
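Once step-wise and multi-step correctness are merged, each problem can be placed into one of four categories. The sketch below assumes the category definitions as described in the WeMath paper: Complete Mastery (CM), Inadequate Generalization (IG), Rote Memorization (RM), and Insufficient Knowledge (IK). The function name classify_problem is hypothetical, and the module's own bookkeeping, including any strict or loose variants, may differ from this simplified rule.

```python
from typing import List


def classify_problem(step_jokers: List[bool], multi_joker: bool) -> str:
    """Map sub-step and multi-step correctness to a four-way label.

    Assumed rule (simplified, not taken from the module's source):
      CM: every sub-step correct and the multi-step problem correct
      IG: every sub-step correct but the multi-step problem wrong
      RM: multi-step problem correct although some sub-step is wrong
      IK: multi-step problem wrong and some sub-step wrong
    """
    all_steps_correct = all(step_jokers)
    if all_steps_correct and multi_joker:
        return "CM"
    if all_steps_correct:
        return "IG"
    if multi_joker:
        return "RM"
    return "IK"


# Example: a two-step problem solved end to end despite a failed sub-step.
label = classify_problem([True, False], multi_joker=True)
```

Counting these labels over all merged problems would give the per-category rates that a four-dimensional report summarizes.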