Implementation:Open compass VLMEvalKit WeMath Utils

Field	Value
source	VLMEvalKit
domain	Vision, Evaluation, Mathematics, Multi-step Reasoning

Overview

Provides evaluation utilities for the WeMath benchmark, implementing four-dimensional metrics for multi-step mathematical reasoning assessment.

Description

This module implements evaluate_evaluate_steps, which evaluates individual knowledge concept steps, and evaluate_process_steps_data, which merges multi-step evaluation results. The load_and_process_data function loads a prediction file, extracts the answer from each model response by taking the first letter after "Answer", and computes a per-step correctness (joker) score. It accepts both pre-scored data (with a 'hit' column) and raw predictions that still require answer extraction.

The four-dimensional evaluation framework assesses knowledge concept mastery across multiple reasoning steps by merging the step-wise results into one consolidated evaluation.
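The answer-extraction and joker logic described above can be sketched roughly as follows. This is a minimal illustration rather than the module's actual code: the helper names extract_choice and joker, the regular expression, the A-H option range, and the use of plain dicts instead of DataFrame rows are all assumptions made to keep the example dependency-free.

```python
import re

# Assumed pattern: skip any non-letters after "Answer", then capture one
# option letter. The A-H range is an assumption, not taken from the module.
_CHOICE = re.compile(r"Answer[^A-Za-z]*([A-H])")


def extract_choice(response):
    """Return the option letter that follows 'Answer', or None if absent."""
    match = _CHOICE.search(str(response))
    return match.group(1) if match else None


def joker(row):
    """Per-step correctness for one record (hypothetical dict schema).

    Pre-scored records carry a 'hit' value that is reused directly;
    raw records need the letter extracted and compared with 'answer'.
    """
    if "hit" in row:
        return bool(row["hit"])
    return extract_choice(row["prediction"]) == row["answer"]


rows = [
    {"prediction": "The area doubles. Answer: B", "answer": "B"},
    {"prediction": "<Answer>: C", "answer": "A"},
    {"prediction": "I am not sure.", "answer": "D"},
    {"hit": 1},  # pre-scored record
]
print([joker(r) for r in rows])  # [True, False, False, True]
```

A response with no recognizable option letter after "Answer" yields None, which never equals a ground-truth letter, so it is scored as incorrect in this sketch.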

Usage

Called internally by the WeMath dataset class during multi-step mathematical reasoning evaluation.

Code Reference

Source: vlmeval/dataset/utils/wemath.py, Lines: L1-898
Import: from vlmeval.dataset.utils.wemath import load_and_process_data, evaluate_evaluate_steps

Key Functions:

def evaluate_evaluate_steps(json, steps): ...
def load_and_process_data(filepath): ...
def evaluate_process_steps_data(df, steps): ...

I/O Contract

Direction	Description
Inputs	Scored data file path or DataFrame with prediction/answer columns; number of reasoning steps to evaluate
Outputs	DataFrame with per-step joker (correctness) scores; merged multi-step evaluation DataFrame with knowledge concept mappings
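To make the merged multi-step output more concrete, the sketch below pivots per-step joker rows into one row per problem and then assigns one of four outcome labels. Everything specific here is an assumption for illustration: the column names ID and key, key values such as '2steps_1' and '2steps_multi', the helper names merge_steps and classify, and the simplified labelling rule (IK, IG, CM, RM, following a reading of the We-Math paper's four categories). It is not the module's implementation.

```python
import pandas as pd


def merge_steps(df, steps):
    """Pivot long per-step rows into one row per problem (assumed schema)."""
    # Assumed columns: 'ID' (problem id), 'key' (e.g. '2steps_1',
    # '2steps_multi'), and 'joker' (per-step correctness flag).
    prefix = f"{steps}steps_"
    subset = df[df["key"].str.startswith(prefix)]
    wide = subset.pivot(index="ID", columns="key", values="joker")
    # Strip the prefix so the columns read '1', '2', ..., 'multi'.
    wide.columns = [col[len(prefix):] for col in wide.columns]
    return wide.reset_index()


def classify(row, steps):
    """Simplified four-way label from one-step and multi-step correctness."""
    one_step_ok = all(bool(row[str(i)]) for i in range(1, steps + 1))
    multi_ok = bool(row["multi"])
    if multi_ok:
        # Multi-step solved: complete mastery, or solved despite step errors.
        return "CM" if one_step_ok else "RM"
    # Multi-step failed: generalization gap, or missing underlying knowledge.
    return "IG" if one_step_ok else "IK"


records = [
    ("p1", "2steps_1", True), ("p1", "2steps_2", True), ("p1", "2steps_multi", True),
    ("p2", "2steps_1", True), ("p2", "2steps_2", True), ("p2", "2steps_multi", False),
    ("p3", "2steps_1", False), ("p3", "2steps_2", True), ("p3", "2steps_multi", True),
    ("p4", "2steps_1", False), ("p4", "2steps_2", False), ("p4", "2steps_multi", False),
]
df = pd.DataFrame(records, columns=["ID", "key", "joker"])

merged = merge_steps(df, steps=2)
merged["category"] = [classify(row, 2) for _, row in merged.iterrows()]
print(merged)
```

Pivoting to a wide layout keeps every step of a problem on one row, so the labelling rule reduces to a few boolean checks per row.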

Usage Examples

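# Internal usage example
from vlmeval.dataset.utils.wemath import load_and_process_data, evaluate_process_steps_data

# Load the prediction file and compute the per-step joker (correctness) column.
df = load_and_process_data("predictions.xlsx")
# Merge the step-wise results for 3 reasoning steps.
merged = evaluate_process_steps_data(df, steps=3)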

Related Pages

Principle:Open_compass_VLMEvalKit_Benchmark_Dataset_Construction
