

Implementation:Sail sg LongSpec Benchmark Eval Script

From Leeroopedia
Knowledge Sources
Domains NLP, Evaluation, Mathematics
Last Updated 2026-02-14 05:00 GMT

Overview

Concrete tool for scoring model predictions on eight or more math benchmarks, combining a shared correctness check with per-benchmark answer parsing.

Description

The eval_script.py module provides evaluation functions for multiple math benchmarks: MATH (eval_math), GSM8K (eval_last_single_answer), AGIEval Gaokao MathCloze (eval_agieval_gaokao_math_cloze), AGIEval Gaokao MathQA (eval_agieval_gaokao_mathqa), SAT (eval_math_sat), MMLU-STEM (eval_mmlu_stem), OCW Courses (eval_ocwcourses), and MiniF2F-Isabelle (eval_minif2f_isabelle). Each function takes an item dict that holds the model output under "prediction" (or under another key passed as pred_key) and the reference under "answer". The functions return a boolean correctness result, except eval_ocwcourses, whose signature declares an int return (1 for a match).

Usage

Import the appropriate evaluation function for the target benchmark. These functions are referenced by DeepSeekMathCallBack via its eval_fns mapping to evaluate model outputs during post-processing.
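
As a rough illustration of the mapping idea described above, the sketch below dispatches an item to an evaluator by benchmark name. The dictionary keys, the `EVAL_FNS` name, and the `evaluate` helper are assumptions made for this example, and the two evaluators are simplified stand-ins so the block runs without the repository; the actual contents of `eval_fns` in DeepSeekMathCallBack are not shown on this page.

```python
from typing import Callable, Dict


# Hypothetical stand-ins so the sketch runs on its own. In a real setup these
# names would be imported from eval_script.py instead of being redefined.
def eval_math(item: dict, pred_key: str = "prediction", prec: float = 1e-3) -> bool:
    return item[pred_key] == item["answer"]


def eval_last_single_answer(item: dict, pred_key: str = "prediction", prec: float = 1e-3) -> bool:
    return item[pred_key] == item["answer"]


# Assumed benchmark keys; the real mapping may use different names.
EVAL_FNS: Dict[str, Callable[..., bool]] = {
    "math": eval_math,
    "gsm8k": eval_last_single_answer,
}


def evaluate(benchmark: str, item: dict) -> bool:
    """Look up the evaluator registered for a benchmark and apply it to one item."""
    try:
        fn = EVAL_FNS[benchmark]
    except KeyError:
        raise ValueError(f"no evaluator registered for benchmark {benchmark!r}") from None
    # bool() normalizes evaluators that return 0/1 instead of False/True.
    return bool(fn(item))
```

Keeping the lookup in one place means an unknown benchmark name fails loudly instead of silently skipping evaluation.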

Code Reference

Source Location

Signature

def is_correct(item: dict, pred_key: str = 'prediction', prec: float = 1e-3) -> bool:
    """Core correctness check using math_equal with list/union support."""

def eval_math(item: dict, pred_key: str = 'prediction', prec: float = 1e-3) -> bool:
    """Evaluate MATH benchmark with deduplication and list answer support."""

def eval_last_single_answer(item: dict, pred_key: str = 'prediction', prec: float = 1e-3) -> bool:
    """Evaluate single string answer (used for GSM8K)."""

def eval_ocwcourses(item: dict, pred_key: str = 'prediction', prec: float = 1e-3) -> int:
    """Evaluate OCW Courses with numeric/equation/expression type detection."""

def eval_agieval_gaokao_math_cloze(item: dict, pred_key: str = 'prediction', prec: float = 1e-3) -> bool:
    """Evaluate AGIEval Gaokao math cloze with multi-answer parsing."""

def eval_agieval_gaokao_mathqa(item: dict, pred_key: str = 'prediction', prec: float = 1e-3) -> bool:
    """Evaluate AGIEval Gaokao math QA (multiple choice A/B/C/D)."""

def eval_math_sat(item: dict, pred_key: str = 'prediction', prec: float = 1e-3) -> bool:
    """Evaluate SAT math (case-insensitive string match)."""

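The docstrings above mention deduplication and list answer support. One way such a check could work is sketched below. This is not the repository's code: `same_answer`, `dedupe`, and `lists_match` are hypothetical names, the two-way matching rule is an assumption, and a plain string comparison stands in for `math_equal`.

```python
from typing import List


def same_answer(pred: str, ans: str) -> bool:
    """Simplified stand-in for math_equal: whitespace-insensitive exact match."""
    return pred.strip() == ans.strip()


def dedupe(values: List[str]) -> List[str]:
    """Drop repeated entries while preserving first-seen order."""
    seen: List[str] = []
    for value in values:
        if value not in seen:
            seen.append(value)
    return seen


def lists_match(pred: List[str], ans: List[str]) -> bool:
    """True when every prediction matches some answer and every answer is matched."""
    pred, ans = dedupe(pred), dedupe(ans)
    pred_matched, ans_matched = set(), set()
    for i, p in enumerate(pred):
        for j, a in enumerate(ans):
            if same_answer(p, a):
                pred_matched.add(i)
                ans_matched.add(j)
    return len(pred_matched) == len(pred) and len(ans_matched) == len(ans)
```

Under these assumptions, a prediction list that repeats an answer or lists the answers in a different order still counts as correct, while a missing or extra answer does not.
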
Import

from data.deepseek_math_utils.eval_script import eval_math, eval_last_single_answer, eval_ocwcourses

I/O Contract

Inputs

Name      Type   Required  Description
item      dict   Yes       Dict with "prediction" and "answer" keys
pred_key  str    No        Key for prediction in item dict (default "prediction")
prec      float  No        Numerical precision tolerance (default 1e-3)

Outputs

Name    Type                   Description
result  bool (or int for OCW)  True/1 if prediction matches answer
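
To make the roles of pred_key and prec concrete, here is a minimal sketch of a numeric comparison that honors both parameters. The helper `within_tolerance` and the "program_output" field in the sample item are hypothetical and exist only for this illustration; the module's real functions also handle symbolic and multi-part answers, which this sketch ignores.

```python
def within_tolerance(item: dict, pred_key: str = "prediction", prec: float = 1e-3) -> bool:
    """Hypothetical helper: compare item[pred_key] with item["answer"] as numbers."""
    try:
        # Strip thousands separators before parsing, e.g. "1,000" -> 1000.0.
        pred = float(str(item[pred_key]).replace(",", ""))
        ans = float(str(item["answer"]).replace(",", ""))
    except ValueError:
        # Non-numeric strings are out of scope for this sketch.
        return False
    return abs(pred - ans) < prec


# Sample item with two candidate outputs stored under different keys.
item = {"prediction": "0.3334", "program_output": "0.34", "answer": "0.3333"}

default_check = within_tolerance(item)                                    # difference of about 0.0001
strict_alt = within_tolerance(item, pred_key="program_output")            # difference of about 0.0067
loose_alt = within_tolerance(item, pred_key="program_output", prec=1e-2)  # wider tolerance
```

With the default tolerance the first output passes and the alternative one fails; widening prec to 1e-2 accepts the alternative as well.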

Usage Examples

from data.deepseek_math_utils.eval_script import eval_math, eval_last_single_answer

# Evaluate MATH benchmark
result = eval_math({"prediction": ["\\frac{1}{2}"], "answer": ["0.5"]})
# result = True

# Evaluate GSM8K
result = eval_last_single_answer({"prediction": "42", "answer": "42"})
# result = True
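
Per-item results are usually aggregated into a benchmark accuracy. The sketch below shows one way to do that while treating bool and int results uniformly, since the OCW evaluator is declared to return an int. The `accuracy` helper and the `exact_match` stand-in evaluator are assumptions for this example, not part of eval_script.py.

```python
from typing import Callable, Iterable, Union


def accuracy(items: Iterable[dict], eval_fn: Callable[[dict], Union[bool, int]]) -> float:
    """Fraction of items judged correct; bool and int results are coerced with bool()."""
    results = [bool(eval_fn(item)) for item in items]
    return sum(results) / len(results) if results else 0.0


# Hypothetical stand-in evaluator that returns 0/1, mirroring an int-returning function.
def exact_match(item: dict) -> int:
    return int(item["prediction"] == item["answer"])


items = [
    {"prediction": "42", "answer": "42"},
    {"prediction": "7", "answer": "8"},
]
score = accuracy(items, exact_match)  # one of two items is correct
```

Returning 0.0 for an empty input avoids a division by zero when a benchmark split yields no items.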

Related Pages
