Implementation:Sail sg LongSpec Benchmark Eval Script
| Knowledge Sources | |
|---|---|
| Domains | NLP, Evaluation, Mathematics |
| Last Updated | 2026-02-14 05:00 GMT |
Overview
Concrete tool for evaluating model predictions on eight math benchmarks, combining a shared correctness check with per-benchmark answer parsing.
Description
The eval_script.py module provides one evaluation function per math benchmark:
- MATH: eval_math
- GSM8K: eval_last_single_answer
- AGIEval Gaokao MathCloze: eval_agieval_gaokao_math_cloze
- AGIEval Gaokao MathQA: eval_agieval_gaokao_mathqa
- SAT: eval_math_sat
- MMLU-STEM: eval_mmlu_stem
- OCW Courses: eval_ocwcourses
- MiniF2F-Isabelle: eval_minif2f_isabelle
Each function takes a prediction item dict with "prediction" and "answer" keys and returns a boolean correctness result. The exception is eval_ocwcourses, which is annotated as returning an int.
Usage
Import the appropriate evaluation function for the target benchmark. These functions are referenced by DeepSeekMathCallBack via its eval_fns mapping to evaluate model outputs during post-processing.
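The dispatch pattern described above can be sketched as a mapping from benchmark name to evaluation function. This is a minimal illustration, not the DeepSeekMathCallBack code: the mapping keys, the `EVAL_FNS` name, the `evaluate` helper, and both stand-in evaluators are assumptions made for the example, so that it runs without the repository installed.

```python
from typing import Callable, Dict


# Stand-in evaluators that share the documented call shape
# (item, pred_key, prec). They are NOT the functions from eval_script.py.
def eval_exact(item: dict, pred_key: str = "prediction", prec: float = 1e-3) -> bool:
    """Exact string match after trimming whitespace."""
    return str(item[pred_key]).strip() == str(item["answer"]).strip()


def eval_case_insensitive(item: dict, pred_key: str = "prediction", prec: float = 1e-3) -> bool:
    """Case-insensitive string match after trimming whitespace."""
    return str(item[pred_key]).strip().lower() == str(item["answer"]).strip().lower()


# Hypothetical benchmark-name -> evaluator mapping; the real keys may differ.
EVAL_FNS: Dict[str, Callable[..., bool]] = {
    "gsm8k": eval_exact,
    "sat": eval_case_insensitive,
}


def evaluate(benchmark: str, item: dict) -> bool:
    """Look up the evaluator registered for a benchmark and apply it to one item."""
    try:
        fn = EVAL_FNS[benchmark]
    except KeyError:
        raise ValueError(f"no evaluator registered for {benchmark!r}") from None
    # Coerce to bool so int-returning evaluators behave the same way.
    return bool(fn(item))
```

Keeping the lookup in one helper means an unknown benchmark name fails loudly instead of silently skipping evaluation.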
Code Reference
Source Location
- Repository: Sail_sg_LongSpec
- File: longspec/train/data/deepseek_math_utils/eval_script.py
- Lines: 1-182
Signature
def is_correct(item: dict, pred_key: str = 'prediction', prec: float = 1e-3) -> bool:
    """Core correctness check using math_equal with list/union support."""
def eval_math(item: dict, pred_key: str = 'prediction', prec: float = 1e-3) -> bool:
    """Evaluate MATH benchmark with deduplication and list answer support."""
def eval_last_single_answer(item: dict, pred_key: str = 'prediction', prec: float = 1e-3) -> bool:
    """Evaluate single string answer (used for GSM8K)."""
def eval_ocwcourses(item: dict, pred_key: str = 'prediction', prec: float = 1e-3) -> int:
    """Evaluate OCW Courses with numeric/equation/expression type detection."""
def eval_agieval_gaokao_math_cloze(item: dict, pred_key: str = 'prediction', prec: float = 1e-3) -> bool:
    """Evaluate AGIEval Gaokao math cloze with multi-answer parsing."""
def eval_agieval_gaokao_mathqa(item: dict, pred_key: str = 'prediction', prec: float = 1e-3) -> bool:
    """Evaluate AGIEval Gaokao math QA (multiple choice A/B/C/D)."""
def eval_math_sat(item: dict, pred_key: str = 'prediction', prec: float = 1e-3) -> bool:
    """Evaluate SAT math (case-insensitive string match)."""
Import
from data.deepseek_math_utils.eval_script import eval_math, eval_last_single_answer, eval_ocwcourses
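To make the "list/union support" and the `prec` tolerance in the signatures above more concrete, here is a simplified sketch of such a check. It is an illustration under stated assumptions, not the module's implementation: the name `sketch_is_correct` is hypothetical, and the symbolic comparison that `math_equal` performs is replaced here by a plain string comparison.

```python
def _to_float(text: str) -> float:
    """Parse a number, dropping thousands separators such as '1,000'."""
    return float(text.replace(",", ""))


def sketch_is_correct(pred, ans, prec: float = 1e-3) -> bool:
    """Simplified stand-in for a correctness check with list and union support."""
    if isinstance(pred, list) and isinstance(ans, list):
        # Every prediction must match some answer and every answer must be
        # matched by some prediction; order does not matter.
        pred_hit, ans_hit = set(), set()
        for i, p in enumerate(pred):
            for j, a in enumerate(ans):
                if sketch_is_correct(p, a, prec):
                    pred_hit.add(i)
                    ans_hit.add(j)
        return len(pred_hit) == len(pred) and len(ans_hit) == len(ans)

    if isinstance(pred, str) and isinstance(ans, str):
        # Treat a union of intervals as a list of its parts.
        if "\\cup" in pred and "\\cup" in ans:
            pred_parts = [s.strip() for s in pred.split("\\cup")]
            ans_parts = [s.strip() for s in ans.split("\\cup")]
            return sketch_is_correct(pred_parts, ans_parts, prec)
        # Numeric comparison within the tolerance, when both sides parse.
        try:
            if abs(_to_float(pred) - _to_float(ans)) < prec:
                return True
        except ValueError:
            pass
        # Assumption: exact string match stands in for symbolic equality.
        return pred.strip() == ans.strip()

    raise TypeError("pred and ans must both be str or both be list")
```

Because the symbolic step is omitted, this sketch would reject equivalent forms such as a fraction and its decimal expansion; it only shows how tolerance, list matching, and union splitting can fit together.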
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| item | dict | Yes | Dict holding the model prediction under pred_key ("prediction" by default) and the reference under "answer" |
| pred_key | str | No | Key for prediction in item dict (default "prediction") |
| prec | float | No | Numerical precision tolerance (default 1e-3) |
Outputs
| Name | Type | Description |
|---|---|---|
| result | bool (int for OCW) | True (1 for OCW) if the prediction matches the answer, otherwise False (0) |
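Since the contract above allows either a bool or an int result, a caller that aggregates scores may want to normalize both. The following sketch shows one way to compute accuracy over a list of items while passing `pred_key` and `prec` through. The `accuracy` helper, the `toy_numeric_eval` evaluator, and the `"model_output"` key are hypothetical names introduced for this example only.

```python
from typing import Callable, Iterable


def accuracy(
    items: Iterable[dict],
    eval_fn: Callable[..., object],
    pred_key: str = "prediction",
    prec: float = 1e-3,
) -> float:
    """Fraction of items judged correct; bool() normalizes True/False and 1/0."""
    results = [bool(eval_fn(item, pred_key=pred_key, prec=prec)) for item in items]
    return sum(results) / len(results) if results else 0.0


def toy_numeric_eval(item: dict, pred_key: str = "prediction", prec: float = 1e-3) -> int:
    """Hypothetical int-returning evaluator: 1 if the values differ by less than prec."""
    return int(abs(float(item[pred_key]) - float(item["answer"])) < prec)


# The prediction is stored under a non-default key, so pred_key is passed explicitly.
samples = [
    {"model_output": "3.1416", "answer": "3.14159"},
    {"model_output": "2.0", "answer": "2.5"},
]
score = accuracy(samples, toy_numeric_eval, pred_key="model_output")
```

With the default tolerance only the first sample counts as correct; loosening `prec` to 1.0 would accept both.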
Usage Examples
from data.deepseek_math_utils.eval_script import eval_math, eval_last_single_answer
# Evaluate MATH benchmark
result = eval_math({"prediction": ["\\frac{1}{2}"], "answer": ["0.5"]})
# result = True
# Evaluate GSM8K
result = eval_last_single_answer({"prediction": "42", "answer": "42"})
# result = True