Implementation:Sail sg LongSpec Benchmark Eval Script

Knowledge Sources	Sail_sg_LongSpec DeepSeek Math
Domains	NLP, Evaluation, Mathematics
Last Updated	2026-02-14 05:00 GMT

Overview

Concrete tool for evaluating model predictions on at least eight math benchmarks, combining a unified correctness check with per-benchmark answer parsing.

Description

The eval_script.py module provides evaluation functions for multiple math benchmarks: MATH (eval_math), GSM8K (eval_last_single_answer), AGIEval Gaokao MathCloze (eval_agieval_gaokao_math_cloze), AGIEval Gaokao MathQA (eval_agieval_gaokao_mathqa), SAT (eval_math_sat), MMLU-STEM (eval_mmlu_stem), OCW Courses (eval_ocwcourses), and MiniF2F-Isabelle (eval_minif2f_isabelle). Each function takes an item dict that holds the model prediction (under the "prediction" key by default) and the reference under the "answer" key, and returns whether the prediction is correct. The result is a bool, except for eval_ocwcourses, whose signature declares an int.

Usage

Import the appropriate evaluation function for the target benchmark. These functions are referenced by DeepSeekMathCallBack via its eval_fns mapping to evaluate model outputs during post-processing.
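
The eval_fns mapping itself is not shown on this page. As a minimal sketch of the dispatch pattern, assuming a plain dict from benchmark name to evaluation function, the following uses a hypothetical stand-in evaluator (exact_match) and hypothetical benchmark keys in place of the real functions from eval_script.py:

```python
from typing import Callable, Dict

# Hypothetical stand-in for an evaluation function. The real functions in
# eval_script.py use math-aware comparison, not plain string equality.
def exact_match(item: dict, pred_key: str = "prediction", prec: float = 1e-3) -> bool:
    return str(item[pred_key]).strip() == str(item["answer"]).strip()

# Hypothetical mapping; the keys used by DeepSeekMathCallBack may differ.
EVAL_FNS: Dict[str, Callable[..., bool]] = {
    "math": exact_match,
    "gsm8k": exact_match,
}

def evaluate(benchmark: str, item: dict) -> bool:
    """Look up the evaluator registered for a benchmark and apply it to one item."""
    try:
        eval_fn = EVAL_FNS[benchmark]
    except KeyError:
        raise ValueError(f"no evaluation function registered for {benchmark!r}") from None
    return eval_fn(item)

print(evaluate("gsm8k", {"prediction": " 42 ", "answer": "42"}))  # True
```

Keeping the lookup in one place means an unknown benchmark name fails loudly instead of silently skipping evaluation.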

Code Reference

Source Location

Repository: Sail_sg_LongSpec
File: longspec/train/data/deepseek_math_utils/eval_script.py
Lines: 1-182

Signature

def is_correct(item: dict, pred_key: str = 'prediction', prec: float = 1e-3) -> bool:
    """Core correctness check using math_equal with list/union support."""

def eval_math(item: dict, pred_key: str = 'prediction', prec: float = 1e-3) -> bool:
    """Evaluate MATH benchmark with deduplication and list answer support."""

def eval_last_single_answer(item: dict, pred_key: str = 'prediction', prec: float = 1e-3) -> bool:
    """Evaluate single string answer (used for GSM8K)."""

def eval_ocwcourses(item: dict, pred_key: str = 'prediction', prec: float = 1e-3) -> int:
    """Evaluate OCW Courses with numeric/equation/expression type detection."""

def eval_agieval_gaokao_math_cloze(item: dict, pred_key: str = 'prediction', prec: float = 1e-3) -> bool:
    """Evaluate AGIEval Gaokao math cloze with multi-answer parsing."""

def eval_agieval_gaokao_mathqa(item: dict, pred_key: str = 'prediction', prec: float = 1e-3) -> bool:
    """Evaluate AGIEval Gaokao math QA (multiple choice A/B/C/D)."""

def eval_math_sat(item: dict, pred_key: str = 'prediction', prec: float = 1e-3) -> bool:
    """Evaluate SAT math (case-insensitive string match)."""

Import

from data.deepseek_math_utils.eval_script import eval_math, eval_last_single_answer, eval_ocwcourses

I/O Contract

Inputs

Name	Type	Required	Description
item	dict	Yes	Dict holding the prediction under pred_key (default "prediction") and the reference under "answer"
pred_key	str	No	Key for prediction in item dict (default "prediction")
prec	float	No	Numerical precision tolerance (default 1e-3)
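
To illustrate how pred_key and prec might interact, here is a minimal sketch. It assumes prec acts as an absolute tolerance on the numeric difference, which this page does not confirm. The helper toy_numeric_eval and the "program_output" key are hypothetical and only serve the illustration.

```python
def toy_numeric_eval(item: dict, pred_key: str = "prediction", prec: float = 1e-3) -> bool:
    """Hypothetical helper: compare item[pred_key] with item['answer'].

    Assumption: prec is an absolute tolerance on the numeric difference.
    Falls back to exact string comparison when either value is not numeric.
    """
    pred, ans = str(item[pred_key]), str(item["answer"])
    try:
        # Drop thousands separators before converting, e.g. "1,000" -> 1000.0
        return abs(float(pred.replace(",", "")) - float(ans.replace(",", ""))) < prec
    except ValueError:
        return pred.strip() == ans.strip()

item = {"prediction": "3.1416", "program_output": "3.2", "answer": "3.14159"}
print(toy_numeric_eval(item))                                       # True
print(toy_numeric_eval(item, pred_key="program_output"))            # False
print(toy_numeric_eval(item, pred_key="program_output", prec=0.1))  # True
```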

Outputs

Name	Type	Description
result	bool (int for eval_ocwcourses)	True (or 1) if the prediction matches the answer
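
Because most functions return a bool while the OCW signature returns an int, code that aggregates results may want to normalize both. The following is a minimal sketch of computing accuracy over a list of items; accuracy and toy_eval are hypothetical helpers, not part of eval_script.py.

```python
from typing import Callable, Iterable

def accuracy(items: Iterable[dict], eval_fn: Callable[[dict], object]) -> float:
    """Fraction of items judged correct; bool and 0/1 int results both work."""
    results = [int(bool(eval_fn(item))) for item in items]
    return sum(results) / len(results) if results else 0.0

# Hypothetical stand-in evaluator that returns an int instead of a bool.
def toy_eval(item: dict) -> int:
    return int(item["prediction"] == item["answer"])

items = [
    {"prediction": "42", "answer": "42"},
    {"prediction": "7", "answer": "8"},
    {"prediction": "x", "answer": "x"},
    {"prediction": "0.5", "answer": "0.5"},
]
print(accuracy(items, toy_eval))  # 0.75
```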

Usage Examples

from data.deepseek_math_utils.eval_script import eval_math, eval_last_single_answer

# Evaluate MATH benchmark
result = eval_math({"prediction": ["\\frac{1}{2}"], "answer": ["0.5"]})
# result = True

# Evaluate GSM8K
result = eval_last_single_answer({"prediction": "42", "answer": "42"})
# result = True

Related Pages

Environment:Sail_sg_LongSpec_Training_Environment
