Implementation:SqueezeAILab ETS Grade Answer
| Knowledge Sources | Value |
|---|---|
| Domains | Evaluation, Mathematical_Reasoning, String_Normalization |
| Last Updated | 2026-02-14 02:00 GMT |
Overview
A concrete tool for determining whether a model's mathematical answer matches the ground truth, via two-stage normalization and string comparison; provided by grader.py.
Description
The grade_answer() function implements the full grading pipeline:
- Apply Hendrycks MATH normalization via math_normalize.normalize_answer()
- Compare the normalized strings (early return if equal)
- Apply aggressive normalization via _normalize()
- Compare the aggressively normalized strings
- For tuple/interval answers: split via split_tuple() and compare element-wise
- For fractions: require exact match (no simplification)
- A sympy symbolic equality check is defined (are_equal_under_sympy) but currently disabled (returns False)
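The control flow above can be sketched as follows. Note that the two normalizer bodies here are simplified stand-ins, not the real math_normalize.normalize_answer() and _normalize() (which do LaTeX parsing, unit removal, and more); only the two-stage, early-return structure mirrors the description.

```python
# Simplified sketch of the grade_answer() control flow described above.
# The normalizers are toy stand-ins for the real ones in grader.py.

def hendrycks_normalize(ans):
    """Stand-in for stage 1: light string cleanup."""
    if ans is None:
        return None
    return ans.strip().rstrip(".")

def aggressive_normalize(ans):
    """Stand-in for stage 2: lowercase, drop spaces and a few unit words."""
    if ans is None:
        return None
    ans = ans.lower().replace(" ", "")
    for unit in ("meters", "meter", "degrees"):
        ans = ans.replace(unit, "")
    return ans

def grade_answer_sketch(given_answer, ground_truth):
    if given_answer is None:
        return False
    # Stage 1: Hendrycks MATH normalization, early return on match
    if hendrycks_normalize(given_answer) == hendrycks_normalize(ground_truth):
        return True
    # Stage 2: aggressive normalization, then final string comparison
    return aggressive_normalize(given_answer) == aggressive_normalize(ground_truth)

print(grade_answer_sketch("5 meters", "5"))  # True
print(grade_answer_sketch("43", "42"))       # False
print(grade_answer_sketch(None, "42"))       # False
```

The early return after stage 1 matters: most correct answers match under the cheap normalization, so the aggressive pass only runs for near-misses.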
Usage
Called from evaluate() and majority_vote() in math_evaluate.py; majority_vote() uses it to determine answer equivalence classes.
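A hedged sketch of how a majority vote can bucket candidate answers into equivalence classes with such a comparator. The comparator here is a trivial stand-in for grade_answer(), and the vote-counting details are assumptions, not the actual majority_vote() implementation:

```python
# Sketch of majority voting over equivalence classes of answers.
# `answers_equal` is a stand-in; the real code would call grade_answer(a, b).

def answers_equal(a, b):
    return a == b  # stand-in for grade_answer(a, b)

def majority_vote_sketch(candidate_answers):
    """Group candidates into equivalence classes; return the largest class's representative."""
    classes = []  # list of [representative, count]
    for ans in candidate_answers:
        for cls in classes:
            if answers_equal(ans, cls[0]):
                cls[1] += 1
                break
        else:
            classes.append([ans, 1])
    # Pick the representative of the most populous class
    return max(classes, key=lambda c: c[1])[0]

print(majority_vote_sketch(["42", "43", "42", "42", "7"]))  # 42
```

Using grade_answer() rather than plain string equality as the comparator lets differently formatted but equivalent answers (e.g. "1/2" and "\frac{1}{2}") fall into the same voting bucket.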
Code Reference
Source Location
- Repository: ETS
- File: evaluate/evaluate_utils/grader.py
- Lines: 235-289 (grade_answer), 106-176 (_normalize), 8-19 (normalize_answer in math_normalize.py)
Signature
def grade_answer(given_answer: str, ground_truth: str) -> bool:
    """
    Determine if a given answer matches the ground truth.

    Two-stage normalization:
    1. Hendrycks MATH normalization (math_normalize.normalize_answer)
    2. Aggressive normalization (_normalize) with LaTeX parsing, unit removal, etc.

    Special cases:
    - Tuple/interval answers: split and compare element-wise
    - Fractions: require exact match
    - Sympy equality: defined but currently disabled

    Args:
        given_answer (str): Model's extracted answer
        ground_truth (str): Reference answer from dataset

    Returns:
        bool: True if answers are equivalent, False otherwise
    """
Import
from evaluate.evaluate_utils.grader import grade_answer
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| given_answer | str | Yes | Model's extracted answer string |
| ground_truth | str | Yes | Reference answer from dataset |
Outputs
| Name | Type | Description |
|---|---|---|
| is_correct | bool | True if the given answer matches ground truth after normalization |
Usage Examples
Basic Grading
from evaluate.evaluate_utils.grader import grade_answer
# Direct match
print(grade_answer("42", "42")) # True
# Decimal vs. fraction form: NOT equivalent here, because the sympy
# symbolic check is disabled and fractions require an exact match
print(grade_answer("\\frac{1}{2}", "0.5")) # False
# Different formats
print(grade_answer("1/2", "\\frac{1}{2}")) # True (_fix_a_slash_b normalizes)
# Unit removal
print(grade_answer("5 meters", "5")) # True (aggressive normalization removes units)
# Incorrect answer
print(grade_answer("43", "42")) # False
# None handling
print(grade_answer(None, "42")) # False
Tuple Comparison
# Tuple answers compared element-wise
print(grade_answer("(1, 2)", "(1, 2)")) # True
print(grade_answer("(2, 1)", "(1, 2)")) # False (order matters)
print(grade_answer("[1, 2]", "[1, 2]")) # True (brackets preserved)
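The element-wise behavior above can be sketched like this. The splitting logic is an approximation of what split_tuple() presumably does, not its actual implementation; in particular, the bracket-type check reflects the "brackets preserved" behavior shown above:

```python
# Sketch of element-wise tuple/interval comparison.
# `split_tuple_sketch` approximates grader.split_tuple(): it separates the
# outer brackets from the comma-separated elements.

def split_tuple_sketch(ans):
    """Return (open_bracket, elements, close_bracket) for a tuple/interval string."""
    ans = ans.strip()
    if ans and ans[0] in "([" and ans[-1] in ")]":
        return ans[0], [e.strip() for e in ans[1:-1].split(",")], ans[-1]
    return "", [ans], ""

def tuples_equal(given, truth):
    g_open, g_elems, g_close = split_tuple_sketch(given)
    t_open, t_elems, t_close = split_tuple_sketch(truth)
    # Bracket type is preserved: "(1, 2)" and "[1, 2]" do not match,
    # which matters for open vs. closed intervals
    return (g_open, g_close) == (t_open, t_close) and g_elems == t_elems

print(tuples_equal("(1, 2)", "(1, 2)"))  # True
print(tuples_equal("(2, 1)", "(1, 2)"))  # False (order matters)
print(tuples_equal("(1, 2)", "[1, 2]"))  # False (bracket type differs)
```

In the real grader, each element would itself be run through the normalization pipeline before comparison, so "(0.5, 1)" and "( .5 , 1 )" could still match.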
Used in Evaluation Pipeline
# In evaluate() function
for qapair in data:
answer = extract_function(best_candidate["text"])
ground_truth = qapair["ground_truth_answer"]
if grade_answer(answer, ground_truth):
num_correct += 1