
Implementation:SqueezeAILab ETS Grade Answer

From Leeroopedia
Knowledge Sources
Domains Evaluation, Mathematical_Reasoning, String_Normalization
Last Updated 2026-02-14 02:00 GMT

Overview

Concrete tool, provided by grader.py, for determining whether a mathematical answer matches the ground truth via two-stage normalization and string comparison.

Description

The grade_answer() function implements the full grading pipeline:

  1. Apply Hendrycks MATH normalization via math_normalize.normalize_answer()
  2. Compare normalized strings (early return if equal)
  3. Apply aggressive normalization via _normalize()
  4. Compare aggressively normalized strings
  5. For tuple/interval answers: split via split_tuple() and compare element-wise
  6. For fractions: require exact match (no simplification)
  7. Sympy symbolic equality check is defined (are_equal_under_sympy) but currently disabled (returns False)
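
The pipeline above can be sketched as follows. The normalizer internals here (normalize_hendrycks, normalize_aggressive, split_tuple_sketch, is_frac) are hypothetical stand-ins for illustration, not the real math_normalize/_normalize logic from grader.py:

```python
import re

def normalize_hendrycks(ans):
    # Stand-in for math_normalize.normalize_answer(): strip whitespace
    # and dollar signs, rewrite "a/b" as "\frac{a}{b}".
    if ans is None:
        return None
    ans = ans.strip().replace("$", "")
    m = re.fullmatch(r"(\d+)/(\d+)", ans)
    if m:
        ans = "\\frac{%s}{%s}" % (m.group(1), m.group(2))
    return ans

def normalize_aggressive(ans):
    # Stand-in for _normalize(): lowercase, drop a few units, drop spaces.
    ans = ans.lower()
    ans = re.sub(r"\b(meters?|cm|kg|degrees?)\b", "", ans)
    return re.sub(r"\s+", "", ans)

def split_tuple_sketch(ans):
    # Stand-in for split_tuple(): strip enclosing brackets, split on commas.
    if len(ans) > 2 and ans[0] in "([" and ans[-1] in ")]":
        return [e.strip() for e in ans[1:-1].split(",")]
    return [ans]

def is_frac(s):
    return re.fullmatch(r"-?\d+/\d+", s) is not None

def grade_answer_sketch(given, truth):
    if given is None:
        return False
    # Stage 1: Hendrycks MATH normalization, early return on match.
    if normalize_hendrycks(given) == normalize_hendrycks(truth):
        return True
    # Stage 2: aggressive normalization.
    g, t = normalize_aggressive(given), normalize_aggressive(truth)
    if g == t:
        return True
    # Tuple/interval answers: compare element-wise, order-sensitive.
    g_elems, t_elems = split_tuple_sketch(g), split_tuple_sketch(t)
    if len(g_elems) != len(t_elems):
        return False
    for ge, te in zip(g_elems, t_elems):
        if is_frac(ge) and is_frac(te):
            # Fractions must match exactly (no simplification).
            if ge != te:
                return False
        elif ge != te:
            # Sympy symbolic equality is disabled, so fall back to strings.
            return False
    return True
```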

Usage

Called from evaluate() and majority_vote() in math_evaluate.py; majority_vote() also uses it internally to group candidate answers into equivalence classes.
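
The equivalence-class grouping can be sketched as below. majority_vote_sketch is a hypothetical illustration of the idea, not the actual majority_vote() from math_evaluate.py:

```python
def majority_vote_sketch(answers, grade_fn):
    # Group answers into equivalence classes, using grade_fn as the
    # equality test (grade_answer in the real pipeline).
    classes = []  # list of (representative, members) pairs
    for ans in answers:
        for rep, members in classes:
            if grade_fn(ans, rep):
                members.append(ans)
                break
        else:
            classes.append((ans, [ans]))
    # Return the representative of the largest class.
    return max(classes, key=lambda c: len(c[1]))[0]

# Example with plain string equality standing in for grade_answer:
print(majority_vote_sketch(["42", "42", "41"], lambda a, b: a == b))  # 42
```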

Code Reference

Source Location

  • Repository: ETS
  • File: evaluate/evaluate_utils/grader.py
  • Lines: 235-289 (grade_answer), 106-176 (_normalize), 8-19 (normalize_answer in math_normalize.py)

Signature

def grade_answer(given_answer: str, ground_truth: str) -> bool:
    """
    Determine if a given answer matches the ground truth.

    Two-stage normalization:
    1. Hendrycks MATH normalization (math_normalize.normalize_answer)
    2. Aggressive normalization (_normalize) with LaTeX parsing, unit removal, etc.

    Special cases:
    - Tuple/interval answers: split and compare element-wise
    - Fractions: require exact match
    - Sympy equality: defined but currently disabled

    Args:
        given_answer (str): Model's extracted answer
        ground_truth (str): Reference answer from dataset

    Returns:
        bool: True if answers are equivalent, False otherwise
    """

Import

from evaluate.evaluate_utils.grader import grade_answer

I/O Contract

Inputs

  • given_answer (str, required): Model's extracted answer string
  • ground_truth (str, required): Reference answer from dataset

Outputs

  • is_correct (bool): True if the given answer matches ground truth after normalization

Usage Examples

Basic Grading

from evaluate.evaluate_utils.grader import grade_answer

# Direct match
print(grade_answer("42", "42"))  # True

# LaTeX fraction vs. decimal
print(grade_answer("\\frac{1}{2}", "0.5"))  # False (equivalence would need sympy, which is disabled)

# Different formats
print(grade_answer("1/2", "\\frac{1}{2}"))  # True (_fix_a_slash_b normalizes)

# Unit removal
print(grade_answer("5 meters", "5"))  # True (aggressive normalization removes units)

# Incorrect answer
print(grade_answer("43", "42"))  # False

# None handling
print(grade_answer(None, "42"))  # False

Tuple Comparison

# Tuple answers compared element-wise
print(grade_answer("(1, 2)", "(1, 2)"))  # True
print(grade_answer("(2, 1)", "(1, 2)"))  # False (order matters)
print(grade_answer("[1, 2]", "[1, 2]"))  # True (brackets preserved)

Used in Evaluation Pipeline

# In evaluate() function
for qapair in data:
    answer = extract_function(best_candidate["text"])
    ground_truth = qapair["ground_truth_answer"]
    if grade_answer(answer, ground_truth):
        num_correct += 1

Related Pages

Implements Principle

Requires Environment
