Implementation:SqueezeAILab ETS Grade Answer
| Knowledge Sources | Value |
|---|---|
| Domains | Evaluation, Mathematical_Reasoning, String_Normalization |
| Last Updated | 2026-02-14 02:00 GMT |
Overview
A concrete tool for determining whether a model's mathematical answer matches the ground truth, via two-stage normalization and string comparison; provided by grader.py.
Description
The grade_answer() function implements the full grading pipeline:
- Apply Hendrycks MATH normalization via math_normalize.normalize_answer()
- Compare the normalized strings (early return if equal)
- Apply aggressive normalization via _normalize()
- Compare the aggressively normalized strings
- For tuple/interval answers: split via split_tuple() and compare element-wise
- For fractions: require exact match (no simplification)
- A sympy symbolic equality check is defined (are_equal_under_sympy) but currently disabled (returns False)
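The control flow above can be sketched as follows. Note that the two normalizer bodies here are simplified stand-ins, not the real math_normalize.normalize_answer() and _normalize() (which do LaTeX parsing, unit removal, and more); only the two-stage, early-return structure mirrors the description.

```python
# Simplified sketch of the grade_answer() control flow described above.
# The normalizers are toy stand-ins for the real ones in grader.py.

def hendrycks_normalize(ans):
    """Stand-in for stage 1: light string cleanup."""
    if ans is None:
        return None
    return ans.strip().rstrip(".")

def aggressive_normalize(ans):
    """Stand-in for stage 2: lowercase, drop spaces and a few unit words."""
    if ans is None:
        return None
    ans = ans.lower().replace(" ", "")
    for unit in ("meters", "meter", "degrees"):
        ans = ans.replace(unit, "")
    return ans

def grade_answer_sketch(given_answer, ground_truth):
    if given_answer is None:
        return False
    # Stage 1: Hendrycks MATH normalization, early return on match
    if hendrycks_normalize(given_answer) == hendrycks_normalize(ground_truth):
        return True
    # Stage 2: aggressive normalization, then final string comparison
    return aggressive_normalize(given_answer) == aggressive_normalize(ground_truth)

print(grade_answer_sketch("5 meters", "5"))  # True
print(grade_answer_sketch("43", "42"))       # False
print(grade_answer_sketch(None, "42"))       # False
```

The early return after stage 1 matters: most correct answers match under the cheap normalization, so the aggressive pass only runs for near-misses.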
Usage
Called from evaluate() and majority_vote() in math_evaluate.py; majority_vote() uses it to determine answer equivalence classes.
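A hedged sketch of how a majority vote can bucket candidate answers into equivalence classes with such a comparator. The comparator here is a trivial stand-in for grade_answer(), and the vote-counting details are assumptions, not the actual majority_vote() implementation:

```python
# Sketch of majority voting over equivalence classes of answers.
# `answers_equal` is a stand-in; the real code would call grade_answer(a, b).

def answers_equal(a, b):
    return a == b  # stand-in for grade_answer(a, b)

def majority_vote_sketch(candidate_answers):
    """Group candidates into equivalence classes; return the largest class's representative."""
    classes = []  # list of [representative, count]
    for ans in candidate_answers:
        for cls in classes:
            if answers_equal(ans, cls[0]):
                cls[1] += 1
                break
        else:
            classes.append([ans, 1])
    # Pick the representative of the most populous class
    return max(classes, key=lambda c: c[1])[0]

print(majority_vote_sketch(["42", "43", "42", "42", "7"]))  # 42
```

Using grade_answer() rather than plain string equality as the comparator lets differently formatted but equivalent answers (e.g. "1/2" and "\frac{1}{2}") fall into the same voting bucket.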
Code Reference
Source Location
- Repository: ETS
- File: evaluate/evaluate_utils/grader.py
- Lines: 235-289 (grade_answer), 106-176 (_normalize), 8-19 (normalize_answer in math_normalize.py)
Signature
def grade_answer(given_answer: str, ground_truth: str) -> bool:
    """
    Determine if a given answer matches the ground truth.

    Two-stage normalization:
    1. Hendrycks MATH normalization (math_normalize.normalize_answer)
    2. Aggressive normalization (_normalize) with LaTeX parsing, unit removal, etc.

    Special cases:
    - Tuple/interval answers: split and compare element-wise
    - Fractions: require exact match
    - Sympy equality: defined but currently disabled

    Args:
        given_answer (str): Model's extracted answer
        ground_truth (str): Reference answer from dataset

    Returns:
        bool: True if answers are equivalent, False otherwise
    """
Import
from evaluate.evaluate_utils.grader import grade_answer
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| given_answer | str | Yes | Model's extracted answer string |
| ground_truth | str | Yes | Reference answer from dataset |
Outputs
| Name | Type | Description |
|---|---|---|
| is_correct | bool | True if the given answer matches ground truth after normalization |
Usage Examples
Basic Grading
from evaluate.evaluate_utils.grader import grade_answer
# Direct match
print(grade_answer("42", "42")) # True
# Decimal vs. fraction form: NOT equivalent here, because the sympy
# symbolic check is disabled and fractions require an exact match
print(grade_answer("\\frac{1}{2}", "0.5")) # False
# Different formats
print(grade_answer("1/2", "\\frac{1}{2}")) # True (_fix_a_slash_b normalizes)
# Unit removal
print(grade_answer("5 meters", "5")) # True (aggressive normalization removes units)
# Incorrect answer
print(grade_answer("43", "42")) # False
# None handling
print(grade_answer(None, "42")) # False
Tuple Comparison
# Tuple answers compared element-wise
print(grade_answer("(1, 2)", "(1, 2)")) # True
print(grade_answer("(2, 1)", "(1, 2)")) # False (order matters)
print(grade_answer("[1, 2]", "[1, 2]")) # True (brackets preserved)
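The element-wise behavior above can be sketched like this. The splitting logic is an approximation of what split_tuple() presumably does, not its actual implementation; in particular, the bracket-type check reflects the "brackets preserved" behavior shown above:

```python
# Sketch of element-wise tuple/interval comparison.
# `split_tuple_sketch` approximates grader.split_tuple(): it separates the
# outer brackets from the comma-separated elements.

def split_tuple_sketch(ans):
    """Return (open_bracket, elements, close_bracket) for a tuple/interval string."""
    ans = ans.strip()
    if ans and ans[0] in "([" and ans[-1] in ")]":
        return ans[0], [e.strip() for e in ans[1:-1].split(",")], ans[-1]
    return "", [ans], ""

def tuples_equal(given, truth):
    g_open, g_elems, g_close = split_tuple_sketch(given)
    t_open, t_elems, t_close = split_tuple_sketch(truth)
    # Bracket type is preserved: "(1, 2)" and "[1, 2]" do not match,
    # which matters for open vs. closed intervals
    return (g_open, g_close) == (t_open, t_close) and g_elems == t_elems

print(tuples_equal("(1, 2)", "(1, 2)"))  # True
print(tuples_equal("(2, 1)", "(1, 2)"))  # False (order matters)
print(tuples_equal("(1, 2)", "[1, 2]"))  # False (bracket type differs)
```

In the real grader, each element would itself be run through the normalization pipeline before comparison, so "(0.5, 1)" and "( .5 , 1 )" could still match.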
Used in Evaluation Pipeline
# In evaluate() function
for qapair in data:
answer = extract_function(best_candidate["text"])
ground_truth = qapair["ground_truth_answer"]
if grade_answer(answer, ground_truth):
num_correct += 1