Workflow:SqueezeAILab ETS Answer Evaluation Pipeline
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Evaluation, Math_Reasoning |
| Last Updated | 2025-02-14 00:00 GMT |
Overview
End-to-end process for evaluating tree search outputs by extracting candidate answers, aggregating per-step scores, grading against ground truth, and reporting accuracy on math benchmarks.
Description
This workflow takes the raw JSON output from ETS tree search and computes accuracy metrics. It supports two selection strategies: best-of-n (pick the single trajectory with the highest aggregated score) and weighted majority voting (group equivalent answers, weight each group by the sum of its trajectories' aggregated PRM scores, and pick the answer with the highest total weight). The grading system uses multi-stage normalization to handle varied answer formats, including LaTeX, boxed expressions, fractions, and plain numbers.
Goal: Produce an accuracy score (fraction of correctly solved problems) from the tree search output.
Scope: Covers answer extraction from model-generated text, score aggregation across solution steps, answer grading against ground truth, and accuracy reporting.
Strategy: Uses a pipeline of answer extraction (regex-based with multiple fallback patterns), string normalization (a MATH-compatible pass followed by a more aggressive pass), and optional symbolic equality checking. Supports configurable score aggregation functions (last, mean, prod, min) for scoring trajectories.
Usage
Execute this workflow after completing the ETS Experiment Pipeline to compute accuracy on the benchmark. The only input is the answers.json file written by tree search, which also embeds the ground truth answers. No GPU is required for evaluation.
Execution Steps
Step 1: Select aggregation strategy
Choose how to select the final answer from the set of candidate trajectories produced by tree search. The two main strategies are best-of-n (select the trajectory with the highest aggregated step score) and weighted majority voting (group equivalent answers and weight each vote by its aggregated score). Also choose the score aggregation function that combines per-step PRM scores into a single trajectory score.
Available aggregation functions:
- last: Use only the final step's PRM score
- mean: Average all step scores
- prod: Multiply all step scores together
- min: Use the minimum step score across the trajectory
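The four aggregation functions above can be sketched as follows. This is a minimal illustration, not the repository's code: the `AGGREGATORS` registry and the `aggregate` helper are hypothetical names, and only the option names (last, mean, prod, min) come from the list above.

```python
import math
from typing import Callable, Dict, Sequence

# Hypothetical registry; the keys mirror the option names listed above.
AGGREGATORS: Dict[str, Callable[[Sequence[float]], float]] = {
    "last": lambda scores: scores[-1],                 # final step only
    "mean": lambda scores: sum(scores) / len(scores),  # average over steps
    "prod": lambda scores: math.prod(scores),          # product over steps
    "min": lambda scores: min(scores),                 # weakest step
}


def aggregate(step_scores: Sequence[float], how: str = "last") -> float:
    """Collapse per-step PRM scores into a single trajectory score."""
    if not step_scores:
        raise ValueError("trajectory has no step scores")
    if how not in AGGREGATORS:
        raise ValueError(f"unknown aggregation function: {how!r}")
    return AGGREGATORS[how](step_scores)


scores = [0.9, 0.5, 0.8]
for name in AGGREGATORS:
    print(name, round(aggregate(scores, name), 4))
```

For the scores `[0.9, 0.5, 0.8]`, `prod` and `min` both penalize the weak middle step, while `last` ignores it entirely, so the choice of function can change which trajectory ranks first.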
Step 2: Extract answers from trajectories
Parse the model-generated text of each candidate trajectory to extract the final answer. The extraction system tries a cascade of pattern matchers in order and returns the first match: "final answer is $...$" patterns, then LaTeX boxed expressions, then "the answer is" phrases, then program output blocks, and finally the last numeric value in the text. Extracted answers are stripped of surrounding whitespace and normalized.
Key patterns matched:
- "final answer is $X$. I hope" (instruction-tuned Llama format)
- \boxed{X} (LaTeX boxed answers)
- "The answer is: X ки" (Shepherd/Llemma step format)
- Last numeric value as ultimate fallback
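A cascade over the patterns listed above might look like the sketch below. The function names and the exact regular expressions are assumptions made for illustration, and the program-output matcher is omitted because its format is not described here.

```python
import re
from typing import Optional


def last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...}, honoring nested braces."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    i = start + len("\\boxed{")
    depth, out = 1, []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
        i += 1
    return None  # unbalanced braces


def extract_answer(text: str) -> Optional[str]:
    """Try each pattern in priority order and return the first hit."""
    # 1. "final answer is $X$. I hope ..." (instruction-tuned format)
    m = re.search(r"final answer is \$(.*?)\$\. I hope", text, flags=re.DOTALL)
    if m:
        return m.group(1).strip()
    # 2. \boxed{X}
    boxed = last_boxed(text)
    if boxed is not None:
        return boxed.strip()
    # 3. "The answer is: X ки" (step-tagged format); the tag is optional here
    m = re.search(r"[Tt]he answer is:?\s*(.+?)\s*(?:ки|$)", text, flags=re.DOTALL)
    if m:
        return m.group(1).strip().rstrip(".")
    # 4. Ultimate fallback: last numeric value (thousands separators dropped)
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else None


for sample in [
    "The final answer is $\\frac{1}{2}$. I hope it is correct.",
    "So we get \\boxed{\\frac{3}{4}}.",
    "Step 3: The answer is: 42 ки",
]:
    print(extract_answer(sample))
```

The boxed matcher counts braces instead of using a regex so that nested expressions such as `\boxed{\frac{3}{4}}` are returned whole.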
Step 3: Normalize and grade answers
Compare each extracted answer against the ground truth using a two-stage normalization pipeline. The first stage applies normalization compatible with the MATH benchmark (Hendrycks et al.): it strips LaTeX formatting, fixes fractions, and removes units and dollar signs. If the normalized strings match, the answer is correct. Otherwise, a second, more aggressive pass handles mixed numbers, implicit multiplication, case insensitivity, and tuple decomposition before the answers are compared again.
Normalization operations:
- Strip LaTeX commands (\text{}, \frac, \sqrt, etc.)
- Remove units, currency symbols, and percentage signs
- Convert fractions and mixed numbers to canonical form
- Lowercase all text for case-insensitive comparison
- Parse tuples and intervals for element-wise comparison
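The two-stage flow can be illustrated with a heavily simplified grader. Everything below is a sketch under stated assumptions: `normalize_basic` is a small stand-in, not the MATH-compatible normalizer, the helper names are hypothetical, and implicit multiplication and symbolic equality are left out.

```python
import re
from fractions import Fraction
from typing import List, Optional


def normalize_basic(ans: str) -> str:
    """Stage 1 stand-in: drop trailing units, LaTeX wrappers, $ and %, spaces."""
    s = ans.strip()
    s = re.sub(r"\\text\{\s+[^{}]*\}\s*$", "", s)   # trailing "\text{ cm}" unit
    s = re.sub(r"\\text\{([^{}]*)\}", r"\1", s)      # unwrap remaining \text{}
    s = s.replace("\\left", "").replace("\\right", "")
    s = s.replace("\\dfrac", "\\frac").replace("\\tfrac", "\\frac")
    s = s.replace("\\$", "").replace("$", "").replace("\\%", "").replace("%", "")
    return s.replace(" ", "")


def normalize_aggressive(ans: str) -> str:
    """Stage 2 stand-in: lowercase and rewrite \\frac{a}{b} as a/b."""
    s = ans.strip().lower()
    s = s.replace("\\left", "").replace("\\right", "")
    s = s.replace("\\dfrac", "\\frac").replace("\\tfrac", "\\frac")
    s = re.sub(r"\\text\{([^{}]*)\}", r"\1", s)
    s = re.sub(r"\\frac\{([^{}]+)\}\{([^{}]+)\}", r"\1/\2", s)
    s = s.replace("\\$", "").replace("$", "").replace("\\%", "").replace("%", "")
    return re.sub(r"\s+", " ", s).strip()


def to_number(s: str) -> Optional[Fraction]:
    """Parse '3', '0.5', '1/2' or a mixed number like '1 1/2'; None otherwise."""
    s = s.strip().replace(",", "")
    m = re.fullmatch(r"(-?\d+)\s+(\d+)/(\d+)", s)
    try:
        if m:  # mixed number: whole part plus fraction
            whole, num, den = (int(g) for g in m.groups())
            sign = -1 if whole < 0 else 1
            return whole + sign * Fraction(num, den)
        return Fraction(s.replace(" ", ""))
    except (ValueError, ZeroDivisionError):
        return None


def elements(s: str) -> List[str]:
    """Split '(a, b)' or '[a, b]' into bracket type plus elements."""
    if len(s) >= 2 and s[0] in "([" and s[-1] in ")]":
        return [s[0] + s[-1]] + [part.strip() for part in s[1:-1].split(",")]
    return [s]


def same_element(a: str, b: str) -> bool:
    na, nb = to_number(a), to_number(b)
    if na is not None and nb is not None:
        return na == nb  # exact rational comparison
    return a.replace(" ", "") == b.replace(" ", "")


def grade(pred: str, truth: str) -> bool:
    # Stage 1: cheap string comparison after basic normalization.
    if normalize_basic(pred) == normalize_basic(truth):
        return True
    # Stage 2: aggressive normalization with element-wise comparison.
    ep = elements(normalize_aggressive(pred))
    et = elements(normalize_aggressive(truth))
    return len(ep) == len(et) and all(same_element(a, b) for a, b in zip(ep, et))


print(grade("1 1/2", "\\frac{3}{2}"))  # mixed number vs. LaTeX fraction
print(grade("(1, 2)", "[1, 2]"))       # bracket types differ
```

Keeping the bracket pair as the first element means an open interval and a closed interval with the same endpoints are not treated as equal.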
Step 4: Aggregate and report accuracy
For best-of-n: select the trajectory with the highest aggregated score per problem, extract and grade its answer, and count correct answers across the dataset. For weighted majority voting: group trajectories by equivalent answers, sum their aggregated scores as weights, select the answer class with the highest total weight, and grade it. Report accuracy as the fraction of correctly answered problems, optionally writing results to a text file.
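The two selection rules described above can differ on the same candidates, as this small sketch shows. The `(answer, score)` tuple layout and the function names are assumptions for illustration, and the default `is_equiv` is plain string equality, whereas a real run would plug in the grader's equivalence check.

```python
from typing import Callable, List, Tuple

# Each candidate is (extracted_answer, aggregated_score); layout is hypothetical.
Candidate = Tuple[str, float]


def best_of_n(cands: List[Candidate]) -> str:
    """Return the answer of the single highest-scoring trajectory."""
    return max(cands, key=lambda c: c[1])[0]


def weighted_majority(
    cands: List[Candidate],
    is_equiv: Callable[[str, str], bool] = lambda a, b: a == b,
) -> str:
    """Group equivalent answers, sum their scores, return the heaviest group."""
    groups: List[list] = []  # each entry: [representative_answer, total_weight]
    for ans, score in cands:
        for group in groups:
            if is_equiv(group[0], ans):
                group[1] += score
                break
        else:  # no existing group matched, so start a new one
            groups.append([ans, score])
    return max(groups, key=lambda g: g[1])[0]


cands = [("7", 0.9), ("5", 0.6), ("5", 0.5)]
print(best_of_n(cands))          # 7: the single best trajectory
print(weighted_majority(cands))  # 5: total weight 1.1 beats 0.9
```

The selected answer is then graded against the ground truth exactly once per problem, regardless of which rule produced it.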
Output:
- Number of correct answers (printed to stdout)
- Accuracy as a decimal fraction
- Optional text file with aggregation settings and accuracy
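A reporting step that produces these outputs could be as small as the sketch below. The function name, the printed labels, and the layout of the text file are all assumptions, since the actual file format is not specified here.

```python
from pathlib import Path
from typing import List, Optional


def report_accuracy(
    graded: List[bool], settings: str, out_path: Optional[str] = None
) -> float:
    """Print the correct count and accuracy; optionally write them to a file."""
    correct = sum(graded)
    accuracy = correct / len(graded) if graded else 0.0
    print(f"correct: {correct}")
    print(f"accuracy: {accuracy:.4f}")
    if out_path is not None:
        # Hypothetical layout: one line of settings, one line of accuracy.
        Path(out_path).write_text(
            f"settings: {settings}\naccuracy: {accuracy:.4f}\n"
        )
    return accuracy


# One boolean per problem: True if the selected answer was graded correct.
report_accuracy([True, False, True, True], settings="weighted majority, agg=prod")
```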