Principle:SqueezeAILab ETS Evaluation Reporting
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Accuracy_Measurement |
| Last Updated | 2026-02-14 02:00 GMT |
Overview
An evaluation framework that computes accuracy from tree search results using either best-of-n selection or weighted majority voting.
Description
After tree search produces candidate answers for each question, the evaluation phase determines final accuracy using two complementary strategies:
Best-of-N Selection (evaluate): For each question, select the candidate with the highest aggregated step score, extract its answer, and grade it against the ground truth. This measures how well the process reward model (PRM) identifies the best trajectory.
Weighted Majority Voting (majority_vote): For each question, group candidates into equivalence classes (using grade_answer for equality testing). Sum the weights of candidates in each class. Select the answer with the highest total weight. This leverages diversity of candidates and is more robust than best-of-n when the PRM is noisy.
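The grouping step of weighted majority voting can be sketched as follows. This is a minimal illustration, not the repository's code: `answers_equal` is a hypothetical stand-in for grade_answer (here a normalized string comparison), and candidates are assumed to be simple (answer, weight) pairs. Because equality is decided by a pairwise function rather than a hashable key, each candidate is compared against one representative per existing class.

```python
from typing import Callable, List, Tuple


def answers_equal(a: str, b: str) -> bool:
    """Hypothetical stand-in for grade_answer: compare normalized strings."""
    return a.strip().lower() == b.strip().lower()


def group_by_equivalence(
    candidates: List[Tuple[str, float]],
    is_equal: Callable[[str, str], bool] = answers_equal,
) -> List[List[Tuple[str, float]]]:
    """Partition (answer, weight) pairs into classes of equivalent answers."""
    classes: List[List[Tuple[str, float]]] = []
    for answer, weight in candidates:
        for cls in classes:
            # The first member of each class serves as its representative.
            if is_equal(cls[0][0], answer):
                cls.append((answer, weight))
                break
        else:
            # No existing class matched, so this answer starts a new class.
            classes.append([(answer, weight)])
    return classes


def weighted_vote(candidates: List[Tuple[str, float]]) -> str:
    """Return the representative answer of the class with the largest total weight."""
    classes = group_by_equivalence(candidates)
    best = max(classes, key=lambda cls: sum(w for _, w in cls))
    return best[0][0]


candidates = [("42", 0.6), (" 42 ", 0.5), ("41", 0.9)]
print(weighted_vote(candidates))  # "42": two agreeing answers (1.1) outweigh "41" (0.9)
```

Comparing only against a representative assumes the equality test behaves transitively. A grader that does not would need a more careful grouping rule.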
The evaluation pipeline is run via the math_evaluate.py CLI or the scripts/evaluate.sh runner script, which iterates over multiple search widths.
Usage
Run evaluation after tree search completes and its results are saved to answers.json. Choose between best-of-n (for single aggregation strategies) and majority voting (for ensemble-based evaluation). The default evaluation script uses weighted majority voting with agg_last weights.
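The name agg_last suggests that a trajectory's weight is the PRM score of its final step. The sketch below illustrates that reading. Only agg_last is named in this page; `agg_min` and `agg_mean` are hypothetical alternatives included for comparison and are not confirmed options of the evaluation script.

```python
from typing import Sequence


def agg_last(step_scores: Sequence[float]) -> float:
    """Assumed meaning of agg_last: the score of the final step."""
    return step_scores[-1]


def agg_min(step_scores: Sequence[float]) -> float:
    """Hypothetical alternative: the weakest step bounds the trajectory."""
    return min(step_scores)


def agg_mean(step_scores: Sequence[float]) -> float:
    """Hypothetical alternative: the average step score."""
    return sum(step_scores) / len(step_scores)


scores = [0.9, 0.4, 0.8]  # toy per-step PRM scores for one trajectory
for agg in (agg_last, agg_min, agg_mean):
    print(agg.__name__, round(agg(scores), 3))
```

The choice of aggregation matters most when step scores vary widely within a trajectory, as in the toy scores above, where the three functions give three different weights.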
Theoretical Basis
The two evaluation strategies represent different approaches to answer selection:
# Best-of-N: Select highest-scoring trajectory
def best_of_n(candidates, agg_func):
    best = max(candidates, key=lambda c: agg_func(c.step_scores))
    return extract_answer(best.text)

# Weighted Majority Vote: Weighted consensus
def majority_vote(candidates, weight_func):
    equiv_classes = group_by_equivalence(candidates, grade_answer)
    for cls in equiv_classes:
        cls.total_weight = sum(weight_func(c.step_scores) for c in cls)
    return max(equiv_classes, key=lambda cls: cls.total_weight).answer
Majority voting is generally more robust because it aggregates evidence across multiple candidates, reducing the impact of a single high-scoring but incorrect trajectory. The weighted variant additionally uses PRM scores so that high-quality candidates contribute more to the total weight of their equivalence class.
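To make the robustness argument concrete, the sketch below runs both strategies on two hand-made questions, one of which has a single high-scoring but wrong trajectory. Everything here is illustrative: the `Candidate` class, the toy scores, the dictionary layout of `questions`, and the exact-match `same_answer` grader are assumptions standing in for the repository's data structures and grade_answer.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Candidate:
    answer: str                   # final answer, assumed already extracted
    step_scores: Sequence[float]  # one PRM score per reasoning step


def agg_last(step_scores: Sequence[float]) -> float:
    """Assumed aggregation: use the final step's score."""
    return step_scores[-1]


def same_answer(a: str, b: str) -> bool:
    """Toy exact-match grader standing in for grade_answer."""
    return a == b


def best_of_n(cands: List[Candidate], agg: Callable) -> str:
    """Pick the answer of the single highest-scoring trajectory."""
    return max(cands, key=lambda c: agg(c.step_scores)).answer


def majority_vote(cands: List[Candidate], weight: Callable) -> str:
    """Sum weights per equivalence class and return the heaviest class's answer."""
    totals: List[list] = []  # each entry: [representative answer, summed weight]
    for c in cands:
        for entry in totals:
            if same_answer(entry[0], c.answer):
                entry[1] += weight(c.step_scores)
                break
        else:
            totals.append([c.answer, weight(c.step_scores)])
    return max(totals, key=lambda e: e[1])[0]


def accuracy(questions: List[dict], select: Callable) -> float:
    """Fraction of questions whose selected answer matches the ground truth."""
    correct = sum(
        same_answer(select(q["candidates"], agg_last), q["truth"]) for q in questions
    )
    return correct / len(questions)


questions = [
    {   # One confident but wrong trajectory versus three agreeing correct ones.
        "truth": "12",
        "candidates": [
            Candidate("15", [0.9, 0.95]),
            Candidate("12", [0.8, 0.7]),
            Candidate("12", [0.7, 0.6]),
            Candidate("12", [0.6, 0.5]),
        ],
    },
    {   # Both strategies agree here.
        "truth": "7",
        "candidates": [Candidate("7", [0.9, 0.9]), Candidate("3", [0.5, 0.2])],
    },
]

print("best-of-n:", accuracy(questions, best_of_n))                    # 0.5
print("weighted majority vote:", accuracy(questions, majority_vote))  # 1.0
```

In the first question the wrong answer "15" has the highest individual weight (0.95), so best-of-n selects it, while the three "12" candidates together carry a weight of about 1.8 and win the vote.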