Principle:SqueezeAILab ETS Evaluation Reporting

Knowledge Sources
Domains: Evaluation, Accuracy_Measurement
Last Updated: 2026-02-14 02:00 GMT

Overview

An evaluation framework that computes accuracy from tree search results using either best-of-n selection or weighted majority voting.

Description

After tree search produces candidate answers for each question, the evaluation phase determines final accuracy using two complementary strategies:

Best-of-N Selection (evaluate): For each question, select the candidate with the highest aggregated step score, extract its answer, and grade it against the ground truth. This measures how well the process reward model (PRM) identifies the best trajectory.

Weighted Majority Voting (majority_vote): For each question, group candidates into equivalence classes (using grade_answer for equality testing). Sum the weights of candidates in each class. Select the answer with the highest total weight. This leverages diversity of candidates and is more robust than best-of-n when the PRM is noisy.
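The "aggregated step score" used by both strategies collapses a trajectory's per-step PRM scores into a single number. The following is a minimal sketch of some common choices. Only agg_last is named on this page; agg_min, agg_mean, and agg_prod are hypothetical names chosen for illustration and may not match what math_evaluate.py provides.

```python
import math
from typing import Sequence

# agg_last is mentioned on this page; the other names are assumptions
# chosen for illustration and may not match math_evaluate.py.
def agg_last(step_scores: Sequence[float]) -> float:
    """Use the score of the final step only."""
    return step_scores[-1]

def agg_min(step_scores: Sequence[float]) -> float:
    """Use the weakest step, so a single bad step dominates."""
    return min(step_scores)

def agg_mean(step_scores: Sequence[float]) -> float:
    """Use the average step score."""
    return sum(step_scores) / len(step_scores)

def agg_prod(step_scores: Sequence[float]) -> float:
    """Use the product of all step scores."""
    return math.prod(step_scores)

scores = [0.9, 0.5, 0.8]
for agg in (agg_last, agg_min, agg_mean, agg_prod):
    print(agg.__name__, round(agg(scores), 3))
```

Any of these can serve as the ranking key for best-of-n or as the per-candidate weight in majority voting; which one works best depends on how the PRM was trained.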

The evaluation pipeline is run via the math_evaluate.py CLI or the scripts/evaluate.sh runner script, which iterates over multiple search widths.

Usage

Run evaluation after tree search completes and results are saved to answers.json. Choose best-of-n when evaluating a single aggregation strategy, or majority voting for ensemble-based evaluation across all candidates. The default evaluation script uses weighted majority voting with agg_last weights.
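As a rough illustration of the final accuracy computation, the sketch below scores a small in-memory result set. The record layout (ground_truth, candidates, answer, score) and the exact-match is_correct check are assumptions made for this example; the real answers.json schema and the grade_answer logic are not shown on this page.

```python
from typing import Callable

# Hypothetical record layout; the actual answers.json schema may differ.
results = [
    {"ground_truth": "4",
     "candidates": [{"answer": "4", "score": 0.9}, {"answer": "5", "score": 0.4}]},
    {"ground_truth": "7",
     "candidates": [{"answer": "6", "score": 0.8}, {"answer": "7", "score": 0.6}]},
]

def is_correct(predicted: str, truth: str) -> bool:
    """Stand-in for grade_answer: string equality after stripping whitespace."""
    return predicted.strip() == truth.strip()

def pick_best(candidates: list[dict]) -> str:
    """Best-of-n: return the answer of the highest-scoring candidate."""
    return max(candidates, key=lambda c: c["score"])["answer"]

def accuracy(records: list[dict], select: Callable[[list[dict]], str]) -> float:
    """Fraction of questions whose selected answer matches the ground truth."""
    correct = sum(is_correct(select(r["candidates"]), r["ground_truth"]) for r in records)
    return correct / len(records)

print(accuracy(results, pick_best))
```

Passing a different select function (for example, a majority-vote selector) reuses the same accuracy loop, which mirrors how the two strategies differ only in how the final answer is chosen.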

Theoretical Basis

The two evaluation strategies represent different approaches to answer selection:

# Best-of-N: select the highest-scoring trajectory
def best_of_n(candidates, agg_func):
    best = max(candidates, key=lambda c: agg_func(c.step_scores))
    return extract_answer(best.text)

# Weighted majority vote: weighted consensus across equivalent answers
def majority_vote(candidates, weight_func):
    # Each class is a list of candidates whose answers grade_answer treats as equal
    equiv_classes = group_by_equivalence(candidates, grade_answer)

    def total_weight(cls):
        return sum(weight_func(c.step_scores) for c in cls)

    best_class = max(equiv_classes, key=total_weight)
    return extract_answer(best_class[0].text)

Majority voting is generally more robust because it aggregates information across multiple candidates, reducing the impact of a single high-scoring but incorrect trajectory. The weighted variant further uses PRM scores so that high-quality candidates contribute more to the total weight of their equivalence class.
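To make the robustness argument concrete, here is a small self-contained sketch on made-up data for a single question. The answers_equal helper is a simplified stand-in for grade_answer (an assumption, not the repository's implementation), and the scores are invented to show one confident outlier competing against several agreeing candidates.

```python
# Toy candidates for one question: (final answer, aggregated PRM score).
# The values are invented to illustrate the failure mode described above.
candidates = [("12", 0.95), ("10", 0.70), ("10", 0.65), ("10.0", 0.60)]

def answers_equal(a: str, b: str) -> bool:
    """Stand-in for grade_answer: compare as numbers when possible."""
    try:
        return float(a) == float(b)
    except ValueError:
        return a.strip() == b.strip()

def best_of_n(cands):
    """Return the answer of the single highest-scoring candidate."""
    return max(cands, key=lambda c: c[1])[0]

def weighted_majority_vote(cands):
    """Sum weights per equivalence class and return the heaviest class's answer."""
    classes = []  # each entry: [representative answer, total weight]
    for answer, weight in cands:
        for cls in classes:
            if answers_equal(cls[0], answer):
                cls[1] += weight
                break
        else:
            # No existing class matched, so start a new one.
            classes.append([answer, weight])
    return max(classes, key=lambda cls: cls[1])[0]

print(best_of_n(candidates))
print(weighted_majority_vote(candidates))
```

Here best-of-n returns "12" because of its single 0.95 score, while the three candidates equivalent to "10" accumulate a total weight of 1.95 and win the vote.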
