
Implementation:SqueezeAILab ETS Evaluate And Majority Vote

From Leeroopedia
Knowledge Sources
Domains Evaluation, Accuracy_Measurement
Last Updated 2026-02-14 02:00 GMT

Overview

A concrete tool for computing accuracy over tree search results via best-of-n selection or weighted majority voting, provided by math_evaluate.py.

Description

Two evaluation functions compute accuracy over tree search results:

  • evaluate(): Best-of-n selection — for each question, picks the candidate with highest aggregated score, extracts its answer, and grades against ground truth
  • majority_vote(): Weighted voting — for each question, groups candidates into equivalence classes using grade_answer(), sums weights, and selects the highest-weighted class

Both return accuracy as a float (num_correct / total).
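The two selection strategies described above can be sketched as follows. This is a minimal illustration, not the repository's code: `best_of_n` and `weighted_majority` are hypothetical names, the `(answer, step_scores)` candidate shape is an assumption, and `grade_answer` stands in for the equivalence-checking helper the description mentions.

```python
def best_of_n(candidates, agg):
    """Sketch of evaluate(): pick the candidate whose aggregated score is highest.

    candidates: list of (answer, [step_scores]) pairs (assumed shape).
    agg: score aggregation function, e.g. min/mean/prod/last over step scores.
    """
    return max(candidates, key=lambda c: agg(c[1]))[0]

def weighted_majority(candidates, agg, grade_answer):
    """Sketch of majority_vote(): group equivalent answers, sum their weights,
    and return the representative of the heaviest equivalence class."""
    classes = []  # each entry: [representative_answer, total_weight]
    for ans, scores in candidates:
        for cls in classes:
            if grade_answer(ans, cls[0]):  # same equivalence class
                cls[1] += agg(scores)
                break
        else:
            classes.append([ans, agg(scores)])
    return max(classes, key=lambda c: c[1])[0]
```

With `agg` set to a last-step aggregator, best-of-n and weighted voting can disagree: a single high-scoring candidate may win best-of-n while a class of several lower-scoring duplicates wins the vote.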

Usage

Both functions are called from main() in math_evaluate.py, which dispatches on the --agg_func CLI argument. Majority voting additionally requires the --weighted and --weight_agg arguments.
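A sketch of how main() might declare these flags with argparse; the argument names come from the CLI examples on this page, but the defaults and the string-to-bool handling of --weighted are assumptions, not the repository's code:

```python
import argparse

def build_parser():
    # Flag names taken from the documented CLI; defaults are assumptions.
    p = argparse.ArgumentParser()
    p.add_argument("--path", required=True)
    p.add_argument("--agg_func", default="last",
                   choices=["min", "mean", "prod", "last", "majority_vote"])
    p.add_argument("--model_type", default="llemma")
    # "--weighted True" is passed as a string on the CLI, so parse it explicitly.
    p.add_argument("--weighted", type=lambda s: s == "True", default=False)
    p.add_argument("--weight_agg", default="last")
    p.add_argument("--output_path", default=None)
    return p
```

main() would then branch: --agg_func majority_vote routes to majority_vote(), while min/mean/prod/last route to evaluate() with the matching aggregator.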

Code Reference

Source Location

  • Repository: ETS
  • File: math_evaluate.py
  • Lines: 91-108 (evaluate), 110-143 (majority_vote), 147-191 (main with CLI)

Signature

def evaluate(path, aggfunc, extract_function):
    """
    Best-of-N evaluation: select highest-scoring candidate per question.

    Args:
        path (str): Path to answers.json file
        aggfunc (Callable): Score aggregation function (agg_min/mean/prod/last)
        extract_function (Callable): Answer extraction function

    Returns:
        float: Accuracy (num_correct / total)
    """

def majority_vote(path, weighted, weight_func, extract_function):
    """
    Weighted majority voting evaluation.

    Args:
        path (str): Path to answers.json file
        weighted (bool): Whether to weight votes by step scores
        weight_func (Callable): Weight aggregation function (e.g., agg_last)
        extract_function (Callable): Answer extraction function

    Returns:
        float: Accuracy (num_correct / total)
    """

Import

# Defined in math_evaluate.py
# CLI entry point:
# python3 math_evaluate.py --path answers.json --agg_func majority_vote \
#     --model_type llemma --weighted True --weight_agg last

I/O Contract

Inputs

  • path (str, required): Path to the answers.json file from tree search output
  • aggfunc (Callable, required by evaluate()): Score aggregation function
  • extract_function (Callable, required): Answer extraction function for the model type
  • weighted (bool, required by majority_vote()): Whether to weight votes by step scores
  • weight_func (Callable, required by majority_vote() when weighted): Weight aggregation function

Outputs

  • accuracy (float): Fraction of correctly answered questions (num_correct / total)
  • stdout (text): Prints the num_correct count
  • output_path file (optional): Accuracy is appended to a text file when --output_path is specified
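The output side of this contract can be sketched in a few lines. `report_accuracy` is a hypothetical name for illustration; the real script interleaves this with the evaluation itself:

```python
def report_accuracy(accuracy, output_path=None):
    # Mirrors the documented outputs: print to stdout, and append the
    # accuracy to a text file when an output path is given.
    print(f"accuracy: {accuracy:.4f}")
    if output_path is not None:
        with open(output_path, "a") as f:
            f.write(f"{accuracy}\n")
```

Appending (rather than overwriting) matches the sweep script below, which writes one result per width into the same results file layout.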

Usage Examples

Best-of-N Evaluation

# Evaluate using best-of-n with last step score
accuracy = evaluate(
    path="exp_results/ets_16_math500/answers.json",
    aggfunc=agg_last,
    extract_function=extract_shepherd_answer,
)
print(f"Accuracy: {accuracy:.4f}")

Weighted Majority Voting

# Evaluate using weighted majority voting
accuracy = majority_vote(
    path="exp_results/ets_16_math500/answers.json",
    weighted=True,
    weight_func=agg_last,
    extract_function=extract_shepherd_answer,
)
print(f"Accuracy: {accuracy:.4f}")

CLI Usage

# Best-of-n with min aggregation
python3 math_evaluate.py --path exp_results/ets_16_math500/answers.json \
    --agg_func min --model_type llemma

# Weighted majority voting (default in evaluate.sh)
python3 math_evaluate.py --path exp_results/ets_16_math500/answers.json \
    --agg_func majority_vote --model_type llemma \
    --weighted True --weight_agg last \
    --output_path exp_results/ets_16_math500/results_vote_last.txt

Sweep Evaluation Script

# From scripts/evaluate.sh - evaluates across multiple widths
for WIDTH in 16 64 256; do
    export path="exp_results/ets_${WIDTH}_math500/"
    python3 ./math_evaluate.py --path $path/answers.json \
        --agg_func majority_vote \
        --output_path $path/results_vote_last.txt \
        --model_type llemma \
        --weighted True \
        --weight_agg last
done

Related Pages

Implements Principle

Requires Environment
