Principle:SqueezeAILab ETS Score Aggregation

From Leeroopedia
Knowledge Sources
Domains: Evaluation, Statistical_Aggregation
Last Updated: 2026-02-14 02:00 GMT

Overview

Score aggregation is a family of functions that combine the per-step scores assigned by a process reward model (PRM) into a single trajectory-level score, which is then used for candidate ranking and weighted voting.

Description

Each candidate answer produced by the ETS tree search has a list of per-step PRM scores (one score per reasoning step in the trajectory). To rank candidates or weight their votes, these step-level scores must be aggregated into a single value. The ETS evaluation pipeline provides four aggregation strategies:

  • Min (agg_min): Returns the minimum step score. This is the most conservative strategy, penalizing any weak reasoning step regardless of how strong the other steps are.
  • Mean (agg_mean): Returns the arithmetic mean of all step scores. Provides a balanced view of overall trajectory quality.
  • Product (agg_prod): Returns the product of all step scores. Treats each step score as the probability of an independent event, so the product approximates the joint probability that every step is correct.
  • Last (agg_last): Returns only the final step score. This is the default strategy used in the evaluation scripts, assuming the last step's score captures cumulative quality.
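
The four strategies above can be sketched in a few lines of Python. This is a minimal illustration, not the code from the ETS repository: the function names follow the identifiers listed above, but the signatures, the AGG_FUNCS lookup table, and the string keys other than "last" are assumptions made for the example.

```python
import math
from typing import Callable, Dict, Sequence


def agg_min(scores: Sequence[float]) -> float:
    """Weakest step score in the trajectory."""
    return min(scores)


def agg_mean(scores: Sequence[float]) -> float:
    """Arithmetic mean of the step scores."""
    return sum(scores) / len(scores)


def agg_prod(scores: Sequence[float]) -> float:
    """Product of the step scores."""
    return math.prod(scores)


def agg_last(scores: Sequence[float]) -> float:
    """Score of the final step only."""
    return scores[-1]


# Hypothetical lookup table; only the "last" key is named in the text.
AGG_FUNCS: Dict[str, Callable[[Sequence[float]], float]] = {
    "min": agg_min,
    "mean": agg_mean,
    "prod": agg_prod,
    "last": agg_last,
}

# Made-up step scores for a three-step trajectory.
step_scores = [0.9, 0.5, 0.8]
results = {name: fn(step_scores) for name, fn in AGG_FUNCS.items()}
print(results)
```

The sketch assumes every trajectory has at least one step. For an empty list, agg_min, agg_mean, and agg_last raise an exception, while agg_prod returns 1.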

Usage

Select an aggregation strategy before running evaluation by passing the --agg_func CLI argument to math_evaluate.py. The provided evaluation scripts default to "last". When majority voting is used, an aggregation function also serves as the weight function for each candidate's vote, specified via --weight_agg.
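
To make the role of the weight function concrete, the following sketch shows one way weighted majority voting could work: each candidate's vote counts in proportion to its aggregated score. The weighted_vote helper, the candidate format, and the example scores are hypothetical. The source does not show how math_evaluate.py implements voting, so treat this as an illustration of the idea only.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple


def agg_last(scores: Sequence[float]) -> float:
    """Final step score, used here as the vote weight."""
    return scores[-1]


def weighted_vote(
    candidates: List[Tuple[str, Sequence[float]]],
    weight_fn: Callable[[Sequence[float]], float],
) -> str:
    """Return the answer whose candidates have the largest total weight.

    candidates is a list of (final_answer, step_scores) pairs.
    Ties go to the answer that appeared first in the list.
    """
    totals: Dict[str, float] = defaultdict(float)
    for answer, step_scores in candidates:
        totals[answer] += weight_fn(step_scores)
    return max(totals, key=totals.get)


# Made-up candidates: two low-scoring votes for "42", one strong vote for "17".
candidates = [
    ("42", [0.9, 0.4, 0.3]),
    ("42", [0.8, 0.5, 0.2]),
    ("17", [0.9, 0.9, 0.95]),
]

weighted_winner = weighted_vote(candidates, agg_last)
unweighted_winner = weighted_vote(candidates, lambda scores: 1.0)
print(weighted_winner, unweighted_winner)
```

With a constant weight the vote reduces to plain majority voting, which is why the two winners can differ when the most frequent answer comes from low-scoring trajectories.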

Theoretical Basis

Given a trajectory with step scores $s_1, s_2, \ldots, s_n$, the aggregation functions are:

  • $\text{agg\_min}(s) = \min_i s_i$
  • $\text{agg\_mean}(s) = \frac{1}{n} \sum_{i=1}^{n} s_i$
  • $\text{agg\_prod}(s) = \prod_{i=1}^{n} s_i$
  • $\text{agg\_last}(s) = s_n$
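
Two consequences of these definitions are worth spelling out. They hold under the assumption that every step score lies in $[0, 1]$, which is typical when a PRM outputs probabilities but is not stated in the source.

```latex
% Assumption: 0 \le s_i \le 1 for all i.
\begin{align*}
  % Each factor is at most 1, so the product cannot exceed any single
  % factor, and the minimum never exceeds the mean.
  \prod_{i=1}^{n} s_i \;\le\; \min_i s_i \;\le\; \frac{1}{n}\sum_{i=1}^{n} s_i, \\
  % The last score is one of the step scores, so it lies between the extremes.
  \min_i s_i \;\le\; s_n \;\le\; \max_i s_i. \\
  % Length effect: if every step scores 0.9, then min, mean, and last
  % all equal 0.9, but the product shrinks with the number of steps.
  s_i = 0.9 \;\; \forall i \;\Longrightarrow\; \prod_{i=1}^{n} s_i = 0.9^{\,n},
  \qquad 0.9^{10} \approx 0.349.
\end{align*}
```

Under that assumption, product aggregation tends to favor shorter trajectories when comparing candidates of different lengths, while the other three functions are insensitive to length for uniformly scored steps.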

The choice of aggregation function reflects different assumptions about how step quality relates to trajectory quality:

  • Min assumes the trajectory is only as strong as its weakest step (bottleneck model)
  • Mean assumes each step contributes equally to overall quality
  • Product models steps as independent success probabilities
  • Last assumes the final step score captures cumulative quality (empirically strong for PRM-scored trajectories)
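
The practical effect of these assumptions is that different aggregation functions can rank the same candidates differently. The short sketch below compares two made-up trajectories: one with a single weak middle step but a strong finish, and one that is uniformly mediocre. The scores and labels are invented for illustration and are not taken from ETS results.

```python
import math

# Hypothetical step scores for two candidate trajectories.
trajectories = {
    "A": [0.95, 0.9, 0.3, 0.85],  # one weak step, strong final step
    "B": [0.7, 0.7, 0.7, 0.7],    # uniformly moderate
}

# Inline versions of the four aggregation functions.
aggregators = {
    "min": min,
    "mean": lambda s: sum(s) / len(s),
    "prod": math.prod,
    "last": lambda s: s[-1],
}

# For each aggregator, find the trajectory with the higher aggregated score.
preferred = {
    name: max(trajectories, key=lambda label: fn(trajectories[label]))
    for name, fn in aggregators.items()
}
print(preferred)
```

Here min and product prefer the uniform trajectory B because the weak step in A dominates, while mean and last prefer A because the weak step is averaged out or ignored.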

Related Pages

Implemented By