# Heuristic: SqueezeAILab ETS Lambda Similarity Scaling
| Knowledge Sources | |
|---|---|
| Domains | Optimization, LLMs |
| Last Updated | 2026-02-14 02:30 GMT |
## Overview
Scale the diversity coefficient (`lambdas`) inversely with the number of discovered clusters to normalize the similarity term in the ILP objective, and tune `lambdas` per search width (1.5 for width=16, 1.4 for width=64, 1.45 for width=256).
## Description
The ETS ILP objective function includes a similarity term that rewards selecting nodes from diverse trajectory clusters. The raw `lambdas` coefficient is divided by the number of clusters `K` discovered through hierarchical clustering (`lambdas /= K`). This normalization prevents the similarity term from dominating the objective when many clusters exist, keeping it balanced against the reward score term regardless of the clustering outcome.
Additionally, the `lambdas` value itself is tuned per search width: 1.5 for width=16, 1.4 for width=64, and 1.45 for width=256. This non-monotonic pattern suggests empirical tuning for each width configuration.
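As a concrete illustration, here is a stdlib-only toy version of the selection objective, using brute-force enumeration in place of the ILP solver; the reward scores and cluster labels below are hypothetical, not from the ETS codebase.

```python
from itertools import combinations

rewards  = [0.9, 0.85, 0.8, 0.3]   # per-node reward scores (hypothetical)
clusters = [0, 0, 1, 2]            # cluster label per node (hypothetical)
K = len(set(clusters))             # number of discovered clusters
lambdas = 1.5 / K                  # the normalization: lambdas /= K

def objective(sel):
    reward = sum(rewards[i] for i in sel)
    covered = len({clusters[i] for i in sel})   # clusters covered by the selection
    return reward + lambdas * covered           # diversity bonus capped at 1.5 total

# pick the best pair of nodes
best = max(combinations(range(len(rewards)), 2), key=objective)
# → (0, 2): the diversity bonus beats the redundant same-cluster pair (0, 1)
```

Note how the two highest-reward nodes (0 and 1) sit in the same cluster, so the coverage bonus steers the selection toward node 2 instead.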
## Usage
Apply this heuristic when configuring the `lambdas` hyperparameter for `softmax_costmodel` selection. The cluster-count normalization is automatic in the code. The per-width `lambdas` values should be used as starting points and may need re-tuning for different models or datasets.
## The Insight (Rule of Thumb)
- Action: Set `lambdas` in the YAML config to control diversity encouragement. The code automatically divides by the number of clusters.
- Value: `lambdas=1.5` (width=16), `lambdas=1.4` (width=64), `lambdas=1.45` (width=256).
- Trade-off: Higher `lambdas` forces the ILP to select more diverse trajectories at the potential cost of excluding high-reward but similar nodes. Lower `lambdas` prioritizes reward scores, potentially selecting redundant trajectories.
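The trade-off can be seen by sweeping `lambdas` in a toy setup (all values hypothetical): at `lambdas=0` selection is purely reward-driven and picks two near-duplicate nodes, while a moderate `lambdas` flips the choice to a cross-cluster pair.

```python
from itertools import combinations

rewards  = [0.9, 0.85, 0.8]   # nodes 0 and 1 are near-duplicates (hypothetical)
clusters = [0, 0, 1]          # they share a cluster; node 2 is distinct
K = len(set(clusters))        # K = 2

def best_pair(lambdas):
    lam = lambdas / K         # cluster-count normalization
    def obj(sel):
        reward = sum(rewards[i] for i in sel)
        covered = len({clusters[i] for i in sel})
        return reward + lam * covered
    return max(combinations(range(len(rewards)), 2), key=obj)

# best_pair(0.0) → (0, 1): pure reward, redundant pair
# best_pair(1.4) → (0, 2): diversity bonus selects across clusters
```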
## Reasoning
Without cluster-count normalization, the similarity term in the objective would scale linearly with the number of clusters: `lambdas * sum(coverage[k])`. If there are 20 clusters, the maximum contribution is `20 * lambdas`, which could easily dominate the reward term. Dividing by K ensures the maximum contribution of the similarity term is always `lambdas`, regardless of how many clusters emerge.
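The capping argument is simple arithmetic: after dividing by `K`, the maximum contribution of the similarity term (all `K` coverage variables set to 1) is the same for any `K`.

```python
# (lambdas / K) * K == lambdas for every cluster count K
for K in (1, 5, 20, 100):
    lam = 1.5 / K                       # normalized coefficient
    max_contribution = lam * K          # all K coverage variables active
    assert abs(max_contribution - 1.5) < 1e-12
```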
The per-width tuning reflects different dynamics at each scale: at width=16, fewer trajectories mean each diversity bonus matters more (higher `lambdas`). At width=64, the larger pool naturally contains more diversity, so the coefficient is slightly lower. The slight increase at width=256 may compensate for the larger search space making redundancy more likely again.
The `lambdac` parameter (cost coefficient) is set to 1.0 across all configs, while the code applies a sign flip (`lambdac = -self.lambdac` if nonzero, else `1e-4` as guard) to make it a cost penalty in the objective.
## Code Evidence
Cluster-count normalization from `rebase.py:486`:

```python
# divide lambdas by num_clusters to scale term appropriately
lambdas /= K
```
Clustering and coverage variable setup from `rebase.py:474-483`:

```python
# clustering sequences
Z = linkage(embeddings, method='average', metric='cosine')
clusters = fcluster(Z, 0.05, criterion='distance') - 1  # subtract to get 0-indexed labels
K = len(np.unique(clusters))
# link coverage variables to cluster selections
coverage = [LpVariable(f"coverage_{i}", cat="Binary") for i in range(K)]
for k in range(K):
    cluster_indices = [i for i, c in enumerate(clusters) if c == k]
    problem += lpSum(x[i] for i in cluster_indices) >= coverage[k], f"Coverage_Lower_{k}"
```
Per-width `lambdas` values:

```yaml
# ets_16_math500.yaml
lambdas: 1.5

# ets_64_math500.yaml
lambdas: 1.4

# ets_256_math500.yaml
lambdas: 1.45
```
Lambda cost guard from `rebase.py:404-407`:

```python
if self.lambdac == 0:  # guard
    lambdac = 1e-4
else:
    lambdac = -self.lambdac
```