# Heuristic: SqueezeAILab ETS Lambda Similarity Scaling
| Knowledge Sources | |
|---|---|
| Domains | Optimization, LLMs |
| Last Updated | 2026-02-14 02:30 GMT |
## Overview
Scale the diversity coefficient (`lambdas`) inversely with the number of discovered clusters to normalize the similarity term in the ILP objective, and tune `lambdas` per search width (1.5 for width=16, 1.4 for width=64, 1.45 for width=256).
## Description
The ETS ILP objective function includes a similarity term that rewards selecting nodes from diverse trajectory clusters. The raw `lambdas` coefficient is divided by the number of clusters `K` discovered through hierarchical clustering (`lambdas /= K`). This normalization prevents the similarity term from dominating the objective when many clusters exist, keeping it balanced against the reward score term regardless of the clustering outcome.
Additionally, the `lambdas` value itself is tuned per search width: 1.5 for width=16, 1.4 for width=64, and 1.45 for width=256. This non-monotonic pattern suggests empirical tuning for each width configuration.
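As a concrete illustration, here is a stdlib-only toy version of the selection objective, using brute-force enumeration in place of the ILP solver; the reward scores and cluster labels below are hypothetical, not from the ETS codebase.

```python
from itertools import combinations

rewards  = [0.9, 0.85, 0.8, 0.3]   # per-node reward scores (hypothetical)
clusters = [0, 0, 1, 2]            # cluster label per node (hypothetical)
K = len(set(clusters))             # number of discovered clusters
lambdas = 1.5 / K                  # the normalization: lambdas /= K

def objective(sel):
    reward = sum(rewards[i] for i in sel)
    covered = len({clusters[i] for i in sel})   # clusters covered by the selection
    return reward + lambdas * covered           # diversity bonus capped at 1.5 total

# pick the best pair of nodes
best = max(combinations(range(len(rewards)), 2), key=objective)
# → (0, 2): the diversity bonus beats the redundant same-cluster pair (0, 1)
```

Note how the two highest-reward nodes (0 and 1) sit in the same cluster, so the coverage bonus steers the selection toward node 2 instead.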
## Usage
Apply this heuristic when configuring the `lambdas` hyperparameter for `softmax_costmodel` selection. The cluster-count normalization is automatic in the code. The per-width `lambdas` values should be used as starting points and may need re-tuning for different models or datasets.
## The Insight (Rule of Thumb)
- Action: Set `lambdas` in the YAML config to control diversity encouragement. The code automatically divides by the number of clusters.
- Value: `lambdas=1.5` (width=16), `lambdas=1.4` (width=64), `lambdas=1.45` (width=256).
- Trade-off: Higher `lambdas` forces the ILP to select more diverse trajectories at the potential cost of excluding high-reward but similar nodes. Lower `lambdas` prioritizes reward scores, potentially selecting redundant trajectories.
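The trade-off can be seen by sweeping `lambdas` in a toy setup (all values hypothetical): at `lambdas=0` selection is purely reward-driven and picks two near-duplicate nodes, while a moderate `lambdas` flips the choice to a cross-cluster pair.

```python
from itertools import combinations

rewards  = [0.9, 0.85, 0.8]   # nodes 0 and 1 are near-duplicates (hypothetical)
clusters = [0, 0, 1]          # they share a cluster; node 2 is distinct
K = len(set(clusters))        # K = 2

def best_pair(lambdas):
    lam = lambdas / K         # cluster-count normalization
    def obj(sel):
        reward = sum(rewards[i] for i in sel)
        covered = len({clusters[i] for i in sel})
        return reward + lam * covered
    return max(combinations(range(len(rewards)), 2), key=obj)

# best_pair(0.0) → (0, 1): pure reward, redundant pair
# best_pair(1.4) → (0, 2): diversity bonus selects across clusters
```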
## Reasoning
Without cluster-count normalization, the similarity term in the objective would scale linearly with the number of clusters: `lambdas * sum(coverage[k])`. If there are 20 clusters, the maximum contribution is `20 * lambdas`, which could easily dominate the reward term. Dividing by K ensures the maximum contribution of the similarity term is always `lambdas`, regardless of how many clusters emerge.
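The capping argument is simple arithmetic: after dividing by `K`, the maximum contribution of the similarity term (all `K` coverage variables set to 1) is the same for any `K`.

```python
# (lambdas / K) * K == lambdas for every cluster count K
for K in (1, 5, 20, 100):
    lam = 1.5 / K                       # normalized coefficient
    max_contribution = lam * K          # all K coverage variables active
    assert abs(max_contribution - 1.5) < 1e-12
```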
The per-width tuning reflects different dynamics at each scale: at width=16, fewer trajectories mean each diversity bonus matters more (higher `lambdas`). At width=64, the larger pool naturally contains more diversity, so the coefficient is slightly lower. The slight increase at width=256 may compensate for the larger search space making redundancy more likely again.
The `lambdac` parameter (cost coefficient) is set to 1.0 across all configs, while the code applies a sign flip (`lambdac = -self.lambdac` if nonzero, else `1e-4` as guard) to make it a cost penalty in the objective.
## Code Evidence
Cluster-count normalization from `rebase.py:486`:

```python
# divide lambdas by num_clusters to scale term appropriately
lambdas /= K
```
Clustering and coverage variable setup from `rebase.py:474-483`:

```python
# clustering sequences
Z = linkage(embeddings, method='average', metric='cosine')
clusters = fcluster(Z, 0.05, criterion='distance') - 1  # subtract to get 0-indexed labels
K = len(np.unique(clusters))
# link coverage variables to cluster selections
coverage = [LpVariable(f"coverage_{i}", cat="Binary") for i in range(K)]
for k in range(K):
    cluster_indices = [i for i, c in enumerate(clusters) if c == k]
    problem += lpSum(x[i] for i in cluster_indices) >= coverage[k], f"Coverage_Lower_{k}"
```
Per-width `lambdas` values:

```yaml
# ets_16_math500.yaml
lambdas: 1.5

# ets_64_math500.yaml
lambdas: 1.4

# ets_256_math500.yaml
lambdas: 1.45
```
Lambda cost guard from `rebase.py:404-407`:

```python
if self.lambdac == 0:  # guard
    lambdac = 1e-4
else:
    lambdac = -self.lambdac
```