# Heuristic: Iamhankai Forest-of-Thought UCB Exploration Constant
| Knowledge Sources | |
|---|---|
| Domains | Optimization, LLMs, Tree_Search |
| Last Updated | 2026-02-14 03:30 GMT |
## Overview
UCB exploration-exploitation tuning using C=1.4 and gamma=0.85 for balancing tree search breadth against exploitation depth in the MCTS reasoning forest.
## Description
The Forest-of-Thought MCTS implementation uses the Upper Confidence Bound (UCB) formula to select which reasoning node to expand next. The exploration constant C=1.4 controls the exploration-exploitation trade-off: higher values favor exploring less-visited nodes, while lower values favor exploiting nodes with higher known rewards. The gamma=0.85 discount factor is declared in the `update_ucb` signature and is intended to control how parent node values are blended with child rewards, though the reward-averaging snippet in the code evidence does not reference it.
The reward for each node is computed as the average of the minimum reward and the mean reward: `(min(rewards) + mean(rewards)) / 2`. This pessimistic averaging strategy guards against overly optimistic reward estimates from single lucky evaluations.
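The averaging step can be sketched as follows (a minimal illustration, not the repository's exact code; the sample reward lists are invented):

```python
from statistics import mean

def pessimistic_reward(rewards):
    """Blend the worst observed reward with the mean: (min + mean) / 2."""
    return (min(rewards) + mean(rewards)) / 2

# One lucky 90 among mediocre scores is damped by the min term.
noisy = [40, 45, 90]    # mean ~58.3, min 40 -> estimate ~49.2
steady = [55, 60, 58]   # mean ~57.7, min 55 -> estimate ~56.3

print(pessimistic_reward(noisy))
print(pessimistic_reward(steady))
```

Although `noisy` has the higher plain mean, the pessimistic estimate ranks `steady` above it, which is exactly the outlier-damping behavior described above.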
## Usage
Use these constants when running the MCTS-based Forest-of-Thought benchmark evaluation. The values C=1.4 and the pessimistic reward averaging are hardcoded defaults. If the search is getting stuck on a single reasoning path (under-exploration), consider increasing C. If the search is too scattered and not converging, consider decreasing C.
## The Insight (Rule of Thumb)
- Action: Use `C=1.4` in the UCB formula `r_c + C * sqrt(log(N_parent + 1) / (N_child + 1e-5))` for MCTS node selection.
- Value: C=1.4, gamma=0.85 (hardcoded in `update_ucb`).
- Trade-off: C=1.4 sits just below the classical UCB1 constant sqrt(2) ≈ 1.414, making it effectively the theoretically motivated exploration weight. This keeps exploration pressure high, which is appropriate for LLM reasoning where diverse reasoning paths often yield better answers than deep exploitation of a single chain.
- Reward Formula: Node reward is `(min(rewards) + mean(rewards)) / 2`, a pessimistic estimate that avoids over-reliance on outlier high scores from the LLM self-judge.
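To see the trade-off concretely, here is a hedged sketch (node names, visit counts, and rewards are invented) that applies the UCB formula from the bullets above to two candidate children:

```python
import math

def ucb(r_c, n_parent, n_child, C=1.4):
    """UCB score: r_c + C * sqrt(log(N_parent + 1) / (N_child + 1e-5))."""
    return r_c + C * math.sqrt(math.log(n_parent + 1) / (n_child + 1e-5))

# Parent visited 10 times. Child A exploits (high reward, many visits);
# child B explores (lower reward, one visit). Rewards normalized to 0-1.
a = ucb(r_c=0.8, n_parent=10, n_child=8)
b = ucb(r_c=0.6, n_parent=10, n_child=1)

print(f"A={a:.3f}  B={b:.3f}")  # B's larger exploration bonus wins
```

Despite its lower average reward, the rarely visited child B scores higher under C=1.4, so the search branches out before committing to A's chain.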
## Reasoning
The UCB1 algorithm guarantees logarithmic regret with C = sqrt(2) in the standard multi-armed bandit setting. However, in LLM reasoning trees, the reward signal comes from a self-judging LLM (scoring 0-100), which is inherently noisy. The pessimistic reward formula `(min + mean) / 2` compensates for this noise by weighting toward the worst observed outcome. Combined with C=1.4, this encourages the search to thoroughly explore alternative reasoning paths while still prioritizing nodes that consistently produce good answers across multiple evaluations.
The `1e-5` epsilon in the denominator prevents division by zero for unvisited nodes, while the `+ 1` inside `log(N_parent + 1)` keeps the logarithm finite when the parent has never been visited and keeps the exploration bonus nonzero once it has been visited at least once.
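The effect of the epsilon is easy to demonstrate in isolation (a sketch with invented visit counts, not the repository's code):

```python
import math

C = 1.4

def exploration_term(n_parent, n_child, eps=1e-5):
    # With eps, an unvisited child (n_child = 0) yields a huge but
    # finite bonus instead of a ZeroDivisionError, so fresh nodes
    # are always tried before well-visited siblings.
    return C * math.sqrt(math.log(n_parent + 1) / (n_child + eps))

print(exploration_term(5, 0))  # very large: unvisited node dominates
print(exploration_term(5, 3))  # modest bonus for a visited node
```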
## Code Evidence
UCB computation from `run_with_mcf_stop_noearly.py:L168-169`:

```python
def compute_ucb(r_c, N_n, N_c, C):
    return r_c + C * math.sqrt(math.log(N_n + 1) / (N_c + 1e-5))
```
UCB update with C=1.4, gamma=0.85 from `run_with_mcf_stop_noearly.py:L171-177`:

```python
def update_ucb(fathers, childs, to_explore, to_explore_reward, ucb_bank, C=1.4, gamma=0.85):
    visit_count = {node: len(to_explore_reward[node]) for node in to_explore}
    avg_reward = {node: (min(to_explore_reward[node]) + np.mean(to_explore_reward[node])) / 2
                  for node in to_explore}
```
Weighted node selection from `run_with_mcf_stop_noearly.py:L213-221`:

```python
def get_tree_ans(to_explore_reward, ucb_bank, answers_list):
    def weighted_score(answer):
        reward_score = min(to_explore_reward[answer]) * 0.5  # 50% reward
        visit_count = len(to_explore_reward[answer]) * 0.3   # 30% visits
        ucb_score = ucb_bank.get(answer, 0) * 0.2            # 20% UCB
        return reward_score + visit_count + ucb_score
```
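The excerpt above ends before the final selection step. A plausible completion (the reward lists and UCB values below are invented for illustration, not taken from the repository) simply takes the argmax of `weighted_score` over the candidate answers:

```python
# Hedged sketch: invented candidate answers, reward lists, and UCB values.
to_explore_reward = {"ans_a": [70, 80], "ans_b": [60, 65, 68]}
ucb_bank = {"ans_a": 1.9, "ans_b": 1.5}
answers_list = ["ans_a", "ans_b"]

def weighted_score(answer):
    reward_score = min(to_explore_reward[answer]) * 0.5  # pessimistic reward
    visit_count = len(to_explore_reward[answer]) * 0.3   # visit count
    ucb_score = ucb_bank.get(answer, 0) * 0.2            # UCB value
    return reward_score + visit_count + ucb_score

best = max(answers_list, key=weighted_score)
print(best)  # prints: ans_a
```

Note that the three weights are applied to quantities on different scales (rewards in 0-100, raw visit counts, UCB values), so in practice the reward term dominates unless visit counts grow large.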