Principle: Iamhankai Forest-of-Thought Monte Carlo Tree Search
| Knowledge Sources | |
|---|---|
| Domains | Search_Algorithms, Reasoning, Reinforcement_Learning |
| Last Updated | 2026-02-14 03:00 GMT |
Overview
A tree search algorithm that uses Upper Confidence Bounds (UCB) to balance exploration and exploitation when iteratively refining LLM-generated answers.
Description
Monte Carlo Tree Search (MCTS) in the Forest-of-Thought context applies the classic MCTS algorithm to LLM reasoning. Instead of game states, nodes represent answer candidates. The algorithm iteratively selects the most promising node via UCB scoring, expands it by generating a refined answer through the LLM, evaluates the new answer via reward sampling, and backpropagates reward signals to update UCB values. This produces progressively better answers through structured exploration of the reasoning space.
Key components:
- Selection: UCB-based node selection balances exploring new paths vs. exploiting high-reward nodes
- Expansion: LLM generates a "weak hint" from the selected node, then produces a "better answer" incorporating that hint
- Reward estimation: Multiple reward samples are averaged to score each new answer
- Backpropagation: Rewards update UCB values for ancestor nodes
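The four components above can be sketched as a small data structure plus two helpers. This is a minimal illustration, not the FoT codebase: the `Node` class and the names `estimate_reward` and `backpropagate` are hypothetical, and `sample_fn` stands in for whatever reward model scores an answer.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    """One answer candidate in the search tree."""
    answer: str
    parent: "Node | None" = None
    children: list = field(default_factory=list)
    visits: int = 0
    total_reward: float = 0.0

    def avg_reward(self) -> float:
        return self.total_reward / self.visits if self.visits else 0.0

def estimate_reward(answer, sample_fn, n_samples=3):
    # Reward estimation: average several (possibly noisy) reward samples.
    return sum(sample_fn(answer) for _ in range(n_samples)) / n_samples

def backpropagate(node, reward):
    # Backpropagation: update visit counts and reward sums up to the root,
    # which is what the UCB scores of ancestor nodes are computed from.
    while node is not None:
        node.visits += 1
        node.total_reward += reward
        node = node.parent
```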
Usage
Use this principle when the base search mode is set to `mcts`. MCTS is the default and most powerful base reasoning mode in FoT, offering iterative refinement that improves answer quality over multiple iterations. It is preferred for difficult reasoning tasks (MATH, AIME) where single-pass generation is insufficient.
Theoretical Basis
The UCB1 formula balances exploration and exploitation:

$$\mathrm{UCB}_i = \bar{x}_i + C\sqrt{\frac{\ln N}{n_i}}$$

Where:
- $\bar{x}_i$ = average reward of node $i$
- $C$ = exploration constant
- $N$ = total visits to the parent
- $n_i$ = visits to node $i$
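A direct translation of UCB1 into code, assuming rewards are normalized floats; the function name `ucb1` and the default $C \approx \sqrt{2}$ are illustrative choices, not values taken from FoT.

```python
import math

def ucb1(avg_reward: float, parent_visits: int, node_visits: int,
         c: float = 1.414) -> float:
    """UCB1 score: average reward plus an exploration bonus."""
    # Unvisited nodes score infinity, so every child is tried at least once.
    if node_visits == 0:
        return float("inf")
    return avg_reward + c * math.sqrt(math.log(parent_visits) / node_visits)
```

Note the trade-off encoded here: with equal visit counts the higher-reward node wins (exploitation), while with equal rewards the less-visited node wins (exploration), because its $\sqrt{\ln N / n_i}$ bonus is larger.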
MCTS cycle (pseudo-code):

```python
# Abstract MCTS for LLM reasoning
for iteration in range(max_iter):
    node = select_by_ucb(tree)            # Selection
    hint = get_weak_hints(node.answer)    # Expansion: generate hint
    new_answer = get_better_answer(hint)  # Expansion: refine
    reward = cal_reward(new_answer)       # Simulation: score
    backpropagate(node, reward)           # Backpropagation
```
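The cycle above can be exercised end to end with toy stand-ins. Everything here is hypothetical scaffolding: the stub `get_weak_hints`, `get_better_answer`, and `cal_reward` functions replace real LLM and reward-model calls, nodes are plain dicts, and selection scans a flat node list rather than descending the tree as a production implementation would.

```python
import math
import random

random.seed(0)

def ucb(node, parent_visits, c=1.4):
    if node["visits"] == 0:
        return float("inf")
    return node["reward"] / node["visits"] + c * math.sqrt(
        math.log(parent_visits) / node["visits"])

# Hypothetical stand-ins for the LLM and reward calls in the pseudo-code.
def get_weak_hints(answer):
    return f"hint({answer})"

def get_better_answer(hint):
    return f"refine({hint})"

def cal_reward(answer, samples=3):
    # Simulation: average several noisy reward draws.
    return sum(random.random() for _ in range(samples)) / samples

root = {"answer": "draft", "parent": None, "children": [],
        "visits": 0, "reward": 0.0}
nodes = [root]

for _ in range(5):
    # Selection: pick the highest-UCB node.
    total = sum(n["visits"] for n in nodes) + 1
    node = max(nodes, key=lambda n: ucb(n, total))
    # Expansion: generate a weak hint, then a refined answer.
    child = {"answer": get_better_answer(get_weak_hints(node["answer"])),
             "parent": node, "children": [], "visits": 0, "reward": 0.0}
    node["children"].append(child)
    nodes.append(child)
    # Simulation + backpropagation.
    r = cal_reward(child["answer"])
    n = child
    while n is not None:
        n["visits"] += 1
        n["reward"] += r
        n = n["parent"]

best = max(nodes, key=lambda n: n["reward"] / max(n["visits"], 1))
```

Each iteration adds exactly one refined candidate and pushes its reward up the parent chain, so the root's visit count equals the iteration count and `best` tracks the highest-average-reward answer found so far.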
The algorithm converges to high-quality answers because:
- UCB ensures all promising branches are explored
- Averaging multiple reward samples yields lower-variance quality estimates
- Iterative refinement builds on previous best answers