Principle: Iamhankai Forest-of-Thought Monte Carlo Tree Search
| Knowledge Sources | |
|---|---|
| Domains | Search_Algorithms, Reasoning, Reinforcement_Learning |
| Last Updated | 2026-02-14 03:00 GMT |
Overview
A tree search algorithm that uses Upper Confidence Bounds (UCB) to balance exploration and exploitation when iteratively refining LLM-generated answers.
Description
Monte Carlo Tree Search (MCTS) in the Forest-of-Thought context applies the classic MCTS algorithm to LLM reasoning. Instead of game states, nodes represent answer candidates. The algorithm iteratively selects the most promising node via UCB scoring, expands it by generating a refined answer through the LLM, evaluates the new answer via reward sampling, and backpropagates reward signals to update UCB values. This produces progressively better answers through structured exploration of the reasoning space.
Key components:
- Selection: UCB-based node selection balances exploring new paths vs. exploiting high-reward nodes
- Expansion: LLM generates a "weak hint" from the selected node, then produces a "better answer" incorporating that hint
- Reward estimation: Multiple reward samples are averaged to score each new answer
- Backpropagation: Rewards update UCB values for ancestor nodes
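The four components above can be sketched as a small data structure plus two helpers. This is a minimal illustration, not the FoT codebase: the `Node` class and the names `estimate_reward` and `backpropagate` are hypothetical, and `sample_fn` stands in for whatever reward model scores an answer.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    """One answer candidate in the search tree."""
    answer: str
    parent: "Node | None" = None
    children: list = field(default_factory=list)
    visits: int = 0
    total_reward: float = 0.0

    def avg_reward(self) -> float:
        return self.total_reward / self.visits if self.visits else 0.0

def estimate_reward(answer, sample_fn, n_samples=3):
    # Reward estimation: average several (possibly noisy) reward samples.
    return sum(sample_fn(answer) for _ in range(n_samples)) / n_samples

def backpropagate(node, reward):
    # Backpropagation: update visit counts and reward sums up to the root,
    # which is what the UCB scores of ancestor nodes are computed from.
    while node is not None:
        node.visits += 1
        node.total_reward += reward
        node = node.parent
```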
Usage
Use this principle when the base search mode is set to `mcts`. MCTS is the default and most powerful base reasoning mode in FoT, offering iterative refinement that improves answer quality over multiple iterations. It is preferred for difficult reasoning tasks (MATH, AIME) where single-pass generation is insufficient.
Theoretical Basis
The UCB1 formula balances exploration and exploitation:

$$\mathrm{UCB}_i = \bar{x}_i + C\sqrt{\frac{\ln N}{n_i}}$$

Where:
- $\bar{x}_i$ = average reward of node $i$
- $C$ = exploration constant
- $N$ = total visits to the parent
- $n_i$ = visits to node $i$
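A direct translation of UCB1 into code, assuming rewards are normalized floats; the function name `ucb1` and the default $C \approx \sqrt{2}$ are illustrative choices, not values taken from FoT.

```python
import math

def ucb1(avg_reward: float, parent_visits: int, node_visits: int,
         c: float = 1.414) -> float:
    """UCB1 score: average reward plus an exploration bonus."""
    # Unvisited nodes score infinity, so every child is tried at least once.
    if node_visits == 0:
        return float("inf")
    return avg_reward + c * math.sqrt(math.log(parent_visits) / node_visits)
```

Note the trade-off encoded here: with equal visit counts the higher-reward node wins (exploitation), while with equal rewards the less-visited node wins (exploration), because its $\sqrt{\ln N / n_i}$ bonus is larger.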
MCTS cycle (pseudo-code):

```python
# Abstract MCTS for LLM reasoning
for iteration in range(max_iter):
    node = select_by_ucb(tree)            # Selection
    hint = get_weak_hints(node.answer)    # Expansion: generate hint
    new_answer = get_better_answer(hint)  # Expansion: refine
    reward = cal_reward(new_answer)       # Simulation: score
    backpropagate(node, reward)           # Backpropagation
```
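The cycle above can be exercised end to end with toy stand-ins. Everything here is hypothetical scaffolding: the stub `get_weak_hints`, `get_better_answer`, and `cal_reward` functions replace real LLM and reward-model calls, nodes are plain dicts, and selection scans a flat node list rather than descending the tree as a production implementation would.

```python
import math
import random

random.seed(0)

def ucb(node, parent_visits, c=1.4):
    if node["visits"] == 0:
        return float("inf")
    return node["reward"] / node["visits"] + c * math.sqrt(
        math.log(parent_visits) / node["visits"])

# Hypothetical stand-ins for the LLM and reward calls in the pseudo-code.
def get_weak_hints(answer):
    return f"hint({answer})"

def get_better_answer(hint):
    return f"refine({hint})"

def cal_reward(answer, samples=3):
    # Simulation: average several noisy reward draws.
    return sum(random.random() for _ in range(samples)) / samples

root = {"answer": "draft", "parent": None, "children": [],
        "visits": 0, "reward": 0.0}
nodes = [root]

for _ in range(5):
    # Selection: pick the highest-UCB node.
    total = sum(n["visits"] for n in nodes) + 1
    node = max(nodes, key=lambda n: ucb(n, total))
    # Expansion: generate a weak hint, then a refined answer.
    child = {"answer": get_better_answer(get_weak_hints(node["answer"])),
             "parent": node, "children": [], "visits": 0, "reward": 0.0}
    node["children"].append(child)
    nodes.append(child)
    # Simulation + backpropagation.
    r = cal_reward(child["answer"])
    n = child
    while n is not None:
        n["visits"] += 1
        n["reward"] += r
        n = n["parent"]

best = max(nodes, key=lambda n: n["reward"] / max(n["visits"], 1))
```

Each iteration adds exactly one refined candidate and pushes its reward up the parent chain, so the root's visit count equals the iteration count and `best` tracks the highest-average-reward answer found so far.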
The algorithm converges to high-quality answers because:
- UCB ensures all promising branches are explored
- Averaging multiple reward samples yields lower-variance quality estimates
- Iterative refinement builds on previous best answers