
Principle:Iamhankai Forest of Thought Monte Carlo Tree Search

From Leeroopedia
Knowledge Sources
Domains Search_Algorithms, Reasoning, Reinforcement_Learning
Last Updated 2026-02-14 03:00 GMT

Overview

A tree search algorithm that uses Upper Confidence Bounds (UCB) to balance exploration and exploitation when iteratively refining LLM-generated answers.

Description

Monte Carlo Tree Search (MCTS) in the Forest-of-Thought context applies the classic MCTS algorithm to LLM reasoning. Instead of game states, nodes represent answer candidates. The algorithm iteratively selects the most promising node via UCB scoring, expands it by generating a refined answer through the LLM, evaluates the new answer via reward sampling, and backpropagates reward signals to update UCB values. This produces progressively better answers through structured exploration of the reasoning space.

Key components:

  • Selection: UCB-based node selection balances exploring new paths vs. exploiting high-reward nodes
  • Expansion: LLM generates a "weak hint" from the selected node, then produces a "better answer" incorporating that hint
  • Reward estimation: Multiple reward samples are averaged to score each new answer
  • Backpropagation: Rewards update UCB values for ancestor nodes
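The reward-estimation step above (averaging several noisy scores) can be sketched as follows. `sample_reward` is a hypothetical stand-in for the LLM-based scorer; the real system would query the model several times and parse a numeric score from each reply.

```python
import random

def sample_reward(answer, rng):
    # Hypothetical stand-in for an LLM scorer call; here it just
    # draws a noisy score so the averaging logic is runnable.
    return rng.uniform(0.0, 1.0)

def estimate_reward(answer, n_samples=5, seed=0):
    """Average several noisy reward samples to reduce scorer variance."""
    rng = random.Random(seed)
    samples = [sample_reward(answer, rng) for _ in range(n_samples)]
    return sum(samples) / len(samples)
```

Averaging matters because a single LLM-generated score is high-variance; the mean of several samples is a more stable estimate to feed into backpropagation.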

Usage

Use this principle when the base search mode is set to mcts. MCTS is the default and most powerful base reasoning mode in FoT, offering iterative refinement that improves answer quality over multiple iterations. It is preferred for difficult reasoning tasks (MATH, AIME) where single-pass generation is insufficient.

Theoretical Basis

The UCB1 formula balances exploration and exploitation:

UCB_i = X̄_i + C · √(ln N / n_i)

Where:

  • X̄_i = average reward of node i
  • C = exploration constant
  • N = total visits to the parent node
  • n_i = visits to node i
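The UCB1 score can be computed directly from these quantities. This is a minimal sketch; the function name and the convention of returning infinity for unvisited nodes (so they are always tried first) are illustrative choices, not taken from the FoT codebase.

```python
import math

def ucb1(avg_reward, parent_visits, node_visits, c=1.414):
    """UCB1 score: exploitation term (average reward) plus exploration bonus."""
    if node_visits == 0:
        return float("inf")  # unvisited nodes are explored first
    return avg_reward + c * math.sqrt(math.log(parent_visits) / node_visits)

# A well-rewarded but heavily visited node vs. a weaker, rarely visited one:
frequent = ucb1(avg_reward=0.8, parent_visits=100, node_visits=50)
rare = ucb1(avg_reward=0.5, parent_visits=100, node_visits=2)
```

Note that `rare` scores higher than `frequent` here: the exploration bonus shrinks as a node accumulates visits, which is exactly what pushes the search toward under-explored branches.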

MCTS cycle (pseudo-code):

# Abstract MCTS for LLM reasoning
for iteration in range(max_iter):
    node = select_by_ucb(tree)            # Selection
    hint = get_weak_hints(node.answer)    # Expansion: generate hint
    new_answer = get_better_answer(hint)  # Expansion: refine
    reward = cal_reward(new_answer)       # Simulation: average reward samples
    tree.add_child(node, new_answer)      # Expansion: grow the tree
    backpropagate(node, reward)           # Backpropagation
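The cycle above can be made concrete with a small self-contained sketch. The `refine` and `score` callables stand in for the LLM refinement and reward-sampling calls; all names here are illustrative, not the FoT implementation's API.

```python
import math

class Node:
    def __init__(self, answer, parent=None):
        self.answer = answer
        self.parent = parent
        self.children = []
        self.visits = 0
        self.total_reward = 0.0

def ucb(node, c=1.414):
    """UCB1 score of a child node; unvisited nodes are selected first."""
    if node.visits == 0:
        return float("inf")
    return (node.total_reward / node.visits
            + c * math.sqrt(math.log(node.parent.visits) / node.visits))

def select(root):
    """Selection: descend the tree, always following the highest-UCB child."""
    node = root
    while node.children:
        node = max(node.children, key=ucb)
    return node

def backpropagate(node, reward):
    """Backpropagation: update visit counts and rewards up to the root."""
    while node is not None:
        node.visits += 1
        node.total_reward += reward
        node = node.parent

def best_answer(root):
    """Return the answer of the node with the highest average reward."""
    best, stack = root, [root]
    while stack:
        n = stack.pop()
        if n.visits and n.total_reward / n.visits > best.total_reward / best.visits:
            best = n
        stack.extend(n.children)
    return best.answer

def mcts(root, refine, score, max_iter=10):
    """Iteratively refine the root answer via select/expand/score/backprop."""
    backpropagate(root, score(root.answer))
    for _ in range(max_iter):
        leaf = select(root)                          # Selection
        child = Node(refine(leaf.answer), parent=leaf)  # Expansion
        leaf.children.append(child)
        backpropagate(child, score(child.answer))    # Simulation + Backprop
    return best_answer(root)

# Toy usage with stand-ins for the LLM refine/score calls
best = mcts(Node("draft"), refine=lambda a: a + "*",
            score=lambda a: min(len(a) / 10, 1.0), max_iter=8)
```

In the toy run, each refinement lengthens the answer and the stub scorer rewards length, so the search steadily deepens; with real LLM calls, `refine` would incorporate the weak hint and `score` would average several reward samples.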

The algorithm converges to high-quality answers because:

  1. UCB ensures all promising branches are explored
  2. Reward sampling provides accurate quality estimates
  3. Iterative refinement builds on previous best answers

Related Pages

Implemented By

Uses Heuristic
