Workflow:Iamhankai Forest of Thought Game24 Forest Solving

Knowledge Sources	Forest-of-Thought Forest-of-Thought: Scaling Test-Time Compute Tree-of-Thought LLM
Domains	LLM_Reasoning, Game_Solving, Tree_Search
Last Updated	2026-02-14 03:00 GMT

Overview

End-to-end process for solving Game of 24 puzzles using a Forest-of-Thought BFS-based tree search approach with multiple trees and self-correction.

Description

This workflow solves Game of 24 problems, where four input numbers must be combined using arithmetic operations to reach the value 24. It uses a multi-tree forest approach where each tree performs a structured Breadth-First Search (BFS) with three decomposition steps: reducing four numbers to three, then three to two, then computing the final expression. Each step involves LLM-based proposal generation with optional self-correction, value-based evaluation for pruning, and greedy or sample-based selection of the best partial solutions. Input diversity across trees is achieved by shuffling the order of the four input numbers. Solutions are validated by evaluating the generated arithmetic expression with sympy.

Usage

Execute this workflow when you want to evaluate an LLM's ability to solve Game of 24 puzzles using the Forest-of-Thought framework. This is a structured combinatorial reasoning task distinct from the open-ended math benchmarks. The workflow requires a local HuggingFace model (typically Llama-3-8B-Instruct) with GPU access. Use this for ablation studies on tree count, to compare BFS forest search against naive CoT baselines, or to reproduce the paper's Game24 experimental results.

Execution Steps

Step 1: Configuration and Model Loading

Parse command-line arguments specifying the model path, backend name, temperature, task index range, number of trees, search method (greedy or sample), evaluation sample count, and selection sample count. Load the local LLM using the Pipeline class, which wraps HuggingFace text-generation pipeline for the Game24 task with sampling enabled.

Key considerations:

The Game24 task always uses the text-generation pipeline mode (not direct model loading)
Temperature defaults to 0.7 for balanced exploration
The correction flag enables post-hoc self-correction of LLM proposals
Both naive (CoT baseline) and forest (BFS tree search) modes are supported via the naive_run flag

Step 2: Problem Initialization with Input Diversity

For each Game24 problem, retrieve the four input numbers from the dataset. Generate multiple permutations of these numbers, one per tree in the forest. The first tree uses the original number ordering, while subsequent trees use random shuffles to introduce input diversity across the forest.

Key considerations:

Input diversity is achieved by shuffling the four numbers, not by prompt variation
The number of permutations equals the configured tree count
Different orderings can lead the LLM to discover different solution paths
Problems are indexed from the 24-puzzle dataset (typically indices 0-96)

Step 3: BFS Tree Search (Three-Stage Decomposition)

Execute a three-step BFS search within each tree. In step one, the LLM proposes ways to combine two of the four numbers into one result, reducing to three numbers. In step two, the LLM proposes ways to combine two of the three remaining numbers. In the final step, the LLM computes the expression from the last two numbers. At each step, a value function evaluates partial solutions, and the best candidates are selected for the next step.

Key considerations:

Step 1 uses few-shot prompted proposals with optional arithmetic self-correction
Step 2 uses a specialized three-to-two-number reduction prompt
Step 3 uses a direct two-number computation that produces the final expression
Value evaluation prompts the LLM to assess whether partial states are "sure", "likely", or "impossible" to reach 24
Greedy selection keeps the top-k candidates by value score; sample selection uses value-proportional sampling

Step 4: Solution Validation

Validate the generated expression by checking two conditions: (1) the expression uses exactly the four original input numbers, and (2) the expression evaluates to 24. Validation uses Python eval for arithmetic evaluation and counter-based number matching. If a valid solution is found in any tree, the forest search terminates early for that problem.

Key considerations:

Expression validation catches malformed outputs (wrong numbers, incorrect arithmetic)
Early termination across trees saves compute when one tree finds a valid solution
Invalid expressions from one tree do not prevent other trees from attempting the problem
Results include the valid expression string and per-tree decode counts

Step 5: Result Logging

Write incremental results to a JSON log file after each problem. Each entry includes the problem index, the solution (if found), per-step search information (proposals, values, selections), running accuracy statistics, decode counts, and elapsed wall-clock time.

Key considerations:

Log filenames encode backend, temperature, prompt type, index range, and tree count
Results are overwritten after each problem for crash resilience
Running accuracy percentage is computed against the total problem range
Decode counts track total LLM inference calls for efficiency analysis

Execution Diagram

GitHub URL

Workflow Repository