Workflow:Iamhankai Forest of Thought Game24 Forest Solving
| Knowledge Sources | |
|---|---|
| Domains | LLM_Reasoning, Game_Solving, Tree_Search |
| Last Updated | 2026-02-14 03:00 GMT |
Overview
End-to-end process for solving Game of 24 puzzles using a Forest-of-Thought BFS-based tree search approach with multiple trees and self-correction.
Description
This workflow solves Game of 24 problems, where four input numbers must be combined using arithmetic operations to reach the value 24. It uses a multi-tree forest approach where each tree performs a structured Breadth-First Search (BFS) with three decomposition steps: reducing four numbers to three, then three to two, then computing the final expression. Each step involves LLM-based proposal generation with optional self-correction, value-based evaluation for pruning, and greedy or sample-based selection of the best partial solutions. Input diversity across trees is achieved by shuffling the order of the four input numbers. Solutions are validated by evaluating the generated arithmetic expression with sympy.
Usage
Execute this workflow when you want to evaluate an LLM's ability to solve Game of 24 puzzles using the Forest-of-Thought framework. This is a structured combinatorial reasoning task distinct from the open-ended math benchmarks. The workflow requires a local HuggingFace model (typically Llama-3-8B-Instruct) with GPU access. Use this for ablation studies on tree count, to compare BFS forest search against naive CoT baselines, or to reproduce the paper's Game24 experimental results.
Execution Steps
Step 1: Configuration and Model Loading
Parse command-line arguments specifying the model path, backend name, temperature, task index range, number of trees, search method (greedy or sample), evaluation sample count, and selection sample count. Load the local LLM using the Pipeline class, which wraps HuggingFace text-generation pipeline for the Game24 task with sampling enabled.
Key considerations:
- The Game24 task always uses the text-generation pipeline mode (not direct model loading)
- Temperature defaults to 0.7 for balanced exploration
- The correction flag enables post-hoc self-correction of LLM proposals
- Both naive (CoT baseline) and forest (BFS tree search) modes are supported via the naive_run flag
Step 2: Problem Initialization with Input Diversity
For each Game24 problem, retrieve the four input numbers from the dataset. Generate multiple permutations of these numbers, one per tree in the forest. The first tree uses the original number ordering, while subsequent trees use random shuffles to introduce input diversity across the forest.
Key considerations:
- Input diversity is achieved by shuffling the four numbers, not by prompt variation
- The number of permutations equals the configured tree count
- Different orderings can lead the LLM to discover different solution paths
- Problems are indexed from the 24-puzzle dataset (typically indices 0-96)
Step 3: BFS Tree Search (Three-Stage Decomposition)
Execute a three-step BFS search within each tree. In step one, the LLM proposes ways to combine two of the four numbers into one result, reducing to three numbers. In step two, the LLM proposes ways to combine two of the three remaining numbers. In the final step, the LLM computes the expression from the last two numbers. At each step, a value function evaluates partial solutions, and the best candidates are selected for the next step.
Key considerations:
- Step 1 uses few-shot prompted proposals with optional arithmetic self-correction
- Step 2 uses a specialized three-to-two-number reduction prompt
- Step 3 uses a direct two-number computation that produces the final expression
- Value evaluation prompts the LLM to assess whether partial states are "sure", "likely", or "impossible" to reach 24
- Greedy selection keeps the top-k candidates by value score; sample selection uses value-proportional sampling
Step 4: Solution Validation
Validate the generated expression by checking two conditions: (1) the expression uses exactly the four original input numbers, and (2) the expression evaluates to 24. Validation uses Python eval for arithmetic evaluation and counter-based number matching. If a valid solution is found in any tree, the forest search terminates early for that problem.
Key considerations:
- Expression validation catches malformed outputs (wrong numbers, incorrect arithmetic)
- Early termination across trees saves compute when one tree finds a valid solution
- Invalid expressions from one tree do not prevent other trees from attempting the problem
- Results include the valid expression string and per-tree decode counts
Step 5: Result Logging
Write incremental results to a JSON log file after each problem. Each entry includes the problem index, the solution (if found), per-step search information (proposals, values, selections), running accuracy statistics, decode counts, and elapsed wall-clock time.
Key considerations:
- Log filenames encode backend, temperature, prompt type, index range, and tree count
- Results are overwritten after each problem for crash resilience
- Running accuracy percentage is computed against the total problem range
- Decode counts track total LLM inference calls for efficiency analysis