Workflow:Iamhankai Forest of Thought FoT Benchmark Evaluation

Knowledge Sources	Forest-of-Thought Forest-of-Thought: Scaling Test-Time Compute
Domains	LLM_Reasoning, Test_Time_Compute, Benchmarking
Last Updated	2026-02-14 03:00 GMT

Overview

End-to-end process for evaluating Large Language Model reasoning on math benchmarks (GSM8K, MATH500, AIME) using the Forest-of-Thought framework with configurable base reasoning modes (MCTS, CoT, ToT) and consensus-guided decision making.

Description

This workflow implements the core Forest-of-Thought evaluation pipeline for mathematical reasoning benchmarks. It loads a local HuggingFace language model, reads a benchmark dataset, and runs multiple independent reasoning trees per problem. Each tree uses one of three base reasoning modes: Monte Carlo Tree Search with self-refinement (MCTS), Chain-of-Thought (CoT), or Tree-of-Thought (ToT). After all trees complete for a given problem, the framework aggregates answers via majority voting with optional early stopping. When no majority consensus is reached, a Consensus-Guided Decision Making (CGDM) strategy invokes an LLM-as-judge to select the best answer. Results including per-tree answers, scores, and correctness are written to JSON output files.

Usage

Execute this workflow when you have a mathematical reasoning benchmark dataset (GSM8K, MATH500, or AIME format) and want to evaluate an LLM's reasoning capabilities using Forest-of-Thought. The workflow requires a local HuggingFace model with GPU access (CUDA). It supports models from the Llama, Qwen, GLM, DeepSeek, and Mistral families. Use this when you need to reproduce the paper's benchmark results or evaluate a new model on established math reasoning tasks.

Execution Steps

Step 1: Environment Setup and Argument Parsing

Parse command-line arguments that configure the evaluation run. Key parameters include the model path, model type, dataset name, dataset file path, number of trees in the forest, maximum MCTS iterations per tree, the stopping strategy (CGDM, majority, score, random), and the base reasoning mode (MCTS, CoT, or ToT). Optional flags enable dynamic self-correction with a configurable confidence threshold.

Key considerations:

The dataset identifier string encodes naming conventions used for output paths and answer format detection
Start and end indices allow partial evaluation over dataset subsets
The base mode selection determines which reasoning algorithm each tree executes
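
The parameters described in this step can be sketched with argparse. This is a minimal illustration only: every flag name and default below is an assumption made for the example and may differ from the repository's actual command-line interface. The one default taken from this page is CGDM as the stopping strategy.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # NOTE: all flag names and defaults are hypothetical, chosen for illustration.
    p = argparse.ArgumentParser(description="Forest-of-Thought evaluation (sketch)")
    p.add_argument("--model-path", required=True, help="local HuggingFace model directory")
    p.add_argument("--model-type", default="llama",
                   choices=["llama", "qwen", "glm", "deepseek", "mistral"])
    p.add_argument("--dataset", default="gsm8k", help="dataset identifier string")
    p.add_argument("--data-path", required=True, help="Parquet or JSONL benchmark file")
    p.add_argument("--num-trees", type=int, default=4, help="trees in the forest")
    p.add_argument("--max-iter", type=int, default=8, help="max MCTS iterations per tree")
    p.add_argument("--stop", default="cgdm",
                   choices=["cgdm", "majority", "score", "random"])
    p.add_argument("--base-mode", default="mcts", choices=["mcts", "cot", "tot"])
    # Start and end indices select a subset of the dataset.
    p.add_argument("--start", type=int, default=0)
    p.add_argument("--end", type=int, default=None)
    # Optional dynamic self-correction with a confidence threshold.
    p.add_argument("--correct", action="store_true")
    p.add_argument("--correct-threshold", type=float, default=-1.0)
    return p
```

A call such as build_parser().parse_args(["--model-path", "m", "--data-path", "d.jsonl"]) yields a namespace whose stop field is "cgdm" and whose base_mode field is "mcts".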

Step 2: Model Loading

Initialize the local language model pipeline. The loader auto-detects the model family (Llama, Qwen, GLM, DeepSeek, Mistral) from the model path name and configures appropriate tokenization and generation settings. Models are loaded in bfloat16 or float16 precision with automatic device mapping across available GPUs.

Key considerations:

Qwen models receive a system prompt instructing step-by-step reasoning with boxed final answers
Mistral models use a text-generation pipeline rather than direct model loading
The Pipeline class tracks inference count and tokens-per-second for performance monitoring
Dynamic self-correction is enabled per-model via a confidence threshold on log-probabilities
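
The family auto-detection and the performance counters described above might look roughly like the following sketch. The names detect_model_family, FAMILY_KEYWORDS, and ThroughputTracker are hypothetical, and the keyword order is an assumption; the actual loader may resolve ambiguous names differently.

```python
import os
from dataclasses import dataclass

# Checked in order. "deepseek" comes first on the assumption that distilled
# DeepSeek checkpoints may also carry "qwen" or "llama" in their names.
FAMILY_KEYWORDS = ("deepseek", "qwen", "glm", "mistral", "llama")


def detect_model_family(model_path: str) -> str:
    """Guess the model family from the last component of the model path."""
    name = os.path.basename(model_path.rstrip("/")).lower()
    for family in FAMILY_KEYWORDS:
        if family in name:
            return family
    raise ValueError(f"cannot infer model family from {model_path!r}")


@dataclass
class ThroughputTracker:
    """Hypothetical stand-in for the counters kept by the Pipeline class."""
    inference_count: int = 0
    total_tokens: int = 0
    total_seconds: float = 0.0

    def record(self, tokens: int, seconds: float) -> None:
        # Called once per generation call.
        self.inference_count += 1
        self.total_tokens += tokens
        self.total_seconds += seconds

    @property
    def tokens_per_second(self) -> float:
        if self.total_seconds <= 0:
            return 0.0
        return self.total_tokens / self.total_seconds
```

Matching on the path name keeps configuration light, but it depends on checkpoint directories keeping their conventional names; an explicit model-type argument is the safer choice for renamed directories.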

Step 3: Dataset Loading

Load the benchmark dataset from a Parquet or JSONL file. The loader extracts query/ground-truth pairs according to the dataset type, handling the different column schemas of the GSM8K, MATH500, and AIME datasets. Each query is paired with its ground-truth label for iterative evaluation.

Key considerations:

Different datasets use different column names (question/problem, answer/solution)
The answer format template is derived from the dataset type to guide model output formatting
Few-shot learning examples are loaded from a pre-built example library matched to the dataset domain
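
A minimal sketch of the pairing logic for JSONL input follows. The COLUMN_MAP assignments are assumptions: this page only states that the column names vary between question/problem and answer/solution, not which dataset uses which. The function name load_pairs is also hypothetical.

```python
import json
from typing import Iterable, List, Optional, Tuple

# Assumed schema per dataset: (query column, ground-truth column).
COLUMN_MAP = {
    "gsm8k": ("question", "answer"),
    "math500": ("problem", "solution"),
    "aime": ("problem", "answer"),
}


def load_pairs(lines: Iterable[str], dataset: str,
               start: int = 0, end: Optional[int] = None) -> List[Tuple[str, str]]:
    """Parse JSONL lines into (query, ground_truth) pairs for one dataset type."""
    query_key, truth_key = COLUMN_MAP[dataset.lower()]
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        row = json.loads(line)
        pairs.append((row[query_key], str(row[truth_key])))
    # Start and end indices allow evaluation over a subset.
    return pairs[start:end]
```

An open file handle can be passed directly as lines, since iterating a text file yields one line at a time. For Parquet input, the same pairing could be applied to rows read with a Parquet reader such as pandas.read_parquet.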

Step 4: Forest Construction and Tree Execution

For each problem in the dataset, construct a forest of independent reasoning trees. The first tree uses a direct query (quick thinking), while subsequent trees prepend a similar example from the few-shot library to introduce input diversity (slow thinking). Each tree runs the selected base reasoning mode:

MCTS mode: Generate an initial answer, then iteratively select nodes via UCB scores, generate reflections/hints, refine answers, and backpropagate reward scores through the tree.

CoT mode: Generate a single chain-of-thought answer per tree, with diversity coming from the few-shot prefix variation.

ToT mode: Run a structured tree search with step-by-step proposals, value-based evaluation, and breadth-first or depth-first expansion.
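
The input-diversity scheme described above (a direct query for the first tree, an example-prefixed query for the rest) can be sketched as below. The function name build_tree_prompts is hypothetical, and cycling through a pre-ranked list of examples is an assumption; how the framework retrieves a similar example is not shown on this page.

```python
from typing import List


def build_tree_prompts(query: str, examples: List[str], num_trees: int) -> List[str]:
    """Return one prompt per tree in the forest."""
    prompts = []
    for i in range(num_trees):
        if i == 0 or not examples:
            # First tree: direct query ("quick thinking").
            prompts.append(query)
        else:
            # Later trees: prepend a worked example ("slow thinking").
            example = examples[(i - 1) % len(examples)]
            prompts.append(f"{example}\n\n{query}")
    return prompts
```

Each prompt would then be handed to whichever base mode (MCTS, CoT, or ToT) is configured for the run.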

Key considerations:

Tree count is configurable; more trees increase coverage but also compute cost
More MCTS iterations per tree deepen the search within each tree, at the cost of additional LLM calls
UCB (Upper Confidence Bound) balances exploration vs exploitation in node selection
Reward scoring uses an LLM self-critique mechanism that rates answers on a 0-100 scale
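
As an illustration of UCB-based node selection, the sketch below uses the textbook UCB1 form: mean reward plus an exploration bonus that shrinks as a node is visited more often. The exploration constant c, the smoothing term eps, and the function names are assumptions; the repository's exact formula may differ.

```python
import math
from typing import Dict, Tuple


def ucb_score(mean_reward: float, parent_visits: int, child_visits: int,
              c: float = 1.4, eps: float = 1e-6) -> float:
    """UCB1-style score; eps avoids division by zero for unvisited nodes."""
    exploration = math.sqrt(math.log(parent_visits + 1) / (child_visits + eps))
    return mean_reward + c * exploration


def select_node(stats: Dict[str, Tuple[float, int]]) -> str:
    """Pick the node with the highest score.

    stats maps a node id to (mean_reward, visit_count).
    """
    parent_visits = sum(visits for _, visits in stats.values())
    return max(
        stats,
        key=lambda node: ucb_score(stats[node][0], parent_visits, stats[node][1]),
    )
```

With rewards on a 0-100 scale, a small c makes selection almost purely greedy among visited nodes, while unvisited nodes still receive a very large bonus and are tried first. Whether the framework rescales rewards or tunes c is not stated here.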

Step 5: Consensus and Early Stopping

After each tree completes, check whether a majority consensus has been reached across all tree answers so far. If more than half of the configured tree count agrees on the same extracted answer, trigger early stopping to save compute. Answers are compared after dataset-specific label extraction, which handles boxed answers, #### markers, and numeric parsing.

Key considerations:

Early stopping only activates after at least two trees have completed
Answer extraction normalizes different output formats to comparable labels
For MATH-type datasets, symbolic equivalence checking via sympy is used
The majority threshold is strictly greater than half the total tree count
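
The extraction and early-stopping rules above can be sketched as follows. Both function names are hypothetical, the regular expressions are simplified (nested braces inside \boxed{...} are not handled), and the symbolic equivalence check used for MATH-type datasets is omitted.

```python
import re
from collections import Counter
from typing import List, Optional


def extract_label(text: str) -> str:
    """Normalize a model output to a comparable label (simplified)."""
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", text)
    if boxed:
        return boxed[-1].strip()          # last boxed answer wins
    marker = re.search(r"####\s*(.+)", text)
    if marker:
        return marker.group(1).strip().replace(",", "")
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else text.strip()


def majority_answer(labels: List[str], total_trees: int) -> Optional[str]:
    """Return the consensus label, or None if early stopping should not fire."""
    if len(labels) < 2:
        return None                        # need at least two completed trees
    label, count = Counter(labels).most_common(1)[0]
    # Strictly greater than half of the total tree count.
    return label if count > total_trees / 2 else None
```

With four trees, two matching answers are not enough (2 is not greater than 2), so the earliest possible stop is after the third tree when all three agree.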

Step 6: Final Answer Selection (CGDM)

When no majority consensus is reached after all trees complete, apply the configured stopping strategy to select the final answer. Under CGDM (the default), first attempt majority voting on extracted answers. If the vote is tied, invoke an LLM-as-expert-judge that receives the original question and all candidate answers and selects the best one. The alternative strategies are plain majority voting, random selection, and score-based selection.

Key considerations:

The expert judge response is cached per question to avoid redundant LLM calls
If the judge returns a non-answer (such as "I Don't Know"), a fresh CoT generation is used as a fallback
Score-based selection uses a weighted combination of reward scores, visit counts, and UCB values
The expert memory dictionary persists across problems within a single evaluation run
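
A minimal sketch of the CGDM selection path follows, with the judge and the CoT fallback injected as plain callables so that no particular model API is assumed. The names cgdm_select and INVALID_ANSWERS and the exact set of rejected strings are hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional

# Judge outputs treated as non-answers (assumed set, compared case-insensitively).
INVALID_ANSWERS = {"", "i don't know"}


def cgdm_select(question: str,
                candidates: List[str],
                judge: Callable[[str, List[str]], str],
                expert_memory: Dict[str, str],
                fallback: Optional[Callable[[str], str]] = None) -> str:
    """Majority vote first; on a tie, defer to a cached LLM judge."""
    if not candidates:
        raise ValueError("no candidate answers to choose from")
    ranked = Counter(candidates).most_common()
    top_count = ranked[0][1]
    leaders = [label for label, count in ranked if count == top_count]
    if len(leaders) == 1:
        return leaders[0]                  # clear plurality winner
    if question not in expert_memory:
        # One judge call per question; the result is reused afterwards.
        expert_memory[question] = judge(question, candidates)
    choice = expert_memory[question]
    if choice.strip().lower() in INVALID_ANSWERS and fallback is not None:
        return fallback(question)          # e.g. a fresh CoT generation
    return choice
```

Passing the same expert_memory dictionary for every problem in a run mirrors the persistence described above and keeps repeated questions from triggering repeated judge calls.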

Step 7: Result Logging and Accuracy Tracking

For each problem, record the final answer, per-tree answers, reward scores, UCB values, correctness against ground truth, and running accuracy statistics. Results are written incrementally to a JSON file (one entry per problem) so partial results survive interruptions. A running correct count and total count are maintained for real-time accuracy monitoring.

Key considerations:

Output filenames are derived from dataset name, model name, and tree count
The JSON output includes full tree exploration data for post-hoc analysis
Correctness checking uses dataset-appropriate comparison (numeric for GSM8K, symbolic for MATH)
Final statistics include total inference calls and average tokens-per-second
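
Incremental logging with a running accuracy counter could be sketched as below. Writing one JSON object per line is an assumption about the on-disk layout, and the record field names, RunningAccuracy, and log_result are hypothetical; the actual output also carries reward scores, UCB values, and full tree data.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, List, TextIO


@dataclass
class RunningAccuracy:
    correct: int = 0
    total: int = 0

    def update(self, is_correct: bool) -> float:
        """Count one problem and return the accuracy so far."""
        self.total += 1
        self.correct += int(is_correct)
        return self.correct / self.total


def log_result(fh: TextIO, index: int, final_answer: str,
               tree_answers: List[str], ground_truth: str,
               is_correct: bool, stats: RunningAccuracy) -> Dict[str, Any]:
    """Append one problem's record and flush so partial results survive."""
    record = {
        "index": index,
        "final_answer": final_answer,
        "tree_answers": tree_answers,
        "ground_truth": ground_truth,
        "correct": is_correct,
        "running_accuracy": stats.update(is_correct),
    }
    fh.write(json.dumps(record) + "\n")
    fh.flush()
    return record
```

Flushing after every record trades a little I/O overhead for the guarantee that an interrupted run leaves every completed problem on disk.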

GitHub URL

Workflow Repository