Workflow:Iamhankai Forest of Thought FoT Benchmark Evaluation
| Knowledge Sources | |
|---|---|
| Domains | LLM_Reasoning, Test_Time_Compute, Benchmarking |
| Last Updated | 2026-02-14 03:00 GMT |
Overview
End-to-end process for evaluating Large Language Model reasoning on math benchmarks (GSM8K, MATH500, AIME) using the Forest-of-Thought framework with configurable base reasoning modes (MCTS, CoT, ToT) and consensus-guided decision making.
Description
This workflow implements the core Forest-of-Thought evaluation pipeline for mathematical reasoning benchmarks. It loads a local HuggingFace language model, reads a benchmark dataset, and runs multiple independent reasoning trees per problem. Each tree uses one of three base reasoning modes: Monte Carlo Tree Search with self-refinement (MCTS), Chain-of-Thought (CoT), or Tree-of-Thought (ToT). After all trees complete for a given problem, the framework aggregates answers via majority voting with optional early stopping. When no majority consensus is reached, a Consensus-Guided Decision Making (CGDM) strategy invokes an LLM-as-judge to select the best answer. Results including per-tree answers, scores, and correctness are written to JSON output files.
Usage
Execute this workflow when you have a mathematical reasoning benchmark dataset (GSM8K, MATH500, or AIME format) and want to evaluate an LLM's reasoning capabilities using Forest-of-Thought. The workflow requires a local HuggingFace model with GPU access (CUDA). It supports models from the Llama, Qwen, GLM, DeepSeek, and Mistral families. Use this when you need to reproduce the paper's benchmark results or evaluate a new model on established math reasoning tasks.
Execution Steps
Step 1: Environment Setup and Argument Parsing
Parse command-line arguments that configure the evaluation run. Key parameters include the model path, model type, dataset name, dataset file path, number of trees in the forest, maximum MCTS iterations per tree, the stopping strategy (CGDM, majority, score, random), and the base reasoning mode (MCTS, CoT, or ToT). Optional flags enable dynamic self-correction with a configurable confidence threshold.
Key considerations:
- The dataset identifier string encodes naming conventions used for output paths and answer format detection
- Start and end indices allow partial evaluation over dataset subsets
- The base mode selection determines which reasoning algorithm each tree executes
Step 2: Model Loading
Initialize the local language model pipeline. The loader auto-detects the model family (Llama, Qwen, GLM, DeepSeek, Mistral) from the model path name and configures appropriate tokenization and generation settings. Models are loaded in bfloat16 or float16 precision with automatic device mapping across available GPUs.
Key considerations:
- Qwen models receive a system prompt instructing step-by-step reasoning with boxed final answers
- Mistral models use a text-generation pipeline rather than direct model loading
- The Pipeline class tracks inference count and tokens-per-second for performance monitoring
- Dynamic self-correction is enabled per-model via a confidence threshold on log-probabilities
Step 3: Dataset Loading
Load the benchmark dataset from a Parquet or JSONL file. The loader extracts query-ground truth pairs based on the dataset type, handling different column schemas for GSM8K, MATH500, and AIME datasets. Queries and ground truth labels are paired for iterative evaluation.
Key considerations:
- Different datasets use different column names (question/problem, answer/solution)
- The answer format template is derived from the dataset type to guide model output formatting
- Few-shot learning examples are loaded from a pre-built example library matched to the dataset domain
Step 4: Forest Construction and Tree Execution
For each problem in the dataset, construct a forest of independent reasoning trees. The first tree uses a direct query (quick thinking), while subsequent trees prepend a similar example from the few-shot library to introduce input diversity (slow thinking). Each tree runs the selected base reasoning mode:
MCTS mode: Generate an initial answer, then iteratively select nodes via UCB scores, generate reflections/hints, refine answers, and backpropagate reward scores through the tree.
CoT mode: Generate a single chain-of-thought answer per tree, with diversity coming from the few-shot prefix variation.
ToT mode: Run a structured tree search with step-by-step proposals, value-based evaluation, and breadth-first or depth-first expansion.
Key considerations:
- Tree count is configurable; more trees increase coverage but also compute cost
- MCTS iterations per tree increase search depth within each tree
- UCB (Upper Confidence Bound) balances exploration vs exploitation in node selection
- Reward scoring uses an LLM self-critique mechanism that rates answers on a 0-100 scale
Step 5: Consensus and Early Stopping
After each tree completes, check if a majority consensus has been reached across all tree answers so far. If more than half the trees agree on the same extracted answer, trigger early stopping to save compute. The extracted answer is compared using dataset-specific label extraction (handling boxed answers, #### markers, numeric parsing).
Key considerations:
- Early stopping only activates after at least two trees have completed
- Answer extraction normalizes different output formats to comparable labels
- For MATH-type datasets, symbolic equivalence checking via sympy is used
- The majority threshold is strictly greater than half the total tree count
Step 6: Final Answer Selection (CGDM)
When no majority consensus is reached after all trees complete, apply the configured stopping strategy to select the final answer. Under CGDM (the default), first attempt majority voting on extracted answers. If tied, invoke an LLM-as-expert-judge that receives the original question and all candidate answers, then selects the best one. Fallback strategies include random selection and score-based selection.
Key considerations:
- The expert judge response is cached per question to avoid redundant LLM calls
- If the judge selects a nonsensical answer (like "I Don't Know"), a fresh CoT generation is used as fallback
- Score-based selection uses a weighted combination of reward scores, visit counts, and UCB values
- The expert memory dictionary persists across problems within a single evaluation run
Step 7: Result Logging and Accuracy Tracking
For each problem, record the final answer, per-tree answers, reward scores, UCB values, correctness against ground truth, and running accuracy statistics. Results are written incrementally to a JSON file (one entry per problem) so partial results survive interruptions. A running correct count and total count are maintained for real-time accuracy monitoring.
Key considerations:
- Output filenames are derived from dataset name, model name, and tree count
- The JSON output includes full tree exploration data for post-hoc analysis
- Correctness checking uses dataset-appropriate comparison (numeric for GSM8K, symbolic for MATH)
- Final statistics include total inference calls and average tokens-per-second