Workflow:Princeton nlp Tree of thought llm ToT BFS experiment

Knowledge Sources	Tree of Thought LLM; Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Domains	LLMs, Reasoning, Search_Algorithms
Last Updated	2025-02-14 03:00 GMT

Overview

End-to-end process for running a Tree of Thoughts Breadth-First Search (BFS) experiment on a supported benchmark task using LLM-generated thought candidates.

Description

This workflow executes the core Tree of Thoughts algorithm as described in the paper by Yao et al. (2023). The process takes a benchmark task (Game of 24, Creative Writing, or Mini Crosswords), generates candidate "thoughts" (intermediate reasoning steps) using an LLM, evaluates those candidates via LLM-based scoring, selects the most promising candidates, and iterates for a configurable number of steps. The final output is a set of candidate solutions along with logged trajectories and accuracy metrics.

Goal: A JSON log file containing solution trajectories, per-problem accuracy, and cumulative API token usage/cost.

Scope: From CLI argument parsing through the multi-step BFS search loop to result logging and metric aggregation.

Strategy: Uses configurable generation (sample or propose), evaluation (value or vote), and selection (sample or greedy) methods at each BFS step to explore the thought tree breadth-first while pruning unpromising branches.

Usage

Execute this workflow when you want to solve a benchmark reasoning task using the Tree of Thoughts BFS algorithm with an OpenAI GPT model. You need a valid OpenAI API key set as the OPENAI_API_KEY environment variable, the repository installed as a Python package, and a supported task dataset available in the data directory.

Execution Steps

Step 1: Environment setup

Configure the runtime environment by setting the OpenAI API key as an environment variable and installing the tot package either from PyPI or from source. Verify that the required task data files exist in the expected locations under src/tot/data/.

Key considerations:

The OPENAI_API_KEY environment variable must be set before running
Install from PyPI with pip install tree-of-thoughts-llm, or from source by running pip install -e . in the repo root
Game of 24 requires a 24.csv file in src/tot/data/24/
Creative Writing uses data_100_random_text.txt (included in repo)
Mini Crosswords uses mini0505.json (included in repo)
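The checks above can be scripted before launching a run. The following is a minimal sketch, not part of the repository: the helper name check_environment is hypothetical, and the text and crosswords subdirectory names are assumptions (only src/tot/data/24/ is stated above).

```python
import os
from pathlib import Path

# Paths relative to the repo root. Only src/tot/data/24/ is stated in this
# workflow; the "text" and "crosswords" subdirectories are assumptions.
EXPECTED_DATA_FILES = {
    "game24": Path("src/tot/data/24/24.csv"),
    "text": Path("src/tot/data/text/data_100_random_text.txt"),
    "crosswords": Path("src/tot/data/crosswords/mini0505.json"),
}


def check_environment(repo_root, task, env=None):
    """Return a list of setup problems; an empty list means ready to run."""
    env = os.environ if env is None else env
    problems = []
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is not set")
    data_file = Path(repo_root) / EXPECTED_DATA_FILES[task]
    if not data_file.is_file():
        problems.append(f"missing data file: {data_file}")
    return problems
```

Calling check_environment(".", "game24") from the repo root returns an empty list when both the key and the data file are in place.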

Step 2: Configure experiment parameters

Select the task, LLM backend, and BFS hyperparameters via command-line arguments to run.py. The key decisions are which task to run, which generation strategy to use (sample vs. propose), which evaluation strategy to use (value vs. vote), and the beam width (n_select_sample).

Key considerations:

Game of 24 typically uses propose generation + value evaluation + greedy selection
Creative Writing typically uses sample generation + vote evaluation + greedy selection
n_select_sample controls beam width (number of candidates kept per step, i.e., b in the paper)
n_generate_sample and n_evaluate_sample control how many LLM samples are requested per generation and per evaluation prompt, respectively
Temperature affects diversity of generated thoughts (0.7 for Game of 24, 1.0 for Creative Writing)
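To make the two typical configurations concrete, here is a small argparse sketch. It is not the repository's run.py: the flag names method_generate, method_evaluate, method_select, and backend, the task name strings, the defaults, and the sample counts are all assumptions chosen for illustration. Only the strategy choices and temperatures come from the list above.

```python
import argparse


def build_parser():
    """Parser sketch; flag names not mentioned in the text are assumed."""
    p = argparse.ArgumentParser(description="ToT BFS experiment (sketch)")
    p.add_argument("--task", required=True,
                   choices=["game24", "text", "crosswords"])  # assumed names
    p.add_argument("--backend", default="gpt-4")               # assumed flag
    p.add_argument("--temperature", type=float, default=0.7)
    p.add_argument("--method_generate", choices=["sample", "propose"])
    p.add_argument("--method_evaluate", choices=["value", "vote"])
    p.add_argument("--method_select", choices=["sample", "greedy"],
                   default="greedy")
    p.add_argument("--n_generate_sample", type=int, default=1)
    p.add_argument("--n_evaluate_sample", type=int, default=1)
    p.add_argument("--n_select_sample", type=int, default=1)   # beam width b
    return p


# Game of 24 style: propose + value + greedy at temperature 0.7.
# The counts 3 and 5 are illustrative, not taken from the text.
game24_args = build_parser().parse_args([
    "--task", "game24",
    "--method_generate", "propose",
    "--method_evaluate", "value",
    "--method_select", "greedy",
    "--n_evaluate_sample", "3",
    "--n_select_sample", "5",
])

# Creative Writing style: sample + vote + greedy at temperature 1.0.
text_args = build_parser().parse_args([
    "--task", "text",
    "--method_generate", "sample",
    "--method_evaluate", "vote",
    "--method_select", "greedy",
    "--n_generate_sample", "5",
    "--n_evaluate_sample", "5",
    "--n_select_sample", "1",
    "--temperature", "1.0",
])
```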

Step 3: Task instantiation

The get_task() factory function loads the appropriate Task subclass based on the task argument. The task object loads its dataset, configures the number of BFS steps, defines stop tokens for generation, and initializes any caches (e.g., value_cache for Game of 24).

What happens:

Task class loads data from CSV/JSON/text file under src/tot/data/
Sets self.steps (number of BFS iterations: 4 for Game of 24, 2 for Creative Writing)
Sets self.stops (stop token sequences per step)
Initializes value_cache dict for deduplicating LLM evaluation calls
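The shape of a task object can be illustrated with a toy stand-in. This sketch does not reproduce the repository's Task classes: ToyTask, get_toy_task, and cached_value are hypothetical names, and the data and stop sequences are placeholders. It shows only the attributes listed above (steps, stops, value_cache) and how a cache can deduplicate evaluation calls.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTask:
    """Hypothetical stand-in for a Task subclass, not the repository's class."""
    name: str
    data: list
    steps: int                 # number of BFS iterations
    stops: list                # one stop sequence (or None) per step
    value_cache: dict = field(default_factory=dict)

    def get_input(self, idx):
        return self.data[idx]


def get_toy_task(name):
    """Factory keyed on the task name, in the spirit of get_task()."""
    if name == "game24":
        # Placeholder puzzle and stop tokens; steps=4 matches the text.
        return ToyTask(name, data=["4 9 10 13"], steps=4, stops=["\n"] * 4)
    if name == "text":
        # Placeholder input and stop tokens; steps=2 matches the text.
        return ToyTask(name, data=["(seed sentences)"], steps=2,
                       stops=["\nPassage:\n", None])
    raise NotImplementedError(f"unknown task: {name}")


def cached_value(task, prompt, score_fn):
    """Score a prompt once; repeated prompts reuse the cached score."""
    if prompt not in task.value_cache:
        task.value_cache[prompt] = score_fn(prompt)
    return task.value_cache[prompt]
```

Because identical partial solutions often recur across branches, caching by prompt avoids paying for the same LLM evaluation twice.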

Step 4: BFS search loop (generate, evaluate, select)

The solve() function runs the core BFS algorithm. For each step, it generates new thought candidates from all current candidates, evaluates every new candidate using the LLM, and selects n_select_sample candidates based on their scores. This three-phase loop repeats for the number of steps defined by the task.

What happens at each step:

Generate: For each current candidate y, produce new candidates. Sample mode draws n_generate_sample independent completions from the LLM. Propose mode asks the LLM to enumerate possible next steps in a single response.
Evaluate: Score all new candidates. Value mode scores each candidate independently (returning sure/likely/impossible mapped to numeric values). Vote mode presents all candidates together and asks the LLM to pick the best.
Select: Keep n_select_sample candidates. Greedy mode keeps the highest-scoring candidates. Sample mode draws candidates with probability proportional to their scores.
The LLM calls are wrapped with exponential backoff retry logic for API reliability.
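The generate, evaluate, select loop can be sketched without any LLM by plugging in stub functions. This is an illustrative skeleton under stated assumptions, not the repository's solve(): bfs_solve, select, and the toy generate/evaluate functions are hypothetical, and the numeric values attached to the sure/likely/impossible labels are assumed (the text names the labels only).

```python
import random

# Assumed numeric mapping for the value labels.
VALUE_MAP = {"sure": 20.0, "likely": 1.0, "impossible": 0.001}


def select(candidates, scores, k, method, rng):
    """Keep k candidates, greedily or by score-proportional sampling."""
    if not candidates:
        return []
    ids = list(range(len(candidates)))
    if method == "greedy":
        chosen = sorted(ids, key=lambda i: scores[i], reverse=True)[:k]
    elif method == "sample":
        # Weighted sampling with replacement (a simplification).
        chosen = rng.choices(ids, weights=scores, k=k)
    else:
        raise ValueError(f"unknown selection method: {method}")
    return [candidates[i] for i in chosen]


def bfs_solve(x, steps, generate, evaluate, k, method="greedy", seed=0):
    rng = random.Random(seed)
    ys = [""]                  # frontier of partial solutions
    trajectory = []
    for step in range(steps):
        # Generate: expand every candidate on the frontier.
        new_ys = [new_y for y in ys for new_y in generate(x, y)]
        # Evaluate: one score per new candidate.
        scores = [evaluate(x, y) for y in new_ys]
        # Select: prune the frontier to k candidates.
        ys = select(new_ys, scores, k, method, rng)
        trajectory.append({"step": step, "new_ys": new_ys,
                           "values": scores, "selected": ys})
    return ys, trajectory


# Toy problem: spell the target string one letter per step.
def toy_generate(x, y):
    return [y + c for c in "abc"]              # stands in for propose mode


def toy_evaluate(x, y):
    label = "sure" if x.startswith(y) else "impossible"
    return VALUE_MAP[label]                    # stands in for value mode


best, trajectory = bfs_solve("abc", steps=3, generate=toy_generate,
                             evaluate=toy_evaluate, k=2)
```

With a beam width of 2 and greedy selection, the highest-scoring candidate after three steps is the target string, and the trajectory holds one record per step, which is the kind of information the real run logs.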

Step 5: Result validation and logging

After the BFS loop completes for each problem instance, validate the candidate solutions using the task-specific test_output() method. Log the full trajectory (all intermediate steps, candidates, scores, and selections) along with accuracy metrics and cumulative API token usage to a JSON file.

Key considerations:

Game of 24 validation uses sympy to check if the arithmetic expression equals 24
Creative Writing validation calls GPT-4 to produce a coherency score (1-10)
Logs are saved to logs/{task}/ with filenames encoding all hyperparameters
Cumulative accuracy (avg and any-correct) is printed to stdout after each problem
Token usage and estimated cost are tracked globally and saved in the log
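As an illustration of validation and logging, the sketch below checks a Game of 24 answer and writes a JSON log. It deliberately swaps sympy for an exact Fraction-based evaluator from the standard library so that it runs without extra dependencies. The names check_game24, log_filename, and write_log are hypothetical, the check that the expression uses exactly the input numbers is an added assumption, and the filename pattern is only one plausible way to encode hyperparameters.

```python
import ast
import json
import operator
import re
from fractions import Fraction
from pathlib import Path

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def _eval(node):
    """Evaluate a +, -, *, / expression tree exactly with Fractions."""
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    if isinstance(node, ast.Constant) and isinstance(node.value, int):
        return Fraction(node.value)
    raise ValueError("unsupported expression")


def check_game24(expression, numbers):
    """Return {'r': 1} if the expression uses the numbers and equals 24."""
    used = sorted(re.findall(r"\d+", expression))
    if used != sorted(str(n) for n in numbers):   # assumed extra check
        return {"r": 0}
    try:
        value = _eval(ast.parse(expression, mode="eval").body)
    except (ValueError, SyntaxError, ZeroDivisionError):
        return {"r": 0}
    return {"r": int(value == 24)}


def log_filename(backend, temperature, gen, n_gen, ev, n_ev, sel, n_sel):
    """Hypothetical filename pattern that encodes the hyperparameters."""
    return f"{backend}_{temperature}_{gen}{n_gen}_{ev}{n_ev}_{sel}{n_sel}.json"


def write_log(log_root, task, filename, records):
    """Write all records so far to logs/{task}/{filename} as JSON."""
    path = Path(log_root) / task / filename
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(records, indent=4))
    return path
```

For example, check_game24("(13 - 9) * (10 - 4)", [4, 9, 10, 13]) returns {"r": 1}, while an expression that sums the same four numbers returns {"r": 0}. Rewriting the whole file after each problem keeps the log usable if a long run is interrupted.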

Step 6: Aggregate and report metrics

After all problem instances have been processed, compute and print the final aggregate accuracy metrics (average accuracy and any-correct rate) across all problems, along with total API usage statistics.

Key considerations:

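cnt_avg accumulates each problem's mean accuracy over its candidates; dividing by the number of problems gives the average accuracy
cnt_any counts the problems where at least one candidate was correct; dividing by the number of problems gives the any-correct rate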
Final API cost is reported broken down by completion and prompt tokens
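The two accuracy metrics can be computed as follows. This is a minimal sketch assuming each problem yields a list of per-candidate results (1 for correct, 0 for incorrect); the function name aggregate is hypothetical.

```python
def aggregate(per_problem_accs):
    """Return (average accuracy, any-correct rate) across problems."""
    cnt_avg, cnt_any = 0.0, 0
    for accs in per_problem_accs:
        cnt_avg += sum(accs) / len(accs)   # mean over this problem's candidates
        cnt_any += any(accs)               # 1 if any candidate was correct
    n = len(per_problem_accs)
    return cnt_avg / n, cnt_any / n


# Four problems, two final candidates each.
avg_acc, any_rate = aggregate([[1, 0], [0, 0], [1, 1], [0, 1]])
# avg_acc is 0.5 and any_rate is 0.75
```

The any-correct rate is always at least as high as the average accuracy, since a problem counts fully toward it as soon as one candidate succeeds.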

GitHub URL

Workflow Repository