Workflow:Princeton nlp Tree of thought llm Baseline comparison

Knowledge Sources	Tree of Thought LLM; Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Domains	LLMs, Reasoning, Evaluation
Last Updated	2025-02-14 03:00 GMT

Overview

End-to-end process for running the naive input-output (IO) and Chain-of-Thought (CoT) sampling baselines on benchmark tasks, for comparison against Tree of Thoughts (ToT) performance.

Description

This workflow executes the standard prompting baselines described in the ToT paper. Rather than performing tree search, these baselines generate each complete solution in a single generation pass, either directly (IO prompting) or with step-by-step reasoning (CoT prompting). Multiple samples are drawn per problem so that both average and any-correct accuracy can be reported. These baselines serve as the control group against which the ToT BFS results are measured.

Goal: A JSON log file containing baseline solution samples, per-problem accuracy, and cumulative API token usage/cost for comparison against ToT.

Scope: From CLI argument parsing through single-pass sampling to result logging.

Strategy: Uses the naive_solve() function, which draws n independent samples from a single prompt (no iterative search). Each sample is then scored with the task's test_output() method.

Usage

Execute this workflow when you want to establish baseline performance for IO or CoT prompting on a benchmark task, typically to compare against a corresponding ToT BFS experiment. You need the same prerequisites as the ToT experiment (OpenAI API key, installed package, task data).

Execution Steps

Step 1: Environment setup

Configure the runtime environment identically to the ToT BFS workflow. Set the OpenAI API key environment variable and ensure the tot package is installed with all task data files available.

Key considerations:

Same environment requirements as the ToT BFS experiment
Baselines use the same LLM backend (GPT-4 by default) for fair comparison
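
A quick pre-flight check can catch a missing key before any tokens are spent. The sketch below assumes the key is read from an environment variable named OPENAI_API_KEY (this page does not name the variable), and missing_env_vars is a hypothetical helper, not part of the tot package.

```python
import os

# Assumption: the OpenAI key is supplied through OPENAI_API_KEY.
REQUIRED_VARS = ("OPENAI_API_KEY",)


def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Set these before running baselines:", ", ".join(missing))
else:
    print("Environment looks ready.")
```

Passing a plain dict as env makes the check easy to exercise without touching the real environment.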

Step 2: Configure baseline parameters

Select the task and prompting strategy via command-line arguments to run.py with the --naive_run flag. Choose between standard (IO) and cot prompting via --prompt_sample. Set --n_generate_sample to control how many solution samples are drawn per problem.

Key considerations:

The --naive_run flag activates the baseline path instead of BFS
--prompt_sample standard uses direct input-output prompting (few-shot examples only)
--prompt_sample cot uses chain-of-thought prompting (few-shot examples with reasoning steps)
Paper baselines use n_generate_sample=100 for Game of 24 and n_generate_sample=10 for other tasks
method_evaluate, method_select, and n_select_sample are ignored in naive mode
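
The flag combination above can be assembled programmatically, which helps when sweeping over both prompting strategies. This is a minimal sketch: build_baseline_argv is a hypothetical helper, and the --task flag and the task name game24 are assumptions, since this page only says the task is selected via command-line arguments.

```python
import shlex


def build_baseline_argv(task, prompt_sample, n_generate_sample):
    """Assemble the argument list for one naive baseline run of run.py."""
    if prompt_sample not in ("standard", "cot"):
        raise ValueError("prompt_sample must be 'standard' or 'cot'")
    return [
        "python", "run.py",
        "--task", task,  # assumption: the task is chosen with --task
        "--naive_run",   # baseline path instead of BFS
        "--prompt_sample", prompt_sample,
        "--n_generate_sample", str(n_generate_sample),
    ]


argv = build_baseline_argv("game24", "cot", 100)
print(shlex.join(argv))
```

The resulting list can be handed to subprocess.run(argv, check=True) from the repository root.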

Step 3: Task instantiation

Load the task object using the same get_task() factory as the ToT workflow. The task loads its dataset and prompt templates, but its multi-step BFS configuration (steps, stops) is not used in naive mode: generation runs with stop set to None, so each completion is produced in a single pass.

Key considerations:

The same Task subclass is used for both baselines and ToT
Prompt templates (standard_prompt and cot_prompt) are defined per task in src/tot/prompts/
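
To make the shared-task idea concrete, the sketch below shows one task object serving both prompting strategies. Everything here is illustrative: ToyTask, get_toy_task, the template text, and the get_input accessor are assumptions made for this example, and the real templates are the ones defined per task in src/tot/prompts/.

```python
# Hypothetical placeholder templates; the real ones are defined per task.
STANDARD_PROMPT = (
    "Input: <example input>\nAnswer: <example answer>\n"
    "Input: {input}\nAnswer:"
)
COT_PROMPT = (
    "Input: <example input>\nSteps: <example reasoning>\n"
    "Answer: <example answer>\n"
    "Input: {input}\nSteps:"
)


class ToyTask:
    """Toy stand-in for a Task subclass: a dataset plus two prompt wrappers."""

    def __init__(self, inputs):
        self.inputs = list(inputs)

    def __len__(self):
        return len(self.inputs)

    def get_input(self, idx):
        return self.inputs[idx]

    def standard_prompt_wrap(self, x):
        return STANDARD_PROMPT.format(input=x)

    def cot_prompt_wrap(self, x):
        return COT_PROMPT.format(input=x)


def get_toy_task(name):
    """Hypothetical factory mirroring the role of get_task()."""
    registry = {"toy24": lambda: ToyTask(["4 5 6 10", "1 2 3 4"])}
    if name not in registry:
        raise KeyError(f"unknown task: {name}")
    return registry[name]()


task = get_toy_task("toy24")
print(task.cot_prompt_wrap(task.get_input(0)))
```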

Step 4: Naive sampling

The naive_solve() function generates all solution candidates in a single sampling pass. It wraps the input into a prompt using the selected prompting strategy (standard or CoT), then calls the LLM wrapper once with n=n_generate_sample. The wrapper may split that call into several API requests of at most 20 completions each.

What happens:

The input is formatted using either standard_prompt_wrap() or cot_prompt_wrap()
The LLM wrapper is called once with n samples requested; it issues API requests of at most 20 completions each until n is reached
All n completions are returned as candidate solutions
No iterative evaluation or selection occurs
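
The steps above can be sketched as follows. This is not the repository's implementation: fake_llm stands in for the real API wrapper, and naive_solve_sketch and sample_in_batches are hypothetical names. Only the batch size of 20 is taken from the description above.

```python
def fake_llm(prompt, n):
    """Stand-in for the chat API: returns n dummy completions."""
    return [f"sample {i}" for i in range(n)]


def sample_in_batches(llm, prompt, n, batch_size=20):
    """Request n completions in chunks of at most batch_size.

    Returns the completions and the number of requests issued.
    """
    outputs, requests = [], 0
    while n > 0:
        cnt = min(n, batch_size)
        outputs.extend(llm(prompt, cnt))
        requests += 1
        n -= cnt
    return outputs, requests


def naive_solve_sketch(wrap, x, n_generate_sample, llm=fake_llm):
    """Wrap the input once, then sample; no evaluation or selection."""
    prompt = wrap(x)
    return sample_in_batches(llm, prompt, n_generate_sample)


ys, n_requests = naive_solve_sketch(
    lambda x: f"Input: {x}\nAnswer:", "4 5 6 10", 100
)
print(len(ys), n_requests)
```

With 100 samples and a batch size of 20, the sketch issues five requests, which is worth keeping in mind when comparing API call counts against ToT.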

Step 5: Result validation and logging

Validate each candidate solution using the task-specific test_output() method and log results. The logging format and metric computation are identical to the ToT workflow.

Key considerations:

Same validation logic as the ToT workflow (sympy for Game of 24, GPT-4 scoring for Creative Writing)
Log filenames include "naive" to distinguish from ToT logs
Same accuracy metrics (avg and any-correct) are computed for comparison
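
The two per-problem metrics can be illustrated with a short sketch. The checker is_24 is a deliberately naive toy, not the sympy-based validation, and score_samples and the record layout are hypothetical names chosen for this example.

```python
import json


def score_samples(samples, is_correct):
    """Compute per-problem accuracy flags plus avg and any-correct."""
    accs = [int(bool(is_correct(y))) for y in samples]
    avg = sum(accs) / len(accs) if accs else 0.0
    return {"accs": accs, "avg": avg, "any": any(accs)}


def is_24(y):
    """Toy checker: only looks at the claimed result, not the arithmetic."""
    return y.strip().endswith("= 24")


samples = ["4 * 6 = 24", "5 + 10 = 15", "6 * 4 = 24", "10 - 4 = 6"]
record = {"idx": 0, "ys": samples, **score_samples(samples, is_24)}
print(json.dumps(record))
```

Here avg is the fraction of samples that pass, while any only asks whether at least one sample passes, so any is always at least as favorable as avg.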

Step 6: Aggregate and compare metrics

Compute final aggregate accuracy and API usage. These numbers can be directly compared against the ToT BFS experiment results to measure the benefit of tree search.

Key considerations:

Baselines typically use fewer API calls per problem than ToT but may have lower accuracy
Paper results: Game of 24 success rate IO=7.3%, CoT=4.0%, ToT (breadth b=5)=74%; Creative Writing average GPT-4 coherency score (1-10) IO=6.19, CoT=6.93, ToT=7.56
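
A small sketch of the aggregation and comparison follows. The per-problem dictionaries use hypothetical avg and any keys, the helper names are illustrative, and the reference numbers are the paper results quoted above.

```python
def aggregate(per_problem):
    """Average the per-problem avg and any-correct values over a run."""
    n = len(per_problem)
    if n == 0:
        return {"avg": 0.0, "any": 0.0}
    return {
        "avg": sum(p["avg"] for p in per_problem) / n,
        "any": sum(1 for p in per_problem if p["any"]) / n,
    }


# Paper results as quoted in this section.
PAPER_RESULTS = {
    "Game of 24 (success %)": {"IO": 7.3, "CoT": 4.0, "ToT": 74.0},
    "Creative Writing (score)": {"IO": 6.19, "CoT": 6.93, "ToT": 7.56},
}


def tot_gain(results):
    """Return the stronger baseline and ToT's margin over it."""
    best = max(("IO", "CoT"), key=lambda k: results[k])
    return best, round(results["ToT"] - results[best], 2)


for name, results in PAPER_RESULTS.items():
    best, gain = tot_gain(results)
    print(f"{name}: ToT beats {best} by {gain}")
```

Note that the stronger baseline differs by task: IO on Game of 24 and CoT on Creative Writing.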

GitHub URL

Workflow Repository