Workflow:Princeton nlp Tree of thought llm Adding new task

Knowledge Sources	Tree of Thought LLM; Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Domains	LLMs, Reasoning, Framework_Extension
Last Updated	2025-02-14 03:00 GMT

Overview

End-to-end process for extending the Tree of Thoughts framework with a new benchmark task by implementing the Task interface, defining prompt templates, and registering the task in the factory.

Description

This workflow guides a developer through adding a new reasoning task to the ToT framework. The framework defines a Task base class with standardized methods for data loading, prompt construction, output validation, and evaluation. Adding a task means implementing this interface, writing few-shot prompt templates for the chosen generation and evaluation strategies, placing data files in the correct directory, and registering the task in the get_task() factory.

Goal: A fully integrated new task that can be run with both naive baselines and ToT BFS via the existing run.py CLI.

Scope: From choosing the generation/evaluation strategy through implementing all required classes and prompts to running a successful experiment.

Strategy: Follow the existing Game of 24 implementation as a reference pattern, adapting the task-specific logic (data loading, prompt templates, output validation) to the new domain.

Usage

Execute this workflow when you have a new reasoning or generation task that you want to evaluate using Tree of Thoughts search. The task should have clearly defined inputs and outputs and a way to validate correctness. Before starting, decide between sample and propose for generation and between value and vote for evaluation, based on the nature of the task.

Execution Steps

Step 1: Design the task interface

Analyze the new task to determine its input format, output format, number of reasoning steps, evaluation criteria, and which generation/evaluation strategies are appropriate. Decide between propose (sequential thought enumeration) and sample (independent sampling) for generation, and between value (independent scoring) and vote (comparative ranking) for evaluation.

Key considerations:

Propose works well when thoughts can be enumerated (e.g., arithmetic operations on a set of numbers)
Sample works well when thoughts are more open-ended (e.g., creative writing paragraphs)
Value works well when individual candidates can be assessed in isolation
Vote works well when candidates need to be compared against each other
Number of steps should match the natural decomposition depth of the task
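
One way to make these decisions explicit before writing any code is to record them in a small structure. The sketch below is illustrative only: TaskDesign, its field names, and the two example task names are hypothetical and not part of the ToT codebase; only the strategy names come from the choices described above.

```python
from dataclasses import dataclass

GENERATION_STRATEGIES = ("sample", "propose")
EVALUATION_STRATEGIES = ("value", "vote")


@dataclass(frozen=True)
class TaskDesign:
    """Hypothetical record of the Step 1 decisions (not part of the ToT codebase)."""
    name: str
    steps: int             # natural decomposition depth of the task
    method_generate: str   # "sample" or "propose"
    method_evaluate: str   # "value" or "vote"

    def __post_init__(self):
        if self.steps < 1:
            raise ValueError("steps must be at least 1")
        if self.method_generate not in GENERATION_STRATEGIES:
            raise ValueError(f"unknown generation strategy: {self.method_generate}")
        if self.method_evaluate not in EVALUATION_STRATEGIES:
            raise ValueError(f"unknown evaluation strategy: {self.method_evaluate}")


# Enumerable, arithmetic-style task: propose thoughts, value each one in isolation.
arithmetic_design = TaskDesign("my_arithmetic_task", steps=3,
                               method_generate="propose", method_evaluate="value")

# Open-ended writing task: sample thoughts, vote across candidates.
writing_design = TaskDesign("my_writing_task", steps=2,
                            method_generate="sample", method_evaluate="vote")
```

Writing the choices down this way catches an invalid strategy name early, before it reaches the CLI.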

Step 2: Prepare task data

Create or format the dataset file and place it in the appropriate directory under src/tot/data/. The data file should contain all problem instances that the task will iterate over.

Key considerations:

Data goes in src/tot/data/{task_name}/ directory
The file format must be loadable in the Task's __init__ (CSV, JSON, and plain text are common)
The Task.__len__() method must return the total number of problem instances
get_input(idx) must return the string input for problem index idx
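
The loading contract above can be sketched as follows. This is a minimal example under stated assumptions: the class name MyTaskData, the file name problems.csv, and the "problem" column are all hypothetical, and a temporary directory stands in for src/tot/data/{task_name}/ so the snippet runs on its own.

```python
import csv
import os
import tempfile


class MyTaskData:
    """Hypothetical loader showing the __len__ / get_input contract."""

    def __init__(self, path):
        with open(path, newline="") as f:
            # Assumed layout: one problem per row in a "problem" column.
            self.data = [row["problem"] for row in csv.DictReader(f)]

    def __len__(self):
        return len(self.data)

    def get_input(self, idx):
        return self.data[idx]


# Stand-in for src/tot/data/{task_name}/: a temporary directory with a tiny dataset.
data_dir = tempfile.mkdtemp()
data_path = os.path.join(data_dir, "problems.csv")
with open(data_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["problem"])
    writer.writerows([["2 3 4"], ["1 5 9"], ["7 7 8"]])

dataset = MyTaskData(data_path)
print(len(dataset), repr(dataset.get_input(1)))
```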

Step 3: Implement the Task subclass

Create a new Python file in src/tot/tasks/ that extends the Task base class. Implement all required methods: __init__ (load data, set steps/stops), __len__, get_input, test_output, and the prompt wrapping methods needed for the chosen strategies.

Required methods:

__init__: Load data file, set self.steps (BFS depth) and self.stops (generation stop tokens per step)
__len__: Return dataset size
get_input(idx): Return input string for problem idx
test_output(idx, output): Validate a solution and return a dict with key 'r' (a reward of 0 or 1, or a numeric score)
standard_prompt_wrap(x, y): Format input for IO baseline
cot_prompt_wrap(x, y): Format input for CoT baseline
For propose strategy: propose_prompt_wrap(x, y), with a prompt format from which the individual proposed thoughts can be parsed
For value strategy: value_prompt_wrap(x, y) and value_outputs_unwrap(x, y, outputs)
For vote strategy: vote_prompt_wrap(x, ys) and vote_outputs_unwrap(outputs, n_candidates)
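
The method list above can be sketched as a skeleton. Everything task-specific here is hypothetical: SumTask is a toy "add the numbers" task invented for illustration, the Task class is a stand-in for the framework's base class so the snippet runs alone, the prompt strings are placeholders, and the label-to-score mapping in value_outputs_unwrap is an assumed weighting rather than a prescribed one.

```python
import re


class Task:
    """Stand-in for the framework's Task base class, so the sketch runs alone."""
    def __len__(self): raise NotImplementedError
    def get_input(self, idx): raise NotImplementedError
    def test_output(self, idx, output): raise NotImplementedError


# Placeholder templates; in the workflow these would live in a prompts module.
standard_prompt = "Add the numbers.\nInput: {input}\nAnswer: "
cot_prompt = "Add the numbers step by step, then give the answer.\nInput: {input}\nSteps:\n"
value_prompt = ("Can this partial solution still reach the correct total? "
                "(sure/likely/impossible)\nInput: {input}\nSo far:\n{partial}\nJudgement: ")


class SumTask(Task):
    """Hypothetical toy task: add a list of integers."""

    def __init__(self, problems=("2 3 4", "1 5 9")):
        super().__init__()
        self.data = list(problems)         # a real task would load a data file here
        self.steps = 2                     # BFS depth
        self.stops = ["\n"] * self.steps   # one stop token per step

    def __len__(self):
        return len(self.data)

    def get_input(self, idx):
        return self.data[idx]

    def test_output(self, idx, output):
        expected = sum(int(n) for n in self.data[idx].split())
        match = re.search(r"Answer:\s*(-?\d+)", output)
        return {"r": int(match is not None and int(match.group(1)) == expected)}

    @staticmethod
    def standard_prompt_wrap(x, y=""):
        return standard_prompt.format(input=x) + y

    @staticmethod
    def cot_prompt_wrap(x, y=""):
        return cot_prompt.format(input=x) + y

    @staticmethod
    def value_prompt_wrap(x, y):
        return value_prompt.format(input=x, partial=y)

    @staticmethod
    def value_outputs_unwrap(x, y, value_outputs):
        # Assumed label-to-score mapping; tune the weights for the task.
        scores = {"impossible": 0.001, "likely": 1.0, "sure": 20.0}
        labels = [out.strip().split("\n")[-1].lower() for out in value_outputs]
        return sum(scores.get(label, 0.0) for label in labels)
```

This toy task uses the value strategy, so the vote methods are omitted; a task that uses vote would add vote_prompt_wrap and vote_outputs_unwrap in the same style.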

Step 4: Create prompt templates

Create a new Python file in src/tot/prompts/ containing all few-shot prompt templates for the task. Each template should include carefully crafted examples that demonstrate the expected input/output format.

Key considerations:

standard_prompt: Direct input-output few-shot examples (typically 3-5 shots)
cot_prompt: Step-by-step reasoning examples
propose_prompt: Examples showing how to enumerate possible next steps (if using propose)
value_prompt: Examples of evaluating intermediate states with a quality label (if using value)
vote_prompt: Instructions for comparative ranking of candidates (if using vote)
Prompt quality is critical because it directly determines the quality of thought generation and evaluation
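
As a rough illustration of what such a prompts module might contain, the sketch below defines templates for a hypothetical "add the numbers" task. The wording, the few-shot examples, and the {input} and {remaining} slot names are all invented for this example; real templates should be written and tuned for the actual task.

```python
# Hypothetical contents of a prompts module for a toy "add the numbers" task.
# Each template leaves an {input} slot that the Task's *_prompt_wrap methods fill in.

standard_prompt = """Add all the numbers and give the total.
Input: 1 2 3
Answer: 6
Input: 4 4 10
Answer: 18
Input: 7 0 5
Answer: 12
Input: {input}
"""

cot_prompt = """Add the numbers one step at a time, then give the total.
Input: 1 2 3
Steps:
1 + 2 = 3 (left: 3 3)
3 + 3 = 6 (left: 6)
Answer: 6
Input: {input}
"""

propose_prompt = """Input: 1 2 3
Possible next steps:
1 + 2 = 3 (left: 3 3)
1 + 3 = 4 (left: 2 4)
2 + 3 = 5 (left: 1 5)
Input: {input}
Possible next steps:
"""

value_prompt = """Judge whether the remaining numbers preserve the original total (sure/impossible).
Original: 1 2 3, remaining: 3 3
sure
Original: 1 2 3, remaining: 3 4
impossible
Original: {input}, remaining: {remaining}
"""


def standard_prompt_wrap(x, y=""):
    """Fill the template with the problem input and append any partial output."""
    return standard_prompt.format(input=x) + y


print(standard_prompt_wrap("2 3 4"))
```

Keeping the templates as module-level strings separates prompt wording from task logic, so prompts can be revised without touching the Task subclass.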

Step 5: Register task in factory

Add the new task to the get_task() factory function in src/tot/tasks/__init__.py and add the task name to the --task argument choices in run.py.

What to update:

src/tot/tasks/__init__.py: Add an elif branch that imports and returns the new Task subclass
run.py: Add the task name to the choices list in the --task argparse argument
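
The two updates can be sketched together. This is a self-contained approximation, not the repository's code: Game24Task and MyTask are empty stand-in classes, "my_task" is a hypothetical task name, and build_parser is a hypothetical helper that shows only the --task argument.

```python
import argparse


class Game24Task:
    """Stand-in for an existing task class."""


class MyTask:
    """Stand-in for the new Task subclass."""


def get_task(name):
    # In the repository, each branch would import the class from its own module.
    if name == "game24":
        return Game24Task()
    elif name == "my_task":   # new branch for the new task
        return MyTask()
    else:
        raise NotImplementedError(f"unknown task: {name}")


def build_parser():
    parser = argparse.ArgumentParser()
    # The new name must also appear in choices, or argparse rejects it
    # before get_task is ever called.
    parser.add_argument("--task", type=str, required=True,
                        choices=["game24", "my_task"])
    return parser


args = build_parser().parse_args(["--task", "my_task"])
task = get_task(args.task)
print(type(task).__name__)
```

Both edits are needed: the factory branch alone is unreachable from the CLI, and the choices entry alone leads to an error inside get_task.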

Step 6: Create experiment scripts

Create shell scripts in scripts/{task_name}/ for running the new task with different methods (standard baseline, CoT baseline, ToT BFS). These scripts encode the recommended hyperparameters for the task.

Key considerations:

Create at least three scripts: standard_sampling.sh, cot_sampling.sh, and bfs.sh
Each script calls run.py with the appropriate arguments for the task
Set task_start_index and task_end_index to cover the desired evaluation range
Choose n_generate_sample, n_evaluate_sample, and n_select_sample based on task complexity
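
To show what the three scripts might invoke, the sketch below builds the corresponding command lines in Python and prints them. Several things are assumptions: run_command is a hypothetical helper, "my_task" is a hypothetical task name, the flag names other than the index and n_* options (--naive_run, --prompt_sample, --method_generate, --method_evaluate, --method_select) should be checked against the argparse definitions in run.py, and every numeric value is a placeholder rather than a recommended setting.

```python
import shlex


def run_command(task, start, end, extra):
    """Build a run.py invocation as an argument list (hypothetical helper)."""
    return ["python", "run.py", "--task", task,
            "--task_start_index", str(start),
            "--task_end_index", str(end), *extra]


TASK = "my_task"  # hypothetical task name

commands = {
    # Flag names other than the index and n_* options are assumed; check run.py.
    "standard_sampling.sh": run_command(TASK, 0, 5, [
        "--naive_run", "--prompt_sample", "standard",
        "--n_generate_sample", "10"]),
    "cot_sampling.sh": run_command(TASK, 0, 5, [
        "--naive_run", "--prompt_sample", "cot",
        "--n_generate_sample", "10"]),
    "bfs.sh": run_command(TASK, 0, 5, [
        "--method_generate", "propose", "--method_evaluate", "value",
        "--method_select", "greedy",
        "--n_generate_sample", "1", "--n_evaluate_sample", "3",
        "--n_select_sample", "5"]),
}

for script, argv in commands.items():
    print(f"# {script}")
    print(shlex.join(argv))
```

Each printed line can be pasted into the matching shell script once the flags and values have been confirmed for the task.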

Step 7: Validate with a test run

Run the experiment scripts on a small subset of problems to verify correctness. Check that prompts are well-formed, LLM outputs are parsed correctly, validation logic works, and logs are saved properly.

Key considerations:

Start with a small range (e.g., task_start_index=0, task_end_index=5) to minimize API cost
Verify that test_output() correctly identifies valid and invalid solutions
Check log files for proper structure and reasonable scores
Compare naive baseline accuracy against ToT accuracy on the same problems to check whether the tree search adds value
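
The validator check and the baseline comparison can be sketched as below. The helper names check_output and accuracy are hypothetical, the validator is a toy that expects a final "Answer: <total>" line, and the reward lists are made-up placeholders for illustration, not experimental results.

```python
def check_output(expected_total, output):
    """Toy validator (hypothetical): reward 1 if the last line is 'Answer: <total>'."""
    last_line = output.strip().split("\n")[-1]
    return {"r": int(last_line == f"Answer: {expected_total}")}


def accuracy(rewards):
    """Mean reward over a list of test_output-style dicts."""
    return sum(info["r"] for info in rewards) / len(rewards) if rewards else 0.0


# 1. Check the validator against outputs whose correctness is known in advance.
known_cases = [
    ("2 + 3 = 5\n5 + 4 = 9\nAnswer: 9", 1),   # valid solution
    ("2 + 3 = 5\nAnswer: 5", 0),              # wrong total
    ("I am not sure.", 0),                    # malformed output
]
for output, expected_reward in known_cases:
    assert check_output(9, output)["r"] == expected_reward

# 2. Compare methods on the same small subset (placeholder rewards, illustration only).
baseline_rewards = [{"r": 0}, {"r": 1}, {"r": 0}, {"r": 0}, {"r": 1}]
tot_rewards = [{"r": 1}, {"r": 1}, {"r": 0}, {"r": 1}, {"r": 1}]
print(f"baseline: {accuracy(baseline_rewards):.2f}, ToT: {accuracy(tot_rewards):.2f}")
```

Running the known-case check before any API calls separates validator bugs from model errors, which keeps the later accuracy comparison meaningful.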
