Workflow:Princeton nlp Tree of thought llm Adding new task
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Reasoning, Framework_Extension |
| Last Updated | 2025-02-14 03:00 GMT |
Overview
End-to-end process for extending the Tree of Thoughts framework with a new benchmark task by implementing the Task interface, defining prompt templates, and registering the task in the factory.
Description
This workflow guides a developer through adding a new reasoning task to the ToT framework. The framework uses a Task base class interface with standardized methods for data loading, prompt construction, output validation, and evaluation. A new task requires implementing this interface, crafting few-shot prompt templates for the chosen generation and evaluation strategies, placing data files in the correct directory, and registering the task in the get_task() factory.
Goal: A fully integrated new task that can be run with both naive baselines and ToT BFS via the existing run.py CLI.
Scope: From choosing the generation/evaluation strategy through implementing all required classes and prompts to running a successful experiment.
Strategy: Follow the existing Game of 24 implementation as a reference pattern, adapting the task-specific logic (data loading, prompt templates, output validation) to the new domain.
Usage
Execute this workflow when you have a new reasoning or generation task that you want to evaluate using Tree of Thoughts search. The task should have clearly defined inputs and outputs and a way to validate correctness. You need to choose between sample and propose for generation, and between value and vote for evaluation, based on the nature of the task.
Execution Steps
Step 1: Design the task interface
Analyze the new task to determine its input format, output format, number of reasoning steps, evaluation criteria, and which generation/evaluation strategies are appropriate. Decide between propose (sequential thought enumeration) and sample (independent sampling) for generation, and between value (independent scoring) and vote (comparative ranking) for evaluation.
Key considerations:
- Propose works well when thoughts can be enumerated (e.g., arithmetic operations on a set of numbers)
- Sample works well when thoughts are more open-ended (e.g., creative writing paragraphs)
- Value works well when individual candidates can be assessed in isolation
- Vote works well when candidates need to be compared against each other
- Number of steps should match the natural decomposition depth of the task
Step 2: Prepare task data
Create or format the dataset file and place it in the appropriate directory under src/tot/data/. The data file should contain all problem instances that the task will iterate over.
Key considerations:
- Data goes in src/tot/data/{task_name}/ directory
- The format should be loadable by the Task subclass's __init__ (CSV, JSON, or plain text are common)
- The Task.__len__() method must return the total number of problem instances
- get_input(idx) must return the string input for problem index idx
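The data-loading contract above can be sketched as follows. This is a minimal illustration rather than the framework's own loader: the class name ToyData, the CSV layout, and the "input" column name are all hypothetical, and the file is written to a temporary directory so the example runs on its own.

```python
import csv
import os
import tempfile


class ToyData:
    """Hypothetical loader: one problem per CSV row, input text in an 'input' column."""

    def __init__(self, path: str, column: str = "input"):
        # Read every row once at construction time, as a Task's __init__ would.
        with open(path, newline="", encoding="utf-8") as f:
            self.data = [row[column] for row in csv.DictReader(f)]

    def __len__(self) -> int:
        # Total number of problem instances.
        return len(self.data)

    def get_input(self, idx: int) -> str:
        # String input for problem index idx.
        return self.data[idx]


# Write a two-row CSV to a temporary directory and load it back.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "toy.csv")
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "input"])
        writer.writerow([0, "1 2 3"])
        writer.writerow([1, "10 20"])
    toy = ToyData(csv_path)

print(len(toy), repr(toy.get_input(0)))
```

In a real task the path would point into src/tot/data/{task_name}/ instead of a temporary directory.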
Step 3: Implement the Task subclass
Create a new Python file in src/tot/tasks/ that extends the Task base class. Implement all required methods: __init__ (load data, set steps/stops), __len__, get_input, test_output, and the prompt wrapping methods needed for the chosen strategies.
Required methods:
- __init__: Load data file, set self.steps (BFS depth) and self.stops (generation stop tokens per step)
- __len__: Return dataset size
- get_input(idx): Return input string for problem idx
- test_output(idx, output): Validate the solution and return a dict with key 'r' holding the reward (0 or 1, or a graded score)
- standard_prompt_wrap(x, y): Format input for IO baseline
- cot_prompt_wrap(x, y): Format input for CoT baseline
- For propose strategy: propose_prompt_wrap(x, y), with a prompt that elicits output from which individual proposed thoughts can be parsed
- For value strategy: value_prompt_wrap(x, y) and value_outputs_unwrap(x, y, outputs)
- For vote strategy: vote_prompt_wrap(x, ys) and vote_outputs_unwrap(outputs, n_candidates)
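A skeleton covering the core required methods might look like the sketch below. It makes several assumptions: the Task class defined here is only a stand-in so the example runs without the framework installed (a real task would extend the framework's own base class), and ToySumTask, its inline data, and its two prompt strings are hypothetical. Strategy-specific methods (propose, value, vote) are omitted for brevity.

```python
import re


class Task:
    """Stand-in for the framework's Task base class, so this sketch runs alone."""

    def __len__(self) -> int:
        raise NotImplementedError

    def get_input(self, idx: int) -> str:
        raise NotImplementedError

    def test_output(self, idx: int, output: str) -> dict:
        raise NotImplementedError


# Hypothetical templates; a real task would keep these in its prompts module.
standard_prompt = "Add the numbers.\nInput: {input}\nAnswer: "
cot_prompt = "Add the numbers step by step.\nInput: {input}\nSteps:\n"


class ToySumTask(Task):
    """Hypothetical task: the last line of the output must end with the sum."""

    def __init__(self):
        super().__init__()
        self.data = ["1 2 3", "10 20"]  # inline data keeps the sketch self-contained
        self.steps = 2                  # BFS depth
        self.stops = ["\n"] * 2         # one generation stop token per step

    def __len__(self) -> int:
        return len(self.data)

    def get_input(self, idx: int) -> str:
        return self.data[idx]

    def test_output(self, idx: int, output: str) -> dict:
        # Compare the last integer on the last line with the expected sum.
        expected = sum(int(tok) for tok in self.data[idx].split())
        last_line = output.strip().split("\n")[-1]
        numbers = re.findall(r"-?\d+", last_line)
        correct = bool(numbers) and int(numbers[-1]) == expected
        return {"r": int(correct)}

    @staticmethod
    def standard_prompt_wrap(x: str, y: str = "") -> str:
        return standard_prompt.format(input=x) + y

    @staticmethod
    def cot_prompt_wrap(x: str, y: str = "") -> str:
        return cot_prompt.format(input=x) + y


task = ToySumTask()
print(task.test_output(0, "1 + 2 = 3\n3 + 3 = 6"))
```

Keeping the prompt wrappers as static methods that only format strings makes them easy to test without calling a model.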
Step 4: Create prompt templates
Create a new Python file in src/tot/prompts/ containing all few-shot prompt templates for the task. Each template should include carefully crafted examples that demonstrate the expected input/output format.
Key considerations:
- standard_prompt: Direct input-output few-shot examples (typically 3-5 shots)
- cot_prompt: Step-by-step reasoning examples
- propose_prompt: Examples showing how to enumerate possible next steps (if using propose)
- value_prompt: Examples of evaluating intermediate states with a quality label (if using value)
- vote_prompt: Instructions for comparative ranking of candidates (if using vote)
- Prompt quality is critical as it directly determines thought generation and evaluation quality
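As a rough illustration of the template style described above, the sketch below defines a standard prompt and a value prompt for a toy "add the numbers" task and converts sampled quality labels into a single score. The prompt wording, the labels sure/likely/impossible, and the weights in VALUE_WEIGHTS are assumptions chosen for the example, not values prescribed by the framework.

```python
# Hypothetical few-shot templates for a toy "add the numbers" task.
standard_prompt = """Add all the numbers and give the total.
Input: 1 2 3
Answer: 6
Input: 10 20
Answer: 30
Input: {input}
Answer: """

value_prompt = """Judge whether the partial work is correct for the input (sure/likely/impossible).
Input: 1 2 3
Partial work: 1 + 2 = 3
sure
Input: 1 2 3
Partial work: 1 + 2 = 4
impossible
Input: {input}
Partial work: {partial}
"""

# Illustrative label weights; tune them for the task at hand.
VALUE_WEIGHTS = {"impossible": 0.0, "likely": 1.0, "sure": 10.0}


def value_prompt_wrap(x: str, y: str) -> str:
    # Fill the template with the problem input x and the partial solution y.
    return value_prompt.format(input=x, partial=y)


def value_outputs_unwrap(x: str, y: str, outputs: list) -> float:
    # Treat the last line of each sampled evaluation as its label and sum the weights.
    labels = [out.strip().split("\n")[-1].strip().lower() for out in outputs]
    return sum(VALUE_WEIGHTS.get(label, 0.0) for label in labels)


score = value_outputs_unwrap("4 5", "4 + 5 = 9", ["sure", "The sum is right.\nlikely", "???"])
print(score)
```

Because the templates are filled with str.format, any literal braces inside a prompt must be doubled ({{ and }}), otherwise formatting fails or substitutes the wrong field.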
Step 5: Register task in factory
Add the new task to the get_task() factory function in src/tot/tasks/__init__.py and add the task name to the --task argument choices in run.py.
What to update:
- src/tot/tasks/__init__.py: Add an elif branch that imports and returns the new Task subclass
- run.py: Add the task name to the choices list in the --task argparse argument
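The two registration changes can be sketched together as below. This is a self-contained approximation: Game24Task and MyNewTask are empty stand-in classes, the task name my_new_task is hypothetical, and in the real files the import of the task class would sit inside the corresponding branch rather than at module level.

```python
import argparse


class Game24Task:
    """Stand-in for the existing reference task."""


class MyNewTask:
    """Hypothetical new Task subclass."""


def get_task(name: str):
    # Factory: map a task name to a fresh task instance.
    if name == "game24":
        return Game24Task()
    elif name == "my_new_task":  # new branch for the added task
        return MyNewTask()
    else:
        raise NotImplementedError(f"unknown task: {name}")


# The CLI must accept the new name, otherwise argparse rejects it before get_task runs.
TASK_CHOICES = ["game24", "my_new_task"]


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--task", type=str, required=True, choices=TASK_CHOICES)
    return parser.parse_args(argv)


args = parse_args(["--task", "my_new_task"])
task = get_task(args.task)
print(type(task).__name__)
```

Forgetting either change produces a different failure: a missing choices entry is rejected by argparse at startup, while a missing factory branch raises only when the task is constructed.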
Step 6: Create experiment scripts
Create shell scripts in scripts/{task_name}/ for running the new task with different methods (standard baseline, CoT baseline, ToT BFS). These scripts encode the recommended hyperparameters for the task.
Key considerations:
- Create at least three scripts: standard_sampling.sh, cot_sampling.sh, and bfs.sh
- Each script calls run.py with the appropriate arguments for the task
- Set task_start_index and task_end_index to cover the desired evaluation range
- Choose n_generate_sample, n_evaluate_sample, and n_select_sample based on task complexity
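The argument lists that such scripts would pass to run.py can be drafted in Python before being written out as shell scripts. The sketch below only builds and prints the commands; it does not execute anything. The task name and all numeric values are placeholders. The flags --naive_run, --prompt_sample, --method_generate, and --method_evaluate are not named in the text above and are assumed here, so check them against the argparse definitions in run.py before use.

```python
def build_command(task: str, start: int, end: int, extra: list) -> list:
    """Return an argv list for run.py; nothing is executed here."""
    return [
        "python", "run.py",
        "--task", task,
        "--task_start_index", str(start),
        "--task_end_index", str(end),
        *extra,
    ]


TASK = "my_new_task"  # hypothetical task name

# Placeholder hyperparameters; flag names other than the index and n_* ones are assumptions.
commands = {
    "standard_sampling": build_command(
        TASK, 0, 5,
        ["--naive_run", "--prompt_sample", "standard", "--n_generate_sample", "10"],
    ),
    "cot_sampling": build_command(
        TASK, 0, 5,
        ["--naive_run", "--prompt_sample", "cot", "--n_generate_sample", "10"],
    ),
    "bfs": build_command(
        TASK, 0, 5,
        ["--method_generate", "propose", "--method_evaluate", "value",
         "--n_generate_sample", "1", "--n_evaluate_sample", "3", "--n_select_sample", "5"],
    ),
}

for name, argv in commands.items():
    print(f"{name}.sh: {' '.join(argv)}")
```

Each printed line corresponds to the single run.py invocation that the matching shell script would contain.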
Step 7: Validate with a test run
Run the experiment scripts on a small subset of problems to verify correctness. Check that prompts are well-formed, LLM outputs are parsed correctly, validation logic works, and logs are saved properly.
Key considerations:
- Start with a small range (e.g., task_start_index=0, task_end_index=5) to minimize API cost
- Verify that test_output() correctly identifies valid and invalid solutions
- Check log files for proper structure and reasonable scores
- Compare naive baseline accuracy against ToT to confirm the tree search adds value
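Before spending API calls, the validation logic itself can be checked offline with a few hand-labelled outputs. The sketch below assumes a stand-in ToySumTask with a simple test_output and a hypothetical helper, check_test_output, that reports any case where the returned reward differs from the expected one.

```python
class ToySumTask:
    """Stand-in task: correct when the output's last token equals the sum of the input."""

    def __init__(self):
        self.data = ["1 2 3", "10 20"]

    def test_output(self, idx: int, output: str) -> dict:
        expected = str(sum(int(tok) for tok in self.data[idx].split()))
        tokens = output.split()
        return {"r": int(bool(tokens) and tokens[-1] == expected)}


def check_test_output(task, cases):
    """Return the cases whose reward differs from the hand-labelled expectation."""
    failures = []
    for idx, output, expected_r in cases:
        got = task.test_output(idx, output)["r"]
        if got != expected_r:
            failures.append((idx, output, expected_r, got))
    return failures


# Each case: (problem index, candidate output, expected reward).
cases = [
    (0, "1 + 2 + 3 = 6", 1),  # valid solution
    (0, "1 + 2 + 3 = 7", 0),  # wrong total
    (1, "", 0),               # empty output must not crash or score
]

failures = check_test_output(ToySumTask(), cases)
print("failures:", failures)
```

Including at least one malformed or empty output in the cases helps catch parsing code that raises instead of returning a zero reward.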