Implementation: Princeton NLP tree-of-thought-llm test_output
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, NLP |
| Last Updated | 2026-02-14 03:30 GMT |
Overview
Concrete tool for validating generated solutions against task-specific correctness criteria provided by each Task subclass.
Description
The test_output method is defined on the Task base class and implemented differently by each task subclass. It takes a puzzle index and a candidate solution string, then returns a dictionary containing reward/accuracy metrics. The calling convention is uniform across all tasks, enabling the experiment loop to validate any task's outputs generically.
Task-specific implementations:
- Game24Task.test_output: Extracts an arithmetic expression from the last line, verifies it uses the correct input numbers via regex, and evaluates it symbolically with sympy.simplify to check if it equals 24.
- TextTask.test_output: Sends the generated passage to GPT-4 with a scoring prompt, extracts a numeric coherency score from each of the five judge responses via regex, and averages them.
- MiniCrosswordsTask.test_output: Compares the generated grid against the known solution, computing word-level, letter-level, and overall game accuracy.
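The Game24 check described above can be sketched as a standalone function. This is a hedged approximation: the real method looks up the puzzle from the task's dataset via idx and performs the symbolic check with sympy.simplify; here the puzzle string is passed in directly and Python's eval stands in for the symbolic evaluation.

```python
import re

def check_game24(puzzle: str, output: str) -> dict:
    """Sketch of the Game24 validation logic (standalone approximation).

    puzzle: the four input numbers, e.g. '1 2 3 4' (passed in directly
    here; the real method reads them from the dataset via idx).
    output: candidate solution whose last line ends in '... = 24'.
    """
    # Take the expression from the last line, dropping any '= 24' suffix.
    expression = output.strip().split('\n')[-1].split('=')[0]
    # The expression must use exactly the puzzle's input numbers.
    if sorted(re.findall(r'\d+', expression)) != sorted(re.findall(r'\d+', puzzle)):
        return {'r': 0}
    try:
        # Python's eval stands in for sympy.simplify(expression) == 24.
        return {'r': int(eval(expression) == 24)}
    except Exception:
        return {'r': 0}

print(check_game24('1 2 3 4', '(1 + 2 + 3) * 4 = 24'))  # {'r': 1}
print(check_game24('1 2 3 4', '1 + 2 + 3 + 4 = 24'))    # {'r': 0}
```

Note that eval uses floating-point division, whereas sympy keeps exact rationals; the symbolic check avoids false negatives on expressions like 8 / (3 - 8 / 3).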
Usage
Called from the experiment loop in run.py:L26 for each candidate solution y in the output set ys. The returned dict is stored in the JSON log file.
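The calling pattern can be sketched as below; evaluate_candidates and DummyTask are hypothetical names used for illustration, not identifiers from run.py.

```python
def evaluate_candidates(task, idx, ys):
    """Validate every candidate solution via the uniform interface."""
    return [task.test_output(idx, y) for y in ys]

# Stand-in task so the sketch runs without the repository installed.
class DummyTask:
    def test_output(self, idx, output):
        return {'r': int(output == 'correct')}

infos = evaluate_candidates(DummyTask(), 0, ['correct', 'wrong'])
print(infos)  # [{'r': 1}, {'r': 0}]
# In run.py, these dicts are written into the JSON log alongside idx and ys.
```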
Code Reference
Source Location
- Repository: tree-of-thought-llm
- File: src/tot/tasks/game24.py (Lines 44-55), src/tot/tasks/text.py (Lines 32-49), src/tot/tasks/crosswords.py (Lines 190-202)
Signature
# Game24Task
def test_output(self, idx: int, output: str):
"""
Validate a Game of 24 solution.
Args:
idx (int): Puzzle index for ground truth lookup.
output (str): Candidate solution string.
Returns:
dict: {'r': int} where r is 0 or 1.
"""
# TextTask
def test_output(self, idx: int, output: str):
"""
Score a creative writing passage using LLM judges.
Args:
idx (int): Puzzle index (unused, included for interface consistency).
output (str): Generated passage text.
Returns:
dict: {'rs': list[int], 'r': float} with individual scores and average.
"""
# MiniCrosswordsTask
def test_output(self, idx: int, output: str):
"""
Validate a crosswords solution against known answers.
Args:
idx (int): Puzzle index.
output (str): Generated crossword grid.
Returns:
dict: {'r_word': float, 'r_letter': float, 'r_game': int}
"""
Import
# test_output is a method on Task objects, not imported directly
from tot.tasks import get_task
task = get_task('game24')
result = task.test_output(idx, output)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| idx | int | Yes | Puzzle index for ground truth lookup |
| output | str | Yes | Candidate solution string to validate |
Outputs
| Name | Type | Description |
|---|---|---|
| Game24 return | dict | {'r': 0 or 1} — binary correctness |
| Text return | dict | {'rs': list[int], 'r': float} — individual LLM scores and average coherency score |
| Crosswords return | dict | {'r_word': float, 'r_letter': float, 'r_game': int} — multi-level accuracy |
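The multi-level crossword metrics might be computed along these lines. This is an assumed reconstruction for illustration, not the repository's exact code (real mini-crossword grids hold ten 5-letter words).

```python
def score_crossword(pred_words, gold_words):
    """Sketch of multi-level scoring: word-level accuracy, letter-level
    accuracy, and a binary game score that is 1 only when every word
    in the grid matches the known solution."""
    r_word = sum(p == g for p, g in zip(pred_words, gold_words)) / len(gold_words)
    pred_letters = ''.join(pred_words)
    gold_letters = ''.join(gold_words)
    r_letter = sum(p == g for p, g in zip(pred_letters, gold_letters)) / len(gold_letters)
    return {'r_word': r_word, 'r_letter': r_letter, 'r_game': int(r_word == 1.0)}

result = score_crossword(['cat', 'dot'], ['cat', 'dog'])
print(result)  # one of two words correct, five of six letters, game lost
```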
Usage Examples
Validating Game of 24 Output
from tot.tasks import get_task
task = get_task('game24')
# Correct solution
result = task.test_output(900, "1 + 2 = 3\n3 + 3 = 6\n6 * 4 = 24\n(1 + 2 + 3) * 4 = 24")
print(result) # {'r': 1}
# Incorrect solution
result = task.test_output(900, "1 + 2 = 3\n3 + 4 = 7\nAnswer: 7")
print(result) # {'r': 0}
Validating Creative Writing Output
task = get_task('text')
result = task.test_output(0, "Passage:\nOnce upon a time...")
# result = {'rs': [8, 7, 9, 8, 7], 'r': 7.8}
# Uses 5 GPT-4 calls to score coherency
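The judge-score aggregation step can be sketched as below. The response format ('Score: N') and the function name are assumptions for illustration; the real method's regex is tied to the wording of its own scoring prompt.

```python
import re

def average_judge_scores(judge_outputs):
    """Sketch: extract one integer score per judge response and average."""
    scores = [int(m.group(1)) for text in judge_outputs
              if (m := re.search(r'(\d+)', text))]
    r = sum(scores) / len(scores) if scores else 0.0
    return {'rs': scores, 'r': r}

print(average_judge_scores(['Score: 8', 'Score: 7', 'Score: 9', 'Score: 8', 'Score: 7']))
# {'rs': [8, 7, 9, 8, 7], 'r': 7.8}
```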