Implementation: Princeton NLP tree-of-thought-llm test_output
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, NLP |
| Last Updated | 2026-02-14 03:30 GMT |
Overview
Concrete tool for validating generated solutions against task-specific correctness criteria provided by each Task subclass.
Description
The test_output method is defined on the Task base class and implemented differently by each task subclass. It takes a puzzle index and a candidate solution string, then returns a dictionary containing reward/accuracy metrics. The calling convention is uniform across all tasks, enabling the experiment loop to validate any task's outputs generically.
Task-specific implementations:
- Game24Task.test_output: Extracts an arithmetic expression from the last line, verifies it uses the correct input numbers via regex, and evaluates it symbolically with sympy.simplify to check if it equals 24.
- TextTask.test_output: Sends the generated passage to GPT-4 with a scoring prompt, extracts a numeric coherency score from each of the five judge responses via regex, and averages them.
- MiniCrosswordsTask.test_output: Compares the generated grid against the known solution, computing word-level, letter-level, and overall game accuracy.
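The Game24 check described above can be sketched as a standalone function. This is a hedged approximation: the real method looks up the puzzle from the task's dataset via idx and performs the symbolic check with sympy.simplify; here the puzzle string is passed in directly and Python's eval stands in for the symbolic evaluation.

```python
import re

def check_game24(puzzle: str, output: str) -> dict:
    """Sketch of the Game24 validation logic (standalone approximation).

    puzzle: the four input numbers, e.g. '1 2 3 4' (passed in directly
    here; the real method reads them from the dataset via idx).
    output: candidate solution whose last line ends in '... = 24'.
    """
    # Take the expression from the last line, dropping any '= 24' suffix.
    expression = output.strip().split('\n')[-1].split('=')[0]
    # The expression must use exactly the puzzle's input numbers.
    if sorted(re.findall(r'\d+', expression)) != sorted(re.findall(r'\d+', puzzle)):
        return {'r': 0}
    try:
        # Python's eval stands in for sympy.simplify(expression) == 24.
        return {'r': int(eval(expression) == 24)}
    except Exception:
        return {'r': 0}

print(check_game24('1 2 3 4', '(1 + 2 + 3) * 4 = 24'))  # {'r': 1}
print(check_game24('1 2 3 4', '1 + 2 + 3 + 4 = 24'))    # {'r': 0}
```

Note that eval uses floating-point division, whereas sympy keeps exact rationals; the symbolic check avoids false negatives on expressions like 8 / (3 - 8 / 3).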
Usage
Called from the experiment loop in run.py:L26 for each candidate solution y in the output set ys. The returned dict is stored in the JSON log file.
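The calling pattern can be sketched as below; evaluate_candidates and DummyTask are hypothetical names used for illustration, not identifiers from run.py.

```python
def evaluate_candidates(task, idx, ys):
    """Validate every candidate solution via the uniform interface."""
    return [task.test_output(idx, y) for y in ys]

# Stand-in task so the sketch runs without the repository installed.
class DummyTask:
    def test_output(self, idx, output):
        return {'r': int(output == 'correct')}

infos = evaluate_candidates(DummyTask(), 0, ['correct', 'wrong'])
print(infos)  # [{'r': 1}, {'r': 0}]
# In run.py, these dicts are written into the JSON log alongside idx and ys.
```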
Code Reference
Source Location
- Repository: tree-of-thought-llm
- File: src/tot/tasks/game24.py (Lines 44-55), src/tot/tasks/text.py (Lines 32-49), src/tot/tasks/crosswords.py (Lines 190-202)
Signature
# Game24Task
def test_output(self, idx: int, output: str):
"""
Validate a Game of 24 solution.
Args:
idx (int): Puzzle index for ground truth lookup.
output (str): Candidate solution string.
Returns:
dict: {'r': int} where r is 0 or 1.
"""
# TextTask
def test_output(self, idx: int, output: str):
"""
Score a creative writing passage using LLM judges.
Args:
idx (int): Puzzle index (unused, included for interface consistency).
output (str): Generated passage text.
Returns:
dict: {'rs': list[int], 'r': float} with individual scores and average.
"""
# MiniCrosswordsTask
def test_output(self, idx: int, output: str):
"""
Validate a crosswords solution against known answers.
Args:
idx (int): Puzzle index.
output (str): Generated crossword grid.
Returns:
dict: {'r_word': float, 'r_letter': float, 'r_game': int}
"""
Import
# test_output is a method on Task objects, not imported directly
from tot.tasks import get_task
task = get_task('game24')
result = task.test_output(idx, output)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| idx | int | Yes | Puzzle index for ground truth lookup |
| output | str | Yes | Candidate solution string to validate |
Outputs
| Name | Type | Description |
|---|---|---|
| Game24 return | dict | {'r': 0 or 1} — binary correctness |
| Text return | dict | {'rs': list[int], 'r': float} — individual LLM scores and average coherency score |
| Crosswords return | dict | {'r_word': float, 'r_letter': float, 'r_game': int} — multi-level accuracy |
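The multi-level crossword metrics might be computed along these lines. This is an assumed reconstruction for illustration, not the repository's exact code (real mini-crossword grids hold ten 5-letter words).

```python
def score_crossword(pred_words, gold_words):
    """Sketch of multi-level scoring: word-level accuracy, letter-level
    accuracy, and a binary game score that is 1 only when every word
    in the grid matches the known solution."""
    r_word = sum(p == g for p, g in zip(pred_words, gold_words)) / len(gold_words)
    pred_letters = ''.join(pred_words)
    gold_letters = ''.join(gold_words)
    r_letter = sum(p == g for p, g in zip(pred_letters, gold_letters)) / len(gold_letters)
    return {'r_word': r_word, 'r_letter': r_letter, 'r_game': int(r_word == 1.0)}

result = score_crossword(['cat', 'dot'], ['cat', 'dog'])
print(result)  # one of two words correct, five of six letters, game lost
```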
Usage Examples
Validating Game of 24 Output
from tot.tasks import get_task
task = get_task('game24')
# Correct solution
result = task.test_output(900, "1 + 2 = 3\n3 + 3 = 6\n6 * 4 = 24\n(1 + 2 + 3) * 4 = 24")
print(result) # {'r': 1}
# Incorrect solution
result = task.test_output(900, "1 + 2 = 3\n3 + 4 = 7\nAnswer: 7")
print(result) # {'r': 0}
Validating Creative Writing Output
task = get_task('text')
result = task.test_output(0, "Passage:\nOnce upon a time...")
# result = {'rs': [8, 7, 9, 8, 7], 'r': 7.8}
# Uses 5 GPT-4 calls to score coherency
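The judge-score aggregation step can be sketched as below. The response format ('Score: N') and the function name are assumptions for illustration; the real method's regex is tied to the wording of its own scoring prompt.

```python
import re

def average_judge_scores(judge_outputs):
    """Sketch: extract one integer score per judge response and average."""
    scores = [int(m.group(1)) for text in judge_outputs
              if (m := re.search(r'(\d+)', text))]
    r = sum(scores) / len(scores) if scores else 0.0
    return {'rs': scores, 'r': r}

print(average_judge_scores(['Score: 8', 'Score: 7', 'Score: 9', 'Score: 8', 'Score: 7']))
# {'rs': [8, 7, 9, 8, 7], 'r': 7.8}
```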