Principle:Princeton nlp Tree of thought llm Result Validation
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, NLP |
| Last Updated | 2026-02-14 03:30 GMT |
Overview
A task-specific validation mechanism that checks whether a generated solution is correct by comparing it against ground truth or applying domain-specific verification logic.
Description
Result Validation is the post-search step that determines the quality of generated solutions. Unlike thought evaluation (which is an LLM-based heuristic used during search), result validation provides the definitive correctness assessment used for computing experiment metrics. Each task implements its own validation logic:
- Game of 24: Parses the arithmetic expression from the output, verifies it uses exactly the input numbers, and checks via symbolic simplification (sympy) that it evaluates to 24. Returns binary 0/1.
- Creative Writing: Uses the LLM itself to score passage coherency on a numeric scale (via score_prompt). Returns the average score across multiple LLM judges. This makes it a "soft" validation that relies on AI evaluation rather than ground truth.
- Crosswords: Checks word-level, letter-level, and game-level accuracy against the known solution grid.
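The Game of 24 check described above can be sketched as follows. This is a minimal illustration, not the repository's code: the function names `validate_game24` and `_eval_exact` are hypothetical, the "Answer: ... = 24" output format is an assumption, and exact evaluation uses the standard-library `fractions` module as a dependency-free stand-in for the sympy simplification mentioned in the text.

```python
import ast
import operator
import re
from fractions import Fraction

# Binary operators allowed in a Game of 24 expression.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def _eval_exact(node):
    """Evaluate a parsed arithmetic expression exactly using Fractions."""
    if isinstance(node, ast.Expression):
        return _eval_exact(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, int):
        return Fraction(node.value)
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval_exact(node.left),
                                   _eval_exact(node.right))
    raise ValueError("unsupported syntax")


def validate_game24(puzzle: str, output: str) -> int:
    """Return 1 if the last line of `output` solves `puzzle`, else 0."""
    # Assumed format: the last line looks like "Answer: (4 + 8) * (6 - 4) = 24".
    last_line = output.strip().split("\n")[-1].lower()
    expression = last_line.replace("answer: ", "").split("=")[0].strip()
    # The expression must use exactly the puzzle's numbers (as a multiset).
    used = sorted(re.findall(r"\d+", expression))
    given = sorted(re.findall(r"\d+", puzzle))
    if used != given:
        return 0
    try:
        tree = ast.parse(expression, mode="eval")
        return int(_eval_exact(tree) == 24)
    except (ValueError, SyntaxError, ZeroDivisionError):
        return 0
```

Exact rational arithmetic matters here because a solution such as `6 / (1 - 3 / 4)` equals 24 only if the intermediate `3 / 4` is not rounded; floating-point evaluation could misjudge such cases.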
Usage
Use this principle after the search (or baseline sampling) is complete, to compute final correctness metrics. It is called for every candidate in the solution set, and the results are logged alongside the search trajectory for analysis.
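As a rough sketch of this usage, the loop below scores every candidate for every puzzle and writes the results to a JSON log. The function name `validate_and_log`, the record keys, and the `validate(idx, y)` callable signature are assumptions made for illustration; the actual logging format may differ.

```python
import json
from typing import Callable, Dict, List


def validate_and_log(candidates: Dict[int, List[str]],
                     validate: Callable[[int, str], float],
                     log_path: str) -> List[dict]:
    """Score every candidate of every puzzle and write the records as JSON."""
    logs = []
    for idx, ys in candidates.items():
        # One validation score per candidate solution of puzzle `idx`.
        scores = [validate(idx, y) for y in ys]
        logs.append({"idx": idx, "ys": ys, "scores": scores})
    with open(log_path, "w") as f:
        json.dump(logs, f, indent=4)
    return logs
```

Keeping the raw candidates next to their scores in each record makes it possible to recompute or audit metrics later without rerunning the search.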
Theoretical Basis
Result validation maps a (puzzle_index, solution) pair to a score:
# Abstract pattern
def validate(task, idx, output):
    ground_truth = task.get_ground_truth(idx)
    if task.has_exact_answer:
        return 1 if verify(output, ground_truth) else 0
    else:
        return soft_score(output)  # e.g., LLM-based scoring
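Let N be the number of puzzles, Y_i the set of candidate solutions for puzzle i, and r(i, y) the validation score of candidate y. The two experiment-level metrics computed from individual validation results are: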
- $\text{cnt\_avg} = \frac{1}{N}\sum_{i}\frac{1}{|Y_i|}\sum_{y \in Y_i} r(i, y)$: average accuracy across all candidates
- $\text{cnt\_any} = \frac{1}{N}\sum_{i}\mathbb{1}[\exists y \in Y_i: r(i, y) > 0]$: fraction of puzzles with at least one correct solution
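The two metrics above can be computed directly from per-candidate scores. The sketch below assumes the scores are already collected as a list of lists, one inner list per puzzle, and that every puzzle has at least one candidate; the function name `experiment_metrics` is hypothetical.

```python
from typing import Dict, List


def experiment_metrics(results: List[List[float]]) -> Dict[str, float]:
    """Compute cnt_avg and cnt_any; results[i] holds r(i, y) for each y in Y_i."""
    n = len(results)
    # cnt_avg: mean over puzzles of the per-puzzle mean candidate score.
    cnt_avg = sum(sum(rs) / len(rs) for rs in results) / n
    # cnt_any: fraction of puzzles with at least one positively scored candidate.
    cnt_any = sum(1 for rs in results if any(r > 0 for r in rs)) / n
    return {"cnt_avg": cnt_avg, "cnt_any": cnt_any}
```

For example, three puzzles with scores `[[1, 0, 0, 0], [0, 0], [1, 1]]` have per-puzzle means 0.25, 0 and 1, so cnt_avg is 1.25 / 3 and cnt_any is 2 / 3.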