Principle:Spcl Graph of thoughts Ground Truth Evaluation
| Knowledge Sources | |
|---|---|
| Domains | Graph_Reasoning, Evaluation |
| Last Updated | 2026-02-14 |
| Implemented By | Implementation:Spcl_Graph_of_thoughts_GroundTruth_Operation |
Overview
Evaluation pattern that checks whether thoughts correctly solve the problem by applying a user-supplied ground truth evaluator function to each thought's state.
Description
The GroundTruth operation is an evaluation mechanism in the Graph of Thoughts framework that assesses whether thoughts have correctly solved the problem. It uses a callable evaluator function that takes a thought's state dictionary and returns a boolean indicating whether the thought represents a correct solution.
Key characteristics:
- Takes a `ground_truth_evaluator` callable with signature `(state: Dict) -> bool` that defines the correctness criterion
- Sets the `thought.solved` flag on each thought based on the evaluator's result, which also automatically sets `compared_to_ground_truth = True`
- Creates cloned copies of predecessor thoughts via `Thought.from_thought` before setting the solved flag
- Includes exception handling: if the evaluator raises any exception, the thought is marked as `solved = False` (prevents pipeline crashes from malformed states)
- Requires at least one predecessor operation (enforced by assertion)
- Does not interact with the language model -- it is a pure evaluation operation using the provided function
- Logs the total number of evaluated thoughts and how many solved the problem
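The characteristics above can be illustrated with a minimal, self-contained sketch. It does not import the framework: `SketchThought` and `evaluate_thoughts` are hypothetical stand-ins written for this page, not the library's actual classes, and the predecessor assertion is omitted because the sketch has no operation graph.

```python
import copy
import logging
from typing import Callable, Dict, List, Optional

logger = logging.getLogger(__name__)


class SketchThought:
    """Hypothetical stand-in for the framework's Thought class."""

    def __init__(self, state: Optional[Dict] = None) -> None:
        self.state: Dict = state or {}
        self._solved: bool = False
        self.compared_to_ground_truth: bool = False

    @property
    def solved(self) -> bool:
        return self._solved

    @solved.setter
    def solved(self, value: bool) -> None:
        # Setting the flag also records that a comparison took place.
        self._solved = value
        self.compared_to_ground_truth = True

    @classmethod
    def from_thought(cls, thought: "SketchThought") -> "SketchThought":
        # Clone the state so the predecessor's thought is left untouched.
        return cls(copy.deepcopy(thought.state))


def evaluate_thoughts(
    previous_thoughts: List[SketchThought],
    evaluator: Callable[[Dict], bool],
) -> List[SketchThought]:
    """Clone each thought, apply the evaluator, and set the solved flag."""
    evaluated: List[SketchThought] = []
    for thought in previous_thoughts:
        new_thought = SketchThought.from_thought(thought)
        try:
            new_thought.solved = bool(evaluator(new_thought.state))
        except Exception:
            # A failing evaluator counts as "not solved" instead of crashing.
            new_thought.solved = False
        evaluated.append(new_thought)
    logger.info(
        "Evaluated %d thoughts, %d solved the problem",
        len(evaluated),
        sum(t.solved for t in evaluated),
    )
    return evaluated
```

In this sketch, an evaluator such as `lambda s: s["current"] == 3` applied to a thought with an empty state raises `KeyError`, so the cloned thought ends up with `solved = False` while the original thought is never modified.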
Usage
Use the GroundTruth operation at the end of a reasoning pipeline to evaluate the quality of the final output against a known correct answer. This is essential for:
- Benchmarking: Measuring the accuracy of different reasoning strategies (IO, CoT, ToT, GoT) on the same problem set
- Evaluation pipelines: Automatically checking whether the LM's reasoning process arrived at the correct answer
- Quality reporting: The solved flag is captured in the serialized execution graph for offline analysis
The evaluator function is domain-specific and must be provided by the user. Examples include:
- Sorting: Check if the output list is correctly sorted and matches the ground truth
- Set intersection: Verify that the computed intersection matches the expected result
- Keyword counting: Confirm that the count matches the actual number of occurrences
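As a rough illustration of what such evaluators might look like, the sketch below covers the sorting and set intersection cases. The state keys (`original`, `current`, `set1`, `set2`) and the convention of storing lists as strings are assumptions made for this example; the actual keys depend on how a given pipeline populates the thought state.

```python
import ast
from typing import Dict


def sorting_evaluator(state: Dict) -> bool:
    """True if the current list equals the sorted original list.

    Assumes (hypothetically) that both lists are stored as strings
    such as "[3, 1, 2]" under the keys "original" and "current".
    """
    original = ast.literal_eval(state["original"])
    current = ast.literal_eval(state["current"])
    return current == sorted(original)


def set_intersection_evaluator(state: Dict) -> bool:
    """True if the current answer equals the intersection of both input sets.

    Assumes (hypothetically) the keys "set1", "set2" and "current".
    Element order in the answer is ignored.
    """
    expected = set(ast.literal_eval(state["set1"])) & set(
        ast.literal_eval(state["set2"])
    )
    return set(ast.literal_eval(state["current"])) == expected
```

Neither function guards against missing keys or unparsable strings. That is acceptable here because the GroundTruth operation treats any exception raised by the evaluator as an unsolved thought.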
Theoretical Basis
The GroundTruth operation serves as the terminal evaluation vertex in a GoT reasoning graph. It does not participate in the reasoning process itself but rather provides an external oracle for assessing the quality of the reasoning output.
In the GoT paper, the authors benchmark their framework by comparing the output of different reasoning strategies against known correct answers. The GroundTruth operation formalizes this comparison as a first-class operation in the graph, enabling:
- Automated evaluation: No manual inspection needed; the pipeline self-reports its accuracy
- Batch processing: Run hundreds or thousands of problems and aggregate the solved/unsolved statistics
- Fair comparison: All reasoning strategies (IO, CoT, ToT, GoT) go through the same evaluation operation
The separation of evaluation from reasoning follows the principle of separation of concerns: the reasoning operations (Generate, Aggregate, Improve, etc.) focus on producing solutions, while GroundTruth independently assesses their correctness. This design allows the same evaluation function to be reused across different reasoning topologies.
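To make the reuse and batch aggregation points concrete, the following sketch applies one evaluator to the final states produced by several strategies and reports a solve rate for each. The helper names `solve_rate` and `compare_strategies` are hypothetical and not part of the framework, and the sketch works on plain state dictionaries instead of the framework's serialized output.

```python
from typing import Callable, Dict, List


def solve_rate(final_states: List[Dict], evaluator: Callable[[Dict], bool]) -> float:
    """Fraction of final states that the evaluator accepts."""
    if not final_states:
        return 0.0
    solved = 0
    for state in final_states:
        try:
            solved += bool(evaluator(state))
        except Exception:
            # Mirror the operation's behavior: a failing evaluator means unsolved.
            pass
    return solved / len(final_states)


def compare_strategies(
    states_by_strategy: Dict[str, List[Dict]],
    evaluator: Callable[[Dict], bool],
) -> Dict[str, float]:
    """Apply the same evaluator to every strategy's final states."""
    return {
        name: solve_rate(states, evaluator)
        for name, states in states_by_strategy.items()
    }
```

Because the evaluator is passed in as a plain function, the same correctness criterion is applied regardless of which reasoning topology produced the states.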
Code Reference
The Ground Truth Evaluation principle is implemented in the GroundTruth class:
- Source file: `graph_of_thoughts/operations/operations.py`, Lines 776-837
- Class: `GroundTruth(Operation)`
- Operation type: `OperationType.ground_truth_evaluator`
Related Pages
- Implementation:Spcl_Graph_of_thoughts_GroundTruth_Operation - Concrete implementation of this principle
- Principle:Spcl_Graph_of_thoughts_Best_N_Selection - KeepBestN is often the last operation before GroundTruth evaluation
- Workflow:Spcl_Graph_of_thoughts_GoT_Sorting_Pipeline - Sorting benchmark pipeline with GroundTruth evaluation
- Workflow:Spcl_Graph_of_thoughts_GoT_Keyword_Counting_Pipeline - Keyword counting pipeline with GroundTruth evaluation