Principle:Spcl Graph of thoughts Ground Truth Evaluation

Knowledge Sources	Graph of Thoughts; Graph of Thoughts: Solving Elaborate Problems with Large Language Models
Domains	Graph_Reasoning, Evaluation
Last Updated	2026-02-14
Implemented By	Implementation:Spcl_Graph_of_thoughts_GroundTruth_Operation

Overview

Evaluation pattern that checks whether thoughts correctly solve the problem by passing each thought's state to a user-supplied ground truth evaluator function.

Description

The GroundTruth operation is an evaluation mechanism in the Graph of Thoughts framework that assesses whether thoughts have correctly solved the problem. It uses a callable evaluator function that takes a thought's state dictionary and returns a boolean indicating whether the thought represents a correct solution.

Key characteristics:

Takes a ground_truth_evaluator callable with signature (state: Dict) -> bool that defines the correctness criterion
Sets the thought.solved flag on each thought based on the evaluator's result, which also automatically sets compared_to_ground_truth = True
Creates cloned copies of predecessor thoughts via Thought.from_thought before setting the solved flag
Includes exception handling: if the evaluator raises any exception, the thought is marked as solved = False (prevents pipeline crashes from malformed states)
Requires at least one predecessor operation (enforced by assertion)
Does not interact with the language model -- it is a pure evaluation operation using the provided function
Logs the total number of evaluated thoughts and how many solved the problem
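
The characteristics listed above can be illustrated with a minimal, self-contained sketch. It is not the library's code: SketchThought and evaluate_against_ground_truth are hypothetical stand-ins that only mirror the described behavior (clone, evaluate, fall back to solved = False on an exception, log the counts). The assertion on predecessor operations is not modeled here, since the sketch works on a plain list of thoughts.

```python
import copy
import logging
from typing import Callable, Dict, List, Optional

logger = logging.getLogger(__name__)


class SketchThought:
    """Hypothetical stand-in for the framework's Thought class."""

    def __init__(self, state: Optional[Dict] = None) -> None:
        self.state: Dict = state if state is not None else {}
        self._solved = False
        self.compared_to_ground_truth = False

    @property
    def solved(self) -> bool:
        return self._solved

    @solved.setter
    def solved(self, value: bool) -> None:
        # Setting the flag also records that a comparison took place.
        self._solved = value
        self.compared_to_ground_truth = True

    @staticmethod
    def from_thought(thought: "SketchThought") -> "SketchThought":
        # Clone so the predecessor's thought is left untouched.
        return SketchThought(copy.deepcopy(thought.state))


def evaluate_against_ground_truth(
    previous_thoughts: List[SketchThought],
    evaluator: Callable[[Dict], bool],
) -> List[SketchThought]:
    """Return evaluated clones of the given thoughts (assumed helper name)."""
    evaluated: List[SketchThought] = []
    for thought in previous_thoughts:
        new_thought = SketchThought.from_thought(thought)
        try:
            new_thought.solved = bool(evaluator(new_thought.state))
        except Exception:
            # A failing evaluator counts as "not solved" instead of crashing.
            new_thought.solved = False
        evaluated.append(new_thought)
    logger.info(
        "Evaluated %d thoughts, %d solved the problem",
        len(evaluated),
        sum(t.solved for t in evaluated),
    )
    return evaluated
```

Because the clones carry the flag, the thoughts produced by the predecessor operation keep their original, unevaluated status.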

Usage

Use the GroundTruth operation at the end of a reasoning pipeline to check the final output against a known correct answer. This is essential for:

Benchmarking: Measuring the accuracy of different reasoning strategies (IO, CoT, ToT, GoT) on the same problem set
Evaluation pipelines: Automatically checking whether the LM's reasoning process arrived at the correct answer
Quality reporting: The solved flag is captured in the serialized execution graph for offline analysis

The evaluator function is domain-specific and must be provided by the user. Examples include:

Sorting: Check if the output list is correctly sorted and matches the ground truth
Set intersection: Verify that the computed intersection matches the expected result
Keyword counting: Confirm that the count matches the actual number of occurrences
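
As a rough illustration of the three examples above, the evaluators below follow the (state: Dict) -> bool signature. The state keys used here ("original", "current", "set1", "set2", "text", "keyword", "count") are assumptions made for this sketch; the actual keys depend on how a given pipeline structures its state.

```python
from typing import Dict


def sorting_evaluator(state: Dict) -> bool:
    # Correct only if the output equals the sorted input, which also
    # rejects outputs with missing or extra elements.
    return state["current"] == sorted(state["original"])


def intersection_evaluator(state: Dict) -> bool:
    # Compare as sorted lists so duplicated elements in the output fail.
    expected = set(state["set1"]) & set(state["set2"])
    return sorted(state["current"]) == sorted(expected)


def keyword_count_evaluator(state: Dict) -> bool:
    # Case-insensitive count of whitespace-separated tokens (assumed rule).
    tokens = state["text"].lower().split()
    return state["count"] == tokens.count(state["keyword"].lower())
```

Each function raises KeyError when an expected key is missing, which the operation's exception handling would turn into solved = False.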

Theoretical Basis

The GroundTruth operation serves as the terminal evaluation vertex in a GoT reasoning graph. It does not participate in the reasoning process itself but rather provides an external oracle for assessing the correctness of the reasoning output.

In the GoT paper, the authors benchmark their framework by comparing the output of different reasoning strategies against known correct answers. The GroundTruth operation formalizes this comparison as a first-class operation in the graph, enabling:

Automated evaluation: No manual inspection needed; the pipeline self-reports its accuracy
Batch processing: Run hundreds or thousands of problems and aggregate the solved/unsolved statistics
Fair comparison: All reasoning strategies (IO, CoT, ToT, GoT) go through the same evaluation operation

The separation of evaluation from reasoning follows the principle of separation of concerns: the reasoning operations (Generate, Aggregate, Improve, etc.) focus on producing solutions, while GroundTruth independently assesses their correctness. This design allows the same evaluation function to be reused across different reasoning topologies.
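
The reuse of one evaluator across strategies, together with batch aggregation of solved/unsolved counts, could look roughly like the sketch below. The function accuracy_by_strategy and the shape of its input (a mapping from strategy name to a list of final state dictionaries) are assumptions for illustration and are not part of the framework.

```python
from typing import Callable, Dict, List


def accuracy_by_strategy(
    final_states: Dict[str, List[Dict]],
    evaluator: Callable[[Dict], bool],
) -> Dict[str, float]:
    """Apply one evaluator to every strategy's outputs (hypothetical helper)."""
    report: Dict[str, float] = {}
    for strategy, states in final_states.items():
        solved = 0
        for state in states:
            try:
                solved += bool(evaluator(state))
            except Exception:
                # Malformed states count as unsolved, as in the operation itself.
                pass
        report[strategy] = solved / len(states) if states else 0.0
    return report


def is_sorted_output(state: Dict) -> bool:
    # Assumed state keys: "original" (input list) and "current" (output list).
    return state["current"] == sorted(state["original"])


runs = {
    "io": [
        {"original": [2, 1], "current": [2, 1]},
        {"original": [3, 1], "current": [1, 3]},
    ],
    "got": [
        {"original": [2, 1], "current": [1, 2]},
        {"original": [3, 1], "current": [1, 3]},
    ],
}
report = accuracy_by_strategy(runs, is_sorted_output)
```

Since every strategy is scored by the same function, differences in the report reflect the reasoning topology and not the evaluation criterion.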

Code Reference

The Ground Truth Evaluation principle is implemented in the GroundTruth class:

Source file: graph_of_thoughts/operations/operations.py, Lines 776-837
Class: GroundTruth(Operation)
Operation type: OperationType.ground_truth_evaluator

Related Pages

Implementation:Spcl_Graph_of_thoughts_GroundTruth_Operation - Concrete implementation of this principle
Principle:Spcl_Graph_of_thoughts_Best_N_Selection - KeepBestN is often the last operation before GroundTruth evaluation
Workflow:Spcl_Graph_of_thoughts_GoT_Sorting_Pipeline - Sorting benchmark pipeline with GroundTruth evaluation
Workflow:Spcl_Graph_of_thoughts_GoT_Keyword_Counting_Pipeline - Keyword counting pipeline with GroundTruth evaluation
