# Heuristic: Graph of Thoughts Scoring with Error Counting
| Knowledge Sources | |
|---|---|
| Domains | LLM_Reasoning, Optimization |
| Last Updated | 2026-02-14 03:30 GMT |
## Overview
Use deterministic error-counting scoring functions instead of LLM-based scoring for reliable thought evaluation in GoT pipelines.
## Description
The `Score` operation in Graph of Thoughts supports two scoring modes: LLM-prompted scoring (the LLM evaluates solution quality) and callback-based scoring (a Python function computes a deterministic score). The benchmark examples use callback-based scoring exclusively, with error-counting functions: each function compares the current thought state against the ground truth or the expected properties and returns the number of errors as the score. This convention is lower-is-better: a score of 0 means a perfect solution, and `KeepBestN` is configured with `higher_is_better=False`.
## Usage
Use this heuristic when:
- The problem has a deterministic correctness criterion (e.g., sorted order, exact frequency counts)
- You need reproducible scoring that does not depend on LLM stochasticity
- You want to minimize API costs by avoiding additional LLM calls for scoring
- The score is used with `KeepBestN` to select the best candidate from multiple generated solutions
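For the "exact frequency counts" case above, a deterministic scorer can be a few lines of plain Python. A minimal sketch (the names `count_errors`, `predicted`, and `expected` are illustrative, not from the benchmark code; the benchmark's own scorers live in each example's `utils.py`):

```python
from typing import Dict


def count_errors(predicted: Dict[str, int], expected: Dict[str, int]) -> float:
    """Sum of absolute count differences across all keys (lower is better).

    A hypothetical scorer for a frequency-counting task: 0.0 means the
    predicted counts match the ground truth exactly.
    """
    keys = set(predicted) | set(expected)
    return float(sum(abs(predicted.get(k, 0) - expected.get(k, 0)) for k in keys))
```

Because the score is an error count, partial credit falls out naturally: a solution missing one keyword scores better than one missing three.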
## The Insight (Rule of Thumb)
- Action: Pass a `scoring_function` callback to the `Score` operation instead of relying on LLM scoring.
- Value: Use error count (lower is better) with `KeepBestN(n, higher_is_better=False)`.
- Trade-off: Deterministic but requires domain knowledge to write the scoring function. Cannot evaluate subjective quality.
- Fallback: The scoring function should return a high penalty (e.g., 300) on exceptions to push unparseable results to the bottom of the ranking.
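The action, value, and fallback above combine into one selection pattern. A library-free sketch of that pattern (in the real pipeline this logic is delegated to `operations.Score` and `operations.KeepBestN`; the helper names here are hypothetical):

```python
from typing import Callable, Dict, List

PENALTY = 300.0  # pushes unparseable states to the bottom of a lower-is-better ranking


def safe_score(state: Dict, scorer: Callable[[Dict], float]) -> float:
    """Apply a deterministic scoring callback, falling back to a high penalty."""
    try:
        return scorer(state)
    except Exception:
        return PENALTY


def keep_best_n(states: List[Dict], scorer: Callable[[Dict], float], n: int) -> List[Dict]:
    """Keep the n lowest-error states (the higher_is_better=False convention)."""
    return sorted(states, key=lambda s: safe_score(s, scorer))[:n]


def inversions(state: Dict) -> float:
    """Example scorer: count adjacent out-of-order pairs in state['current']."""
    xs = state["current"]  # raises KeyError for malformed states -> penalty path
    return float(sum(1 for a, b in zip(xs, xs[1:]) if a > b))
```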
## Reasoning
LLM-based scoring adds latency, cost, and non-determinism. For problems with well-defined correctness criteria, a deterministic scoring function is superior because:
- Zero additional API cost — no extra LLM calls needed for scoring
- Consistent results — same input always produces same score
- Fine-grained — can count exact number of errors, not just binary correct/incorrect
- Robust — exception handling returns high penalty rather than crashing
## Code Evidence
Sorting error-counting function from `examples/sorting/utils.py:46-78`:
```python
def num_errors(state: Dict) -> float:
    try:
        unsorted_list = state["original"]
        if (
            "unsorted_sublist" in state
            and state["unsorted_sublist"] != ""
            and state["unsorted_sublist"] is not None
            and len(state["unsorted_sublist"]) < len(unsorted_list) - 5
        ):
            unsorted_list = state["unsorted_sublist"]
        correct_list = sorted(string_to_list(unsorted_list))
        current_list = string_to_list(state["current"])
        num_errors = 0
        for i in range(10):
            num_errors += abs(
                sum([1 for num in current_list if num == i])
                - sum([1 for num in correct_list if num == i])
            )
        num_errors += sum(
            [1 for num1, num2 in zip(current_list, current_list[1:]) if num1 > num2]
        )
        return num_errors
    except:
        return 300
```
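Note that `num_errors` relies on a `string_to_list` helper defined elsewhere in the same `utils.py`. A minimal sketch of such a helper, assuming the thought state stores lists as string literals like `"[3, 1, 2]"` (the real implementation may differ in its parsing and error handling):

```python
import ast
from typing import List


def string_to_list(list_str: str) -> List[int]:
    """Parse a list literal such as "[3, 1, 2]" into a list of ints.

    Sketch only: ast.literal_eval safely evaluates the literal without
    executing arbitrary code; malformed input raises, which num_errors
    converts into the 300-point penalty.
    """
    parsed = ast.literal_eval(list_str.strip())
    return [int(x) for x in parsed]
```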
Score operation usage with callback from `examples/sorting/sorting_032.py:474`:
```python
operations_graph.append_operation(operations.Score(1, False, utils.num_errors))
```
KeepBestN with lower-is-better from `examples/sorting/sorting_032.py:508`:
```python
keep_best_1 = operations.KeepBestN(1, False)
```
KeepBestN error handling fallback from `graph_of_thoughts/operations/operations.py:649-668`:
```python
try:
    return sorted(
        previous_thoughts,
        key=lambda thought: thought.score,
        reverse=self.higher_is_better,
    )[: self.n]
except:
    self.logger.error("Error in KeepBestN operation")
    return sorted(
        [i for i in previous_thoughts if isinstance(i.score, float)],
        key=lambda thought: thought.score,
        reverse=self.higher_is_better,
    )[: self.n]
```
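The fallback branch matters when some thoughts carry non-numeric scores (for example `None` when scoring never ran): comparing `None` with a float raises `TypeError`, so the retry filters to float-scored thoughts before sorting. A minimal reproduction of the two code paths, using a hypothetical `Thought` stand-in rather than the library class:

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class Thought:
    score: Any  # normally a float; may be None if scoring failed


def keep_best(thoughts: List[Thought], n: int, higher_is_better: bool = False) -> List[Thought]:
    """Mimic KeepBestN: sort by score; on mixed-type scores, retry with floats only."""
    try:
        return sorted(thoughts, key=lambda t: t.score,
                      reverse=higher_is_better)[:n]
    except TypeError:  # e.g. '<' not supported between None and float
        return sorted((t for t in thoughts if isinstance(t.score, float)),
                      key=lambda t: t.score, reverse=higher_is_better)[:n]
```

With lower-is-better scores, the best thought is simply the one with the fewest counted errors; unscored thoughts are silently dropped by the retry.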