# Heuristic: Graph of Thoughts Scoring with Error Counting
| Knowledge Sources | |
|---|---|
| Domains | LLM_Reasoning, Optimization |
| Last Updated | 2026-02-14 03:30 GMT |
## Overview
Use deterministic error-counting scoring functions instead of LLM-based scoring for reliable thought evaluation in GoT pipelines.
## Description
The `Score` operation in Graph of Thoughts supports two scoring modes: LLM-prompted scoring (the LLM evaluates solution quality) and callback-based scoring (a Python function computes a deterministic score). The benchmark examples use callback-based scoring exclusively, with error-counting functions: each function compares the current thought state against the ground truth or the expected properties and returns the number of errors as the score. This convention is lower-is-better: a score of 0 means a perfect solution, and `KeepBestN` is configured with `higher_is_better=False`.
## Usage
Use this heuristic when:
- The problem has a deterministic correctness criterion (e.g., sorted order, exact frequency counts)
- You need reproducible scoring that does not depend on LLM stochasticity
- You want to minimize API costs by avoiding additional LLM calls for scoring
- The score is used with `KeepBestN` to select the best candidate from multiple generated solutions
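For the "exact frequency counts" case above, a deterministic scorer can be a few lines of plain Python. A minimal sketch (the names `count_errors`, `predicted`, and `expected` are illustrative, not from the benchmark code; the benchmark's own scorers live in each example's `utils.py`):

```python
from typing import Dict


def count_errors(predicted: Dict[str, int], expected: Dict[str, int]) -> float:
    """Sum of absolute count differences across all keys (lower is better).

    A hypothetical scorer for a frequency-counting task: 0.0 means the
    predicted counts match the ground truth exactly.
    """
    keys = set(predicted) | set(expected)
    return float(sum(abs(predicted.get(k, 0) - expected.get(k, 0)) for k in keys))
```

Because the score is an error count, partial credit falls out naturally: a solution missing one keyword scores better than one missing three.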
## The Insight (Rule of Thumb)
- Action: Pass a `scoring_function` callback to the `Score` operation instead of relying on LLM scoring.
- Value: Use error count (lower is better) with `KeepBestN(n, higher_is_better=False)`.
- Trade-off: Deterministic but requires domain knowledge to write the scoring function. Cannot evaluate subjective quality.
- Fallback: The scoring function should return a high penalty (e.g., 300) on exceptions to push unparseable results to the bottom of the ranking.
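The action, value, and fallback above combine into one selection pattern. A library-free sketch of that pattern (in the real pipeline this logic is delegated to `operations.Score` and `operations.KeepBestN`; the helper names here are hypothetical):

```python
from typing import Callable, Dict, List

PENALTY = 300.0  # pushes unparseable states to the bottom of a lower-is-better ranking


def safe_score(state: Dict, scorer: Callable[[Dict], float]) -> float:
    """Apply a deterministic scoring callback, falling back to a high penalty."""
    try:
        return scorer(state)
    except Exception:
        return PENALTY


def keep_best_n(states: List[Dict], scorer: Callable[[Dict], float], n: int) -> List[Dict]:
    """Keep the n lowest-error states (the higher_is_better=False convention)."""
    return sorted(states, key=lambda s: safe_score(s, scorer))[:n]


def inversions(state: Dict) -> float:
    """Example scorer: count adjacent out-of-order pairs in state['current']."""
    xs = state["current"]  # raises KeyError for malformed states -> penalty path
    return float(sum(1 for a, b in zip(xs, xs[1:]) if a > b))
```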
## Reasoning
LLM-based scoring adds latency, cost, and non-determinism. For problems with well-defined correctness criteria, a deterministic scoring function is superior because:
- Zero additional API cost — no extra LLM calls needed for scoring
- Consistent results — same input always produces same score
- Fine-grained — can count exact number of errors, not just binary correct/incorrect
- Robust — exception handling returns high penalty rather than crashing
## Code Evidence
Sorting error-counting function from `examples/sorting/utils.py:46-78`:
```python
def num_errors(state: Dict) -> float:
    try:
        unsorted_list = state["original"]
        if (
            "unsorted_sublist" in state
            and state["unsorted_sublist"] != ""
            and state["unsorted_sublist"] is not None
            and len(state["unsorted_sublist"]) < len(unsorted_list) - 5
        ):
            unsorted_list = state["unsorted_sublist"]
        correct_list = sorted(string_to_list(unsorted_list))
        current_list = string_to_list(state["current"])
        num_errors = 0
        for i in range(10):
            num_errors += abs(
                sum([1 for num in current_list if num == i])
                - sum([1 for num in correct_list if num == i])
            )
        num_errors += sum(
            [1 for num1, num2 in zip(current_list, current_list[1:]) if num1 > num2]
        )
        return num_errors
    except:
        return 300
```
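Note that `num_errors` relies on a `string_to_list` helper defined elsewhere in the same `utils.py`. A minimal sketch of such a helper, assuming the thought state stores lists as string literals like `"[3, 1, 2]"` (the real implementation may differ in its parsing and error handling):

```python
import ast
from typing import List


def string_to_list(list_str: str) -> List[int]:
    """Parse a list literal such as "[3, 1, 2]" into a list of ints.

    Sketch only: ast.literal_eval safely evaluates the literal without
    executing arbitrary code; malformed input raises, which num_errors
    converts into the 300-point penalty.
    """
    parsed = ast.literal_eval(list_str.strip())
    return [int(x) for x in parsed]
```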
Score operation usage with callback from `examples/sorting/sorting_032.py:474`:
```python
operations_graph.append_operation(operations.Score(1, False, utils.num_errors))
```
KeepBestN with lower-is-better from `examples/sorting/sorting_032.py:508`:
```python
keep_best_1 = operations.KeepBestN(1, False)
```
KeepBestN error handling fallback from `graph_of_thoughts/operations/operations.py:649-668`:
```python
try:
    return sorted(
        previous_thoughts,
        key=lambda thought: thought.score,
        reverse=self.higher_is_better,
    )[: self.n]
except:
    self.logger.error("Error in KeepBestN operation")
    return sorted(
        [i for i in previous_thoughts if isinstance(i.score, float)],
        key=lambda thought: thought.score,
        reverse=self.higher_is_better,
    )[: self.n]
```
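The fallback branch matters when some thoughts carry non-numeric scores (for example `None` when scoring never ran): comparing `None` with a float raises `TypeError`, so the retry filters to float-scored thoughts before sorting. A minimal reproduction of the two code paths, using a hypothetical `Thought` stand-in rather than the library class:

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class Thought:
    score: Any  # normally a float; may be None if scoring failed


def keep_best(thoughts: List[Thought], n: int, higher_is_better: bool = False) -> List[Thought]:
    """Mimic KeepBestN: sort by score; on mixed-type scores, retry with floats only."""
    try:
        return sorted(thoughts, key=lambda t: t.score,
                      reverse=higher_is_better)[:n]
    except TypeError:  # e.g. '<' not supported between None and float
        return sorted((t for t in thoughts if isinstance(t.score, float)),
                      key=lambda t: t.score, reverse=higher_is_better)[:n]
```

With lower-is-better scores, the best thought is simply the one with the fewest counted errors; unscored thoughts are silently dropped by the retry.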