Jump to content

Connect Leeroopedia MCP: Equip your AI agents to search best practices, build plans, verify code, diagnose failures, and look up hyperparameter defaults.

Heuristic:Spcl Graph of thoughts Scoring With Error Counting

From Leeroopedia
Knowledge Sources
Domains LLM_Reasoning, Optimization
Last Updated 2026-02-14 03:30 GMT

Overview

Use deterministic error-counting scoring functions instead of LLM-based scoring for reliable thought evaluation in GoT pipelines.

Description

The Score operation in Graph of Thoughts supports two scoring modes: LLM-prompted scoring (where the LLM evaluates solution quality) and callback-based scoring (where a Python function computes a deterministic score). The benchmark examples exclusively use callback-based scoring with error-counting functions. These functions compare the current thought state against either the ground truth or the expected properties, returning the count of errors as the score. This approach is lower-is-better: a score of 0 means a perfect solution, and `KeepBestN` uses `higher_is_better=False`.

Usage

Use this heuristic when:

  • The problem has a deterministic correctness criterion (e.g., sorted order, exact frequency counts)
  • You need reproducible scoring that does not depend on LLM stochasticity
  • You want to minimize API costs by avoiding additional LLM calls for scoring
  • The score is used with `KeepBestN` to select the best candidate from multiple generated solutions

The Insight (Rule of Thumb)

  • Action: Pass a `scoring_function` callback to the `Score` operation instead of relying on LLM scoring.
  • Value: Use error count (lower is better) with `KeepBestN(n, higher_is_better=False)`.
  • Trade-off: Deterministic but requires domain knowledge to write the scoring function. Cannot evaluate subjective quality.
  • Fallback: The scoring function should return a high penalty (e.g., 300) on exceptions to push unparseable results to the bottom of the ranking.

Reasoning

LLM-based scoring adds latency, cost, and non-determinism. For problems with well-defined correctness criteria, a deterministic scoring function is superior because:

  1. Zero additional API cost — no extra LLM calls needed for scoring
  2. Consistent results — same input always produces same score
  3. Fine-grained — can count exact number of errors, not just binary correct/incorrect
  4. Robust — exception handling returns high penalty rather than crashing

Code Evidence

Sorting error-counting function from `examples/sorting/utils.py:46-78`:

def num_errors(state: Dict) -> float:
    try:
        unsorted_list = state["original"]
        if (
            "unsorted_sublist" in state
            and state["unsorted_sublist"] != ""
            and state["unsorted_sublist"] is not None
            and len(state["unsorted_sublist"]) < len(unsorted_list) - 5
        ):
            unsorted_list = state["unsorted_sublist"]
        correct_list = sorted(string_to_list(unsorted_list))
        current_list = string_to_list(state["current"])
        num_errors = 0
        for i in range(10):
            num_errors += abs(
                sum([1 for num in current_list if num == i])
                - sum([1 for num in correct_list if num == i])
            )
        num_errors += sum(
            [1 for num1, num2 in zip(current_list, current_list[1:]) if num1 > num2]
        )
        return num_errors
    except:
        return 300

Score operation usage with callback from `examples/sorting/sorting_032.py:474`:

operations_graph.append_operation(operations.Score(1, False, utils.num_errors))

KeepBestN with lower-is-better from `examples/sorting/sorting_032.py:508`:

keep_best_1 = operations.KeepBestN(1, False)

KeepBestN error handling fallback from `graph_of_thoughts/operations/operations.py:649-668`:

try:
    return sorted(
        previous_thoughts,
        key=lambda thought: thought.score,
        reverse=self.higher_is_better,
    )[: self.n]
except:
    self.logger.error("Error in KeepBestN operation")
    return sorted(
        [i for i in previous_thoughts if isinstance(i.score, float)],
        key=lambda thought: thought.score,
        reverse=self.higher_is_better,
    )[: self.n]

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment