Principle:Spcl Graph of thoughts Thought Scoring
| Knowledge Sources | |
|---|---|
| Domains | Graph_Reasoning, Thought_Operations |
| Implementations | Implementation:Spcl_Graph_of_thoughts_Score_Operation |
| Last Updated | 2026-02-14 |
Overview
Operation pattern that assigns numerical quality scores to thoughts using either programmatic functions or LLM-based evaluation.
Scoring is the evaluation mechanism in the Graph of Thoughts framework. While Generate creates candidate thoughts, Score evaluates their quality, enabling the graph to prune poor candidates and focus computational resources on promising directions. Scores are the basis for selection operations like KeepBestN.
Core Concepts
Two Scoring Modes
The Score operation supports two fundamentally different approaches to evaluating thought quality:
Programmatic Scoring
When a scoring_function callable is provided, the operation evaluates thoughts purely through code -- no LLM call is made. The function receives a thought's state dictionary (or a list of state dictionaries for combined scoring) and returns a numerical score (or list of scores).
This mode is preferred when:
- The quality metric is computable (e.g., counting out-of-order elements in a candidate sorting)
- Deterministic, reproducible scoring is required
- API cost must be minimized
- The evaluation criteria are well-defined and algorithmic
Example: In the sorting task, the scoring function counts the number of elements that are out of order -- a purely algorithmic check that requires no LLM reasoning.
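Such a scoring function might look like the following minimal sketch. The state key `current` and the JSON-encoded list are assumptions for illustration; the real task defines its own state layout:

```python
import json

# Hedged sketch of a programmatic scoring function for a sorting task.
# The state key "current" holding a JSON-encoded list is an assumption,
# not the library's actual state layout.
def num_errors(state: dict) -> float:
    """Count adjacent out-of-order pairs; 0.0 means fully sorted."""
    values = json.loads(state["current"])
    return float(sum(1 for a, b in zip(values, values[1:]) if a > b))

num_errors({"current": "[1, 2, 3]"})  # -> 0.0
num_errors({"current": "[3, 1, 2]"})  # -> 1.0
```

Because lower scores indicate fewer errors here, a downstream selection step would keep the thoughts with the smallest scores.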
LLM-Based Scoring
When no scoring_function is provided (i.e., it is None), the operation falls back to LLM-based evaluation:
- The Prompter constructs a scoring prompt from the thought state(s) via prompter.score_prompt
- The Language Model is queried with this prompt (potentially multiple times via num_samples)
- The Parser extracts numerical scores from the LLM's text response via parser.parse_score_answer
This mode is appropriate when:
- Quality is subjective or requires reasoning (e.g., evaluating writing quality)
- The evaluation criteria are difficult to express as code
- Human-like judgment is needed
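The prompt-then-parse round trip can be sketched with illustrative stand-ins for the Prompter and Parser roles. The method names mirror the description above, but the bodies are assumptions, not the library's implementation:

```python
import re

# Illustrative stand-ins for the Prompter and Parser roles; the method
# names mirror the text above, the bodies are toy assumptions.
class ToyPrompter:
    def score_prompt(self, states: list) -> str:
        return f"Rate this answer from 0 to 10: {states[0]['answer']}"

class ToyParser:
    def parse_score_answer(self, states: list, texts: list) -> list:
        # Pull the first number out of each LLM response text.
        return [float(re.search(r"\d+(\.\d+)?", t).group(0)) for t in texts]

prompt = ToyPrompter().score_prompt([{"answer": "sorted: [1, 2, 3]"}])
scores = ToyParser().parse_score_answer([{"answer": "sorted: [1, 2, 3]"}],
                                        ["Score: 8"])
# scores == [8.0]
```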
Combined vs. Individual Scoring
The Score operation offers two strategies for scoring multiple thoughts:
Individual Scoring (combined_scoring=False)
Each thought is scored independently in its own LLM call (or function call):
- The scoring function/prompt receives one thought state at a time
- Each thought gets its own score independently of other thoughts
- Requires N LLM calls for N thoughts (when using LLM scoring)
This is the default mode and is appropriate when the quality of each thought can be assessed in isolation.
Combined Scoring (combined_scoring=True)
All thoughts are scored together in a single evaluation:
- The scoring function/prompt receives all thought states simultaneously as a list
- The function/LLM returns a list of scores -- one for each thought
- Requires only 1 LLM call for N thoughts (when using LLM scoring)
This mode is useful when:
- Relative comparison between candidates matters (e.g., ranking)
- Scoring depends on the distribution of all candidates
- Reducing the number of LLM calls is important
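The two strategies imply two scoring-function shapes. The signatures below follow the description above (one state in, one score out, versus a list in, a list out); the length-based metric itself is a toy assumption:

```python
# Sketch of the two scoring-function shapes described above.
def score_individual(state: dict) -> float:
    # Quality of one thought, assessed in isolation.
    return float(len(state["text"]))

def score_combined(states: list) -> list:
    # Scores depend on the whole candidate set: normalize by the
    # longest candidate, so each value is relative to its peers.
    longest = max(len(s["text"]) for s in states)
    return [len(s["text"]) / longest for s in states]

states = [{"text": "ab"}, {"text": "abcd"}]
[score_individual(s) for s in states]  # -> [2.0, 4.0]
score_combined(states)                 # -> [0.5, 1.0]
```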
Scores and Thought Lifecycle
When a thought is scored, the Score operation:
- Creates a new Thought cloned from the original (via Thought.from_thought)
- Sets the .score property on the new thought, which also marks it as .scored = True
- Appends the new thought to the operation's output
The original thought from the predecessor is not mutated. This immutability pattern ensures that multiple downstream operations can score the same thoughts independently without interference.
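A toy model of this clone-then-score lifecycle, using a simplified `Thought` stand-in (the library's class carries more metadata, and assigning `.score` there flips `.scored` via a property setter, which the sketch imitates by setting both fields):

```python
from dataclasses import dataclass

# Simplified stand-in for the library's Thought class.
@dataclass
class Thought:
    state: dict
    score: float = None
    scored: bool = False

    @classmethod
    def from_thought(cls, other: "Thought") -> "Thought":
        # Copy the state so the clone is independent of the original.
        return cls(state=dict(other.state))

original = Thought(state={"answer": "draft"})
clone = Thought.from_thought(original)
clone.score, clone.scored = 0.9, True  # library: .score setter sets .scored
# The predecessor's thought is untouched:
# original.score is None, original.scored is False
```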
Multi-Sample Scoring
The num_samples parameter controls how many times the LLM is queried for each scoring evaluation (only relevant for LLM-based scoring). Multiple samples can be useful for:
- Reducing scoring variance by averaging or aggregating multiple LLM judgments
- Getting more robust evaluations when LLM scoring is noisy
The responses from all samples are passed together to the Parser's parse_score_answer method, which determines how to combine them into a final score.
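One plausible aggregation a task-specific Parser could perform is shown below; the median choice and the `score: N` response format are assumptions, since the framework leaves the combination strategy to the Parser:

```python
import statistics

# Assumed aggregation: fold num_samples LLM responses into one robust
# score via the median. The "score: N" format is also an assumption.
def aggregate_samples(sample_texts: list) -> float:
    scores = [float(t.split(":")[-1]) for t in sample_texts]
    return statistics.median(scores)

aggregate_samples(["score: 7", "score: 9", "score: 8"])  # -> 8.0
```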
Interaction with Other Operations
Score operations appear in these typical patterns:
- Score + KeepBestN -- the most common pattern: score all candidates, then keep only the top N. This is the selection mechanism that enables search through the thought space.
- Score after Generate -- scores the raw output of a Generate operation to identify promising candidates.
- Score after Improve -- evaluates whether an improvement step actually made the thought better.
- Score after Aggregate -- evaluates the quality of merged results.
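The Score + KeepBestN pattern can be sketched end to end with plain lists standing in for the library's operation graph; lower score means better here, matching the error-counting example:

```python
# Toy sketch of the Score + KeepBestN selection pattern, with plain
# lists instead of the library's graph of operations.
def score_all(states, scoring_function):
    return [(s, scoring_function(s)) for s in states]

def keep_best_n(scored, n, higher_is_better=False):
    # Sort by score and keep the top n candidates.
    return sorted(scored, key=lambda p: p[1], reverse=higher_is_better)[:n]

candidates = [{"errors": 3}, {"errors": 0}, {"errors": 1}]
best = keep_best_n(score_all(candidates, lambda s: float(s["errors"])), n=2)
# best == [({"errors": 0}, 0.0), ({"errors": 1}, 1.0)]
```

This is the pruning step that lets the graph discard weak candidates before spending further LLM calls on them.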
Design Rationale
The dual scoring approach (programmatic vs. LLM-based) reflects a practical insight: many research tasks have well-defined metrics (sorting correctness, set intersection accuracy) where LLM evaluation would be wasteful. By supporting both modes behind the same interface, the framework lets users optimize for cost and accuracy on a per-task basis.
The combined scoring option addresses the efficiency concern that scoring N thoughts individually requires N LLM calls, which can be expensive for large candidate sets. Combined scoring trades some scoring independence for a significant reduction in API costs.
Related Pages
- Implementation:Spcl_Graph_of_thoughts_Score_Operation
- Heuristic:Spcl_Graph_of_thoughts_Scoring_With_Error_Counting