Principle:Spcl Graph of thoughts Thought Scoring
| Knowledge Sources | |
|---|---|
| Domains | Graph_Reasoning, Thought_Operations |
| Implementations | Implementation:Spcl_Graph_of_thoughts_Score_Operation |
| Last Updated | 2026-02-14 |
Overview
Operation pattern that assigns numerical quality scores to thoughts using either programmatic functions or LLM-based evaluation.
Scoring is the evaluation mechanism in the Graph of Thoughts framework. While Generate creates candidate thoughts, Score evaluates their quality, enabling the graph to prune poor candidates and focus computational resources on promising directions. Scores are the basis for selection operations like KeepBestN.
Core Concepts
Two Scoring Modes
The Score operation supports two fundamentally different approaches to evaluating thought quality:
Programmatic Scoring
When a scoring_function callable is provided, the operation evaluates thoughts purely through code -- no LLM call is made. The function receives a thought's state dictionary (or a list of state dictionaries for combined scoring) and returns a numerical score (or list of scores).
This mode is preferred when:
- The quality metric is computable (e.g., counting out-of-order elements in a candidate sorting)
- Deterministic, reproducible scoring is required
- API cost must be minimized
- The evaluation criteria are well-defined and algorithmic
Example: In the sorting task, the scoring function counts the number of elements that are out of order -- a purely algorithmic check that requires no LLM reasoning.
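Such a scoring function might look like the following minimal sketch. The state key `current` and the JSON-encoded list are assumptions for illustration; the real task defines its own state layout:

```python
import json

# Hedged sketch of a programmatic scoring function for a sorting task.
# The state key "current" holding a JSON-encoded list is an assumption,
# not the library's actual state layout.
def num_errors(state: dict) -> float:
    """Count adjacent out-of-order pairs; 0.0 means fully sorted."""
    values = json.loads(state["current"])
    return float(sum(1 for a, b in zip(values, values[1:]) if a > b))

num_errors({"current": "[1, 2, 3]"})  # -> 0.0
num_errors({"current": "[3, 1, 2]"})  # -> 1.0
```

Because lower scores indicate fewer errors here, a downstream selection step would keep the thoughts with the smallest scores.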
LLM-Based Scoring
When no scoring_function is provided (i.e., it is None), the operation falls back to LLM-based evaluation:
- The Prompter constructs a scoring prompt from the thought state(s) via prompter.score_prompt
- The Language Model is queried with this prompt (potentially multiple times via num_samples)
- The Parser extracts numerical scores from the LLM's text response via parser.parse_score_answer
This mode is appropriate when:
- Quality is subjective or requires reasoning (e.g., evaluating writing quality)
- The evaluation criteria are difficult to express as code
- Human-like judgment is needed
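The prompt-then-parse round trip can be sketched with illustrative stand-ins for the Prompter and Parser roles. The method names mirror the description above, but the bodies are assumptions, not the library's implementation:

```python
import re

# Illustrative stand-ins for the Prompter and Parser roles; the method
# names mirror the text above, the bodies are toy assumptions.
class ToyPrompter:
    def score_prompt(self, states: list) -> str:
        return f"Rate this answer from 0 to 10: {states[0]['answer']}"

class ToyParser:
    def parse_score_answer(self, states: list, texts: list) -> list:
        # Pull the first number out of each LLM response text.
        return [float(re.search(r"\d+(\.\d+)?", t).group(0)) for t in texts]

prompt = ToyPrompter().score_prompt([{"answer": "sorted: [1, 2, 3]"}])
scores = ToyParser().parse_score_answer([{"answer": "sorted: [1, 2, 3]"}],
                                        ["Score: 8"])
# scores == [8.0]
```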
Combined vs. Individual Scoring
The Score operation offers two strategies for scoring multiple thoughts:
Individual Scoring (combined_scoring=False)
Each thought is scored independently in its own LLM call (or function call):
- The scoring function/prompt receives one thought state at a time
- Each thought gets its own score independently of other thoughts
- Requires N LLM calls for N thoughts (when using LLM scoring)
This is the default mode and is appropriate when the quality of each thought can be assessed in isolation.
Combined Scoring (combined_scoring=True)
All thoughts are scored together in a single evaluation:
- The scoring function/prompt receives all thought states simultaneously as a list
- The function/LLM returns a list of scores -- one for each thought
- Requires only 1 LLM call for N thoughts (when using LLM scoring)
This mode is useful when:
- Relative comparison between candidates matters (e.g., ranking)
- Scoring depends on the distribution of all candidates
- Reducing the number of LLM calls is important
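The two strategies imply two scoring-function shapes. The signatures below follow the description above (one state in, one score out, versus a list in, a list out); the length-based metric itself is a toy assumption:

```python
# Sketch of the two scoring-function shapes described above.
def score_individual(state: dict) -> float:
    # Quality of one thought, assessed in isolation.
    return float(len(state["text"]))

def score_combined(states: list) -> list:
    # Scores depend on the whole candidate set: normalize by the
    # longest candidate, so each value is relative to its peers.
    longest = max(len(s["text"]) for s in states)
    return [len(s["text"]) / longest for s in states]

states = [{"text": "ab"}, {"text": "abcd"}]
[score_individual(s) for s in states]  # -> [2.0, 4.0]
score_combined(states)                 # -> [0.5, 1.0]
```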
Scores and Thought Lifecycle
When a thought is scored, the Score operation:
- Creates a new Thought cloned from the original (via Thought.from_thought)
- Sets the .score property on the new thought, which also marks it as .scored = True
- Appends the new thought to the operation's output
The original thought from the predecessor is not mutated. This immutability pattern ensures that multiple downstream operations can score the same thoughts independently without interference.
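A toy model of this clone-then-score lifecycle, using a simplified `Thought` stand-in (the library's class carries more metadata, and assigning `.score` there flips `.scored` via a property setter, which the sketch imitates by setting both fields):

```python
from dataclasses import dataclass

# Simplified stand-in for the library's Thought class.
@dataclass
class Thought:
    state: dict
    score: float = None
    scored: bool = False

    @classmethod
    def from_thought(cls, other: "Thought") -> "Thought":
        # Copy the state so the clone is independent of the original.
        return cls(state=dict(other.state))

original = Thought(state={"answer": "draft"})
clone = Thought.from_thought(original)
clone.score, clone.scored = 0.9, True  # library: .score setter sets .scored
# The predecessor's thought is untouched:
# original.score is None, original.scored is False
```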
Multi-Sample Scoring
The num_samples parameter controls how many times the LLM is queried for each scoring evaluation (only relevant for LLM-based scoring). Multiple samples can be useful for:
- Reducing scoring variance by averaging or aggregating multiple LLM judgments
- Getting more robust evaluations when LLM scoring is noisy
The responses from all samples are passed together to the Parser's parse_score_answer method, which determines how to combine them into a final score.
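One plausible aggregation a task-specific Parser could perform is shown below; the median choice and the `score: N` response format are assumptions, since the framework leaves the combination strategy to the Parser:

```python
import statistics

# Assumed aggregation: fold num_samples LLM responses into one robust
# score via the median. The "score: N" format is also an assumption.
def aggregate_samples(sample_texts: list) -> float:
    scores = [float(t.split(":")[-1]) for t in sample_texts]
    return statistics.median(scores)

aggregate_samples(["score: 7", "score: 9", "score: 8"])  # -> 8.0
```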
Interaction with Other Operations
Score operations appear in these typical patterns:
- Score + KeepBestN -- the most common pattern: score all candidates, then keep only the top N. This is the selection mechanism that enables search through the thought space.
- Score after Generate -- scores the raw output of a Generate operation to identify promising candidates.
- Score after Improve -- evaluates whether an improvement step actually made the thought better.
- Score after Aggregate -- evaluates the quality of merged results.
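The Score + KeepBestN pattern can be sketched end to end with plain lists standing in for the library's operation graph; lower score means better here, matching the error-counting example:

```python
# Toy sketch of the Score + KeepBestN selection pattern, with plain
# lists instead of the library's graph of operations.
def score_all(states, scoring_function):
    return [(s, scoring_function(s)) for s in states]

def keep_best_n(scored, n, higher_is_better=False):
    # Sort by score and keep the top n candidates.
    return sorted(scored, key=lambda p: p[1], reverse=higher_is_better)[:n]

candidates = [{"errors": 3}, {"errors": 0}, {"errors": 1}]
best = keep_best_n(score_all(candidates, lambda s: float(s["errors"])), n=2)
# best == [({"errors": 0}, 0.0), ({"errors": 1}, 1.0)]
```

This is the pruning step that lets the graph discard weak candidates before spending further LLM calls on them.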
Design Rationale
The dual scoring approach (programmatic vs. LLM-based) reflects a practical insight: many research tasks have well-defined metrics (sorting correctness, set intersection accuracy) where LLM evaluation would be wasteful. By supporting both modes behind the same interface, the framework lets users optimize for cost and accuracy on a per-task basis.
The combined scoring option addresses the efficiency concern that scoring N thoughts individually requires N LLM calls, which can be expensive for large candidate sets. Combined scoring trades some scoring independence for a significant reduction in API costs.
Related Pages
- Implementation:Spcl_Graph_of_thoughts_Score_Operation
- Heuristic:Spcl_Graph_of_thoughts_Scoring_With_Error_Counting