
Principle:Spcl Graph of thoughts Benchmark Execution Pattern

From Leeroopedia
Pattern Name: Benchmark_Execution_Pattern
Type: Principle
Repository: spcl/graph-of-thoughts
Domains: Benchmarking, Evaluation
Sources: GoT Paper (arXiv:2308.09687)
Related Implementations: Implementation:Spcl_Graph_of_thoughts_Run_Benchmark

Overview

Standardized pattern for benchmarking multiple reasoning approaches (IO, CoT, ToT, GoT) against datasets with budget constraints and result visualization.

Description

All GoT examples (sorting, keyword counting, document merging) follow the same benchmark execution pattern. This is a user-defined convention, not a library API -- each example independently implements a run() function that follows the same structure. The pattern ensures consistent, reproducible benchmarking across different problem domains and reasoning approaches.

The Standard Benchmark Flow

The benchmark pattern consists of six phases:

1. Load Dataset

The dataset is loaded from a CSV file located alongside the example script. Each row contains an integer identifier, an input instance, and (optionally) the ground truth answer.

import csv
import os

data_path = os.path.join(os.path.dirname(__file__), "sorting_032.csv")
data = []
with open(data_path, "r") as f:
    reader = csv.reader(f)
    next(reader)  # skip header
    for row in reader:
        data.append([int(row[0]), row[1], row[2]])

2. Define Graph Builder Functions

Each reasoning approach (IO, CoT, ToT, GoT, etc.) is represented by a Python function that takes no arguments (or domain-specific arguments) and returns a GraphOfOperations instance. These functions are the topology builders described in the GoT Graph Topology Design principle.

approaches = [io, cot, tot, tot2, got]
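A minimal sketch of a builder's shape, using stand-in `Operation` and `GraphOfOperations` classes rather than the library's real operations module (the stand-in classes and the `io` body below are illustrative, not the actual API):

```python
# Stand-in classes; in the real repository these come from
# graph_of_thoughts' operations module.

class Operation:
    def __init__(self, name):
        self.name = name

class GraphOfOperations:
    def __init__(self):
        self.operations = []

    def append_operation(self, op):
        self.operations.append(op)

def io() -> GraphOfOperations:
    """Simplest topology: generate one answer, then score it."""
    gop = GraphOfOperations()
    gop.append_operation(Operation("generate"))
    gop.append_operation(Operation("score"))
    return gop

approaches = [io]  # the real examples list [io, cot, tot, tot2, got]
```

Each builder is a zero-argument (or domain-parameterized) function, so the benchmark loop can construct a fresh topology per sample simply by calling `method()`.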

3. Set Up Results Directory

A timestamped results directory is created, with subdirectories for each method. A configuration JSON file is also saved.

import datetime
import os

timestamp = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
extra_info = f"{lm_name}_{'-'.join([method.__name__ for method in methods])}"
folder_name = f"{extra_info}_{timestamp}"
results_folder = os.path.join(results_dir, folder_name)
os.makedirs(results_folder)

4. Iterate Over Samples and Methods

The core benchmark loop iterates over each data sample and each method. For every combination, it:

  • Creates a fresh language model instance (with caching enabled)
  • Builds the graph topology by calling the method function
  • Creates a Controller with the topology, the domain-specific prompter and parser, and the initial problem state
  • Runs the controller to execute the reasoning graph
  • Serializes results to a per-sample JSON file via Controller.output_graph()
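The loop above can be sketched as follows, with stand-in `Lm` and `Controller` classes in place of the library's real ones (all names in this sketch are illustrative):

```python
import json
import os

class Lm:
    """Stand-in language model with per-instance cost tracking."""
    def __init__(self):
        self.cost = 0.0

class Controller:
    """Stand-in controller (prompter/parser unused in this stub)."""
    def __init__(self, lm, graph, prompter, parser, state):
        self.lm, self.graph, self.state = lm, graph, state

    def run(self):
        self.lm.cost += 0.01  # pretend this execution cost one cent

    def output_graph(self, path):
        with open(path, "w") as f:
            json.dump({"state": self.state}, f)

def run_benchmark(data, methods, results_folder, prompter=None, parser=None):
    os.makedirs(results_folder, exist_ok=True)
    total_cost = 0.0
    for sample_id, problem_input, truth in data:
        for method in methods:
            lm = Lm()                 # fresh LM per method execution
            graph = method()          # build the topology
            ctrl = Controller(lm, graph, prompter, parser,
                              {"input": problem_input, "truth": truth})
            ctrl.run()                # execute the reasoning graph
            out_dir = os.path.join(results_folder, method.__name__)
            os.makedirs(out_dir, exist_ok=True)
            ctrl.output_graph(os.path.join(out_dir, f"{sample_id}.json"))
            total_cost += lm.cost
    return total_cost
```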

5. Track Budget

Before each method execution, the remaining budget is checked. If the cumulative LLM cost exceeds the budget (specified in dollars), execution stops. This prevents runaway costs during benchmarking.

if budget <= 0.0:
    logging.error("Budget has been depleted, stopping.")
    break
# ... execute method ...
budget -= lm.cost

6. Return Total Spend

The run() function returns the total amount spent (original budget minus remaining budget), allowing the caller to track costs.
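In outline, the accounting works like this (illustrative names, not the repository's exact code):

```python
def run_with_budget(budget, method_costs):
    """Deduct per-execution costs from a running budget; return total spend.

    method_costs: one cost entry per method execution, in order.
    """
    remaining = budget
    for cost in method_costs:
        if remaining <= 0.0:
            break          # budget depleted, stop early
        remaining -= cost
    return budget - remaining  # original budget minus remaining budget
```

Note that the spend can slightly exceed the budget: the check happens before each execution, so the last execution may overshoot.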

Key Design Choices

  • Fresh LM per method execution: A new language model instance is created for each method call. This ensures that cost tracking is per-method and that caching behavior is isolated.
  • Exception handling: Each controller.run() call is wrapped in a try/except block so that a single failure does not abort the entire benchmark.
  • Logging: All examples configure file-based logging in the results directory, capturing debug-level information about every LLM call and operation execution.
  • Deterministic file naming: Result files are named {sample_id}.json within method-specific subdirectories, making it easy to programmatically aggregate results.
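The exception-handling choice can be sketched as follows (`execute_safely` is a hypothetical helper; the examples inline the equivalent try/except around `controller.run()`):

```python
import logging

def execute_safely(controller, sample_id):
    """Run one controller; log and continue instead of aborting the benchmark."""
    try:
        controller.run()
    except Exception as e:
        logging.error("Exception while running sample %s: %s", sample_id, e)
        return False
    return True
```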

Usage

Use this pattern when benchmarking a new GoT problem domain. The steps are:

  1. Prepare a dataset in CSV format with an integer ID column, an input column, and a ground truth column.
  2. Implement domain-specific Prompter and Parser subclasses.
  3. Define topology builder functions for each reasoning approach to benchmark (IO, CoT, ToT, GoT, etc.).
  4. Implement a run() function following the standard benchmark flow.
  5. Execute with a budget cap and desired sample indices.
  6. Analyze the resulting JSON files (optionally using a plot script).
The examples' entry points follow this shape:

if __name__ == "__main__":
    budget = 30  # dollars
    samples = list(range(0, 100))
    approaches = [io, cot, tot, tot2, got]

    spent = run(samples, approaches, budget, "chatgpt")
    logging.info(f"Spent {spent} out of {budget} budget.")
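For step 6, the deterministic {sample_id}.json layout makes aggregation straightforward. A minimal sketch, assuming that layout (`collect_results` is a hypothetical helper, not part of the repository):

```python
import json
import os

def collect_results(results_folder):
    """Map each method subdirectory to its parsed per-sample results."""
    results = {}
    for method in sorted(os.listdir(results_folder)):
        method_dir = os.path.join(results_folder, method)
        if not os.path.isdir(method_dir):
            continue  # skip config JSON and log files at the top level
        results[method] = {}
        for fname in os.listdir(method_dir):
            if fname.endswith(".json"):
                sample_id = int(fname[:-5])  # strip ".json"
                with open(os.path.join(method_dir, fname)) as f:
                    results[method][sample_id] = json.load(f)
    return results
```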
