# Principle:Spcl Graph of thoughts Benchmark Execution Pattern
| Field | Value |
|---|---|
| Pattern Name | Benchmark_Execution_Pattern |
| Type | Principle |
| Repository | spcl/graph-of-thoughts |
| Domains | Benchmarking, Evaluation |
| Sources | GoT Paper (arXiv:2308.09687) |
| Related Implementations | Implementation:Spcl_Graph_of_thoughts_Run_Benchmark |
## Overview
Standardized pattern for benchmarking multiple reasoning approaches (IO, CoT, ToT, GoT) against datasets with budget constraints and result visualization.
## Description
All GoT examples (sorting, keyword counting, document merging) follow the same benchmark execution pattern. This is a user-defined convention, not a library API -- each example independently implements a run() function that follows the same structure. The pattern ensures consistent, reproducible benchmarking across different problem domains and reasoning approaches.
## The Standard Benchmark Flow
The benchmark pattern consists of six phases:
### 1. Load Dataset
The dataset is loaded from a CSV file located alongside the example script. Each row contains an integer identifier, an input instance, and (optionally) the ground truth answer.
```python
data_path = os.path.join(os.path.dirname(__file__), "sorting_032.csv")
data = []
with open(data_path, "r") as f:
    reader = csv.reader(f)
    next(reader)  # skip header
    for row in reader:
        data.append([int(row[0]), row[1], row[2]])
```
### 2. Define Graph Builder Functions
Each reasoning approach (IO, CoT, ToT, GoT, etc.) is represented by a Python function that takes no arguments (or domain-specific arguments) and returns a GraphOfOperations instance. These functions are the topology builders described in the GoT Graph Topology Design principle.
```python
approaches = [io, cot, tot, tot2, got]
```
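The shape of such a builder can be sketched as below. The classes here are minimal stand-ins so the sketch runs on its own; real builders use the corresponding classes from `graph_of_thoughts.operations`, and the constructor signatures shown are assumptions modeled on the examples, not the library's documented API.

```python
# Stand-ins for graph_of_thoughts.operations classes (assumed shapes).
class GraphOfOperations:
    def __init__(self):
        self.operations = []

    def append_operation(self, op):
        self.operations.append(op)

class Generate:
    def __init__(self, num_branches_prompt=1, num_branches_response=1):
        self.num_branches = (num_branches_prompt, num_branches_response)

class GroundTruth:
    def __init__(self, test_function):
        self.test_function = test_function

def io() -> GraphOfOperations:
    # Direct (input-output) prompting: generate one answer, then check it.
    graph = GraphOfOperations()
    graph.append_operation(Generate(1, 1))
    graph.append_operation(GroundTruth(lambda state: True))
    return graph
```

Each approach (`cot`, `tot`, `got`, ...) is a function of this form; only the sequence of appended operations differs.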
### 3. Set Up Results Directory
A timestamped results directory is created, with subdirectories for each method. A configuration JSON file is also saved.
```python
timestamp = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
extra_info = f"{lm_name}_{'-'.join([method.__name__ for method in methods])}"
folder_name = f"{extra_info}_{timestamp}"
results_folder = os.path.join(results_dir, folder_name)
os.makedirs(results_folder)
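Saving the configuration JSON can look like the following sketch; the `config.json` filename and field names are illustrative assumptions, not a documented schema.

```python
import json
import os

def save_config(results_folder, methods, lm_name, budget):
    # Persist the benchmark configuration next to the results so a run
    # can be identified later; field names here are illustrative.
    config = {
        "methods": [method.__name__ for method in methods],
        "lm": lm_name,
        "budget": budget,
    }
    with open(os.path.join(results_folder, "config.json"), "w") as f:
        json.dump(config, f, indent=2)
```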
### 4. Iterate Over Samples and Methods
The core benchmark loop iterates over each data sample and each method. For every combination, it:
- Creates a fresh language model instance (with caching enabled)
- Builds the graph topology by calling the method function
- Creates a `Controller` with the topology, the domain-specific prompter and parser, and the initial problem state
- Runs the controller to execute the reasoning graph
- Serializes results to a per-sample JSON file via `Controller.output_graph()`
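The loop structure can be sketched as follows. `FakeLM` is a stand-in that only models per-call cost, and the controller construction is shown as comments because it needs the library and an API key; everything beyond the bullet points above is an assumption.

```python
import json
import os

class FakeLM:
    """Stand-in for a graph_of_thoughts language model; only tracks cost."""
    def __init__(self):
        self.cost = 0.01  # pretend each method execution costs one cent

def run_samples(data, methods, budget, results_folder):
    spent = 0.0
    for sample_id, instance, truth in data:
        for method in methods:
            if spent >= budget:
                return spent            # budget depleted, stop early
            lm = FakeLM()               # fresh LM per method execution
            graph = method()            # build the topology for this approach
            # In the real pattern:
            #   ctrl = Controller(lm, graph, prompter, parser, initial_state)
            #   ctrl.run()
            #   ctrl.output_graph(path)
            path = os.path.join(results_folder, method.__name__,
                                f"{sample_id}.json")
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "w") as f:
                json.dump({"id": sample_id, "graph": graph}, f)
            spent += lm.cost
    return spent
```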
### 5. Track Budget
Before each method execution, the remaining budget is checked. If the cumulative LLM cost exceeds the budget (specified in dollars), execution stops. This prevents runaway costs during benchmarking.
```python
if budget <= 0.0:
    logging.error("Budget has been depleted, stopping.")
    break
# ... execute method ...
budget -= lm.cost
```
### 6. Return Total Spend
The `run()` function returns the total amount spent (original budget minus remaining budget), allowing the caller to track costs.
## Key Design Choices
- Fresh LM per method execution: A new language model instance is created for each method call. This ensures that cost tracking is per-method and that caching behavior is isolated.
- Exception handling: Each `controller.run()` call is wrapped in a try/except block so that a single failure does not abort the entire benchmark.
- Logging: All examples configure file-based logging in the results directory, capturing debug-level information about every LLM call and operation execution.
- Deterministic file naming: Result files are named `{sample_id}.json` within method-specific subdirectories, making it easy to programmatically aggregate results.
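Because the layout is predictable, results can be gathered with a short script. A sketch, assuming only the `{method}/{sample_id}.json` layout described above:

```python
import glob
import json
import os

def collect_results(results_folder):
    # Build {method: {sample_id: parsed_json}} from the per-sample files,
    # relying only on the <method>/<sample_id>.json naming convention.
    results = {}
    for path in glob.glob(os.path.join(results_folder, "*", "*.json")):
        method = os.path.basename(os.path.dirname(path))
        sample_id = int(os.path.splitext(os.path.basename(path))[0])
        with open(path) as f:
            results.setdefault(method, {})[sample_id] = json.load(f)
    return results
```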
## Usage
Use this pattern when benchmarking a new GoT problem domain. The steps are:
- Prepare a dataset in CSV format with an integer ID column, an input column, and a ground truth column.
- Implement domain-specific `Prompter` and `Parser` subclasses.
- Define topology builder functions for each reasoning approach to benchmark (IO, CoT, ToT, GoT, etc.).
- Implement a `run()` function following the standard benchmark flow.
- Execute with a budget cap and desired sample indices.
- Analyze the resulting JSON files (optionally using a plot script).
```python
if __name__ == "__main__":
    budget = 30  # dollars
    samples = list(range(0, 100))
    approaches = [io, cot, tot, tot2, got]
    spent = run(samples, approaches, budget, "chatgpt")
    logging.info(f"Spent {spent} out of {budget} budget.")
```
## Sources
- Besta, M. et al. "Graph of Thoughts: Solving Elaborate Problems with Large Language Models." arXiv:2308.09687, 2023.
- spcl/graph-of-thoughts GitHub repository