# Implementation: Spcl Graph of Thoughts Run Benchmark
| Field | Value |
|---|---|
| Pattern Name | Run_Benchmark |
| Type | Implementation (Pattern Doc) |
| Repository | spcl/graph-of-thoughts |
| Source Files | examples/sorting/sorting_032.py (L601-713), examples/keyword_counting/keyword_counting.py (L1321-1439), examples/doc_merge/doc_merge.py (L636-752) |
| Domains | Benchmarking, Evaluation |
| Related Principle | Principle:Spcl_Graph_of_thoughts_Benchmark_Execution_Pattern |
## Overview
This Pattern Doc documents the standardized `run()` function pattern used across all three GoT examples (sorting, keyword counting, and document merging). Each example independently implements a `run()` function that follows the same benchmark execution flow. This is not a library API; it is a user-defined convention shared by the examples.
## Source References
| Example | File | Lines | Methods Benchmarked |
|---|---|---|---|
| Sorting (32 elements) | examples/sorting/sorting_032.py | L601-713 | io, cot, tot, tot2, got |
| Keyword Counting | examples/keyword_counting/keyword_counting.py | L1321-1439 | io, cot, tot, tot2, got4, got8, gotx |
| Document Merging | examples/doc_merge/doc_merge.py | L636-752 | io, cot, tot, got, got2 |
## Common Signature
All three `run()` functions share the same signature:
```python
def run(
    data_ids: List[int],
    methods: List[Callable[[], operations.GraphOfOperations]],
    budget: float,
    lm_name: str,
) -> float
```
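For reference, the signature can be written as a self-contained, runnable stub. The `GraphOfOperations` class below is a placeholder standing in for `operations.GraphOfOperations` so the snippet does not need the GoT package; the stub body is not the real implementation.

```python
from typing import Callable, List


class GraphOfOperations:
    """Placeholder for operations.GraphOfOperations (not the real class)."""


def run(
    data_ids: List[int],
    methods: List[Callable[[], GraphOfOperations]],
    budget: float,
    lm_name: str,
) -> float:
    """Stub with the shared signature; a real run() accumulates LM cost."""
    spent = 0.0  # the actual examples subtract lm.cost from budget per execution
    return spent
```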
### Parameters
| Parameter | Type | Description |
|---|---|---|
| `data_ids` | `List[int]` | Indices of the dataset samples to run. If empty or None, all samples are used. |
| `methods` | `List[Callable]` | List of topology builder functions. Each function returns a `GraphOfOperations`. |
| `budget` | `float` | Maximum LLM spending limit in dollars. Execution halts when the budget is depleted. |
| `lm_name` | `str` | Name of the language model to use (e.g., `"chatgpt"`). |
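The "empty or None means all samples" behavior of `data_ids` can be sketched in a few lines. This is a toy illustration, not code from the repository:

```python
def select_samples(data, data_ids):
    """Return the rows named by data_ids, or every row when none are given."""
    if not data_ids:  # covers both None and []
        data_ids = list(range(len(data)))
    return [data[i] for i in data_ids]


# Toy 4-row dataset
rows = [["0", "a"], ["1", "b"], ["2", "c"], ["3", "d"]]
print(select_samples(rows, [1, 3]))   # rows 1 and 3 only
print(len(select_samples(rows, [])))  # 4, i.e. all rows
```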
### Returns
| Type | Description |
|---|---|
| `float` | Total amount spent in dollars across all method/sample executions. |
## I/O

- Input: `data_ids` (list of sample indices), `methods` (list of topology builder callables), `budget` (float, dollars), `lm_name` (string).
- Output: per-sample JSON files (one per sample per method) in a timestamped results directory, plus a `config.json` and a `log.log`. Returns the total cost as a float.
## Canonical Pattern
The following pseudocode captures the common structure across all three examples:
```python
def run(data_ids, methods, budget, lm_name):
    orig_budget = budget

    # 1. Load dataset from CSV
    data = load_csv(dataset_path)
    selected_data = [data[i] for i in data_ids]

    # 2. Create timestamped results directory
    results_folder = create_results_folder(lm_name, methods)
    save_config(results_folder, selected_data, methods, lm_name, budget)
    setup_logging(results_folder)

    # 3. Create per-method subdirectories
    for method in methods:
        os.makedirs(os.path.join(results_folder, method.__name__))

    # 4. Core benchmark loop
    for data in selected_data:
        if budget <= 0.0:
            break
        for method in methods:
            if budget <= 0.0:
                break
            # Fresh LM instance per execution
            lm = language_models.ChatGPT(config_path, model_name=lm_name, cache=True)
            # Build the topology
            operations_graph = method()
            # Create controller with domain-specific prompter/parser
            executor = controller.Controller(
                lm,
                operations_graph,
                DomainPrompter(),
                DomainParser(),
                {  # initial problem state
                    "original": data[1],
                    "current": "",
                    "phase": 0,
                    "method": method.__name__,
                },
            )
            # Execute
            try:
                executor.run()
            except Exception as e:
                logging.error(f"Exception: {e}")
            # Serialize results
            path = os.path.join(results_folder, method.__name__, f"{data[0]}.json")
            executor.output_graph(path)
            # Track cost
            budget -= lm.cost

    return orig_budget - budget
## Example-Specific Differences
While all three examples follow the same core pattern, there are minor domain-specific differences:
### Sorting
```python
# Initial problem state for sorting
{
    "original": data[1],  # unsorted list as string, e.g. "[3, 7, 0, 2, ...]"
    "current": "",
    "phase": 0,
    "method": method.__name__,
}
```
The sorting example uses `SortingPrompter` and `SortingParser`. Each dataset CSV row contains `[id, unsorted_list, sorted_list]`.
### Keyword Counting
```python
# Initial problem state for keyword counting
{
    "original": data[1],      # input text
    "ground_truth": data[2],  # correct frequency list
    "current": "",
    "phase": 0,
    "method": method.__name__,
}
```
The keyword counting example additionally computes `all_potential_countries` from the dataset and passes it to each method function as a parameter. It uses `KeywordCountingPrompter` and `KeywordCountingParser`.
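This makes the keyword-counting builders one-argument callables rather than the zero-argument `method()` of the canonical pattern. A toy sketch of the calling convention (the builder body is a stand-in, not the real topology):

```python
def got4(all_potential_countries):
    """Stand-in builder; the real one returns a GraphOfOperations."""
    return {"branches": 4, "countries": all_potential_countries}


# Computed once from the dataset, then shared by every builder call
all_potential_countries = ["Canada", "Peru", "Japan"]
graph = got4(all_potential_countries)
print(graph["branches"])  # 4
```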
### Document Merging
```python
# Initial problem state for document merging
{
    "documents": [data[2], data[3], data[4], data[5]],  # 4 NDA documents
    "parts": set(),
    "current": "",
    "method": method.__name__,
}
```
The document merging example passes four documents as the input. It also includes additional post-processing to convert `set` objects to lists before JSON serialization, since Python sets are not JSON-serializable:
```python
# Convert set-valued state before serializing
for operation in operations_graph.operations:
    for thought in operation.thoughts:
        thought.state["parts"] = list(thought.state["parts"])
executor.output_graph(path)
```
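The need for this conversion is easy to reproduce: `json.dumps` raises `TypeError` on a `set`, while the converted state serializes cleanly (toy state, not repository code):

```python
import json

state = {"parts": {"Term 1", "Term 2"}, "current": ""}

try:
    json.dumps(state)  # fails: sets are not JSON-serializable
except TypeError as e:
    print(f"not serializable: {e}")

state["parts"] = sorted(state["parts"])  # sorted() gives deterministic output
print(json.dumps(state))
```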
It uses `DocMergePrompter` and `DocMergeParser`.
## Output Directory Structure
Each benchmark run produces the following directory structure:
```
results/
    {lm_name}_{method1}-{method2}-..._{timestamp}/
        config.json    # run configuration
        log.log        # debug-level execution log
        io/
            0.json     # results for sample 0
            1.json     # results for sample 1
            ...
        cot/
            0.json
            1.json
            ...
        got/
            0.json
            1.json
            ...
```
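A hedged sketch of walking this layout after a run, mapping each method subdirectory to its number of per-sample JSON files (`count_results` is a hypothetical helper, not part of the repository):

```python
import os


def count_results(results_folder):
    """Map each method subdirectory to its number of per-sample JSON files."""
    counts = {}
    for entry in sorted(os.listdir(results_folder)):
        subdir = os.path.join(results_folder, entry)
        if os.path.isdir(subdir):
            counts[entry] = sum(f.endswith(".json") for f in os.listdir(subdir))
    return counts
```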
## Usage Example
```python
if __name__ == "__main__":
    budget = 30  # dollars
    samples = list(range(0, 100))
    approaches = [io, cot, tot, tot2, got]
    spent = run(samples, approaches, budget, "chatgpt")
    logging.info(f"Spent {spent} out of {budget} budget.")
```