# Implementation: Spcl Graph of Thoughts Run Benchmark
| Field | Value |
|---|---|
| Pattern Name | Run_Benchmark |
| Type | Implementation (Pattern Doc) |
| Repository | spcl/graph-of-thoughts |
| Source Files | examples/sorting/sorting_032.py (L601-713), examples/keyword_counting/keyword_counting.py (L1321-1439), examples/doc_merge/doc_merge.py (L636-752) |
| Domains | Benchmarking, Evaluation |
| Related Principle | Principle:Spcl_Graph_of_thoughts_Benchmark_Execution_Pattern |
## Overview
This Pattern Doc documents the standardized `run()` function pattern used across all three GoT examples (sorting, keyword counting, and document merging). Each example independently implements a `run()` function that follows the same benchmark execution flow. This is not a library API; it is a user-defined convention shared by the examples.
## Source References
| Example | File | Lines | Methods Benchmarked |
|---|---|---|---|
| Sorting (32 elements) | examples/sorting/sorting_032.py | L601-713 | io, cot, tot, tot2, got |
| Keyword Counting | examples/keyword_counting/keyword_counting.py | L1321-1439 | io, cot, tot, tot2, got4, got8, gotx |
| Document Merging | examples/doc_merge/doc_merge.py | L636-752 | io, cot, tot, got, got2 |
## Common Signature
All three `run()` functions share the same signature:
```python
def run(
    data_ids: List[int],
    methods: List[Callable[[], operations.GraphOfOperations]],
    budget: float,
    lm_name: str,
) -> float
```
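For reference, the signature can be written as a self-contained, runnable stub. The `GraphOfOperations` class below is a placeholder standing in for `operations.GraphOfOperations` so the snippet does not need the GoT package; the stub body is not the real implementation.

```python
from typing import Callable, List


class GraphOfOperations:
    """Placeholder for operations.GraphOfOperations (not the real class)."""


def run(
    data_ids: List[int],
    methods: List[Callable[[], GraphOfOperations]],
    budget: float,
    lm_name: str,
) -> float:
    """Stub with the shared signature; a real run() accumulates LM cost."""
    spent = 0.0  # the actual examples subtract lm.cost from budget per execution
    return spent
```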
### Parameters
| Parameter | Type | Description |
|---|---|---|
| `data_ids` | `List[int]` | Indices of the dataset samples to run. If empty or None, all samples are used. |
| `methods` | `List[Callable]` | List of topology builder functions. Each function returns a `GraphOfOperations`. |
| `budget` | `float` | Maximum LLM spending limit in dollars. Execution halts when the budget is depleted. |
| `lm_name` | `str` | Name of the language model to use (e.g., `"chatgpt"`). |
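The "empty or None means all samples" behavior of `data_ids` can be sketched in a few lines. This is a toy illustration, not code from the repository:

```python
def select_samples(data, data_ids):
    """Return the rows named by data_ids, or every row when none are given."""
    if not data_ids:  # covers both None and []
        data_ids = list(range(len(data)))
    return [data[i] for i in data_ids]


# Toy 4-row dataset
rows = [["0", "a"], ["1", "b"], ["2", "c"], ["3", "d"]]
print(select_samples(rows, [1, 3]))   # rows 1 and 3 only
print(len(select_samples(rows, [])))  # 4, i.e. all rows
```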
### Returns
| Type | Description |
|---|---|
| `float` | Total amount spent in dollars across all method/sample executions. |
## I/O

- Input: `data_ids` (list of sample indices), `methods` (list of topology builder callables), `budget` (float, dollars), `lm_name` (string).
- Output: per-sample JSON files (one per sample per method) in a timestamped results directory, plus a `config.json` and a `log.log`. Returns the total cost as a float.
## Canonical Pattern
The following pseudocode captures the common structure across all three examples:
```python
def run(data_ids, methods, budget, lm_name):
    orig_budget = budget

    # 1. Load dataset from CSV
    data = load_csv(dataset_path)
    selected_data = [data[i] for i in data_ids]

    # 2. Create timestamped results directory
    results_folder = create_results_folder(lm_name, methods)
    save_config(results_folder, selected_data, methods, lm_name, budget)
    setup_logging(results_folder)

    # 3. Create per-method subdirectories
    for method in methods:
        os.makedirs(os.path.join(results_folder, method.__name__))

    # 4. Core benchmark loop
    for data in selected_data:
        if budget <= 0.0:
            break
        for method in methods:
            if budget <= 0.0:
                break
            # Fresh LM instance per execution
            lm = language_models.ChatGPT(config_path, model_name=lm_name, cache=True)
            # Build the topology
            operations_graph = method()
            # Create controller with domain-specific prompter/parser
            executor = controller.Controller(
                lm,
                operations_graph,
                DomainPrompter(),
                DomainParser(),
                {  # initial problem state
                    "original": data[1],
                    "current": "",
                    "phase": 0,
                    "method": method.__name__,
                },
            )
            # Execute
            try:
                executor.run()
            except Exception as e:
                logging.error(f"Exception: {e}")
            # Serialize results
            path = os.path.join(results_folder, method.__name__, f"{data[0]}.json")
            executor.output_graph(path)
            # Track cost
            budget -= lm.cost

    return orig_budget - budget
## Example-Specific Differences
While all three examples follow the same core pattern, there are minor domain-specific differences:
### Sorting
```python
# Initial problem state for sorting
{
    "original": data[1],  # unsorted list as string, e.g. "[3, 7, 0, 2, ...]"
    "current": "",
    "phase": 0,
    "method": method.__name__,
}
```
The sorting example uses `SortingPrompter` and `SortingParser`. Each dataset CSV row contains `[id, unsorted_list, sorted_list]`.
### Keyword Counting
```python
# Initial problem state for keyword counting
{
    "original": data[1],      # input text
    "ground_truth": data[2],  # correct frequency list
    "current": "",
    "phase": 0,
    "method": method.__name__,
}
```
The keyword counting example additionally computes `all_potential_countries` from the dataset and passes it to each method function as a parameter. It uses `KeywordCountingPrompter` and `KeywordCountingParser`.
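This makes the keyword-counting builders one-argument callables rather than the zero-argument `method()` of the canonical pattern. A toy sketch of the calling convention (the builder body is a stand-in, not the real topology):

```python
def got4(all_potential_countries):
    """Stand-in builder; the real one returns a GraphOfOperations."""
    return {"branches": 4, "countries": all_potential_countries}


# Computed once from the dataset, then shared by every builder call
all_potential_countries = ["Canada", "Peru", "Japan"]
graph = got4(all_potential_countries)
print(graph["branches"])  # 4
```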
### Document Merging
```python
# Initial problem state for document merging
{
    "documents": [data[2], data[3], data[4], data[5]],  # 4 NDA documents
    "parts": set(),
    "current": "",
    "method": method.__name__,
}
```
The document merging example passes four documents as the input. It also includes additional post-processing to convert `set` objects to lists before JSON serialization, since Python sets are not JSON-serializable:
```python
# Convert set-valued state before serializing
for operation in operations_graph.operations:
    for thought in operation.thoughts:
        thought.state["parts"] = list(thought.state["parts"])
executor.output_graph(path)
```
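The need for this conversion is easy to reproduce: `json.dumps` raises `TypeError` on a `set`, while the converted state serializes cleanly (toy state, not repository code):

```python
import json

state = {"parts": {"Term 1", "Term 2"}, "current": ""}

try:
    json.dumps(state)  # fails: sets are not JSON-serializable
except TypeError as e:
    print(f"not serializable: {e}")

state["parts"] = sorted(state["parts"])  # sorted() gives deterministic output
print(json.dumps(state))
```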
It uses `DocMergePrompter` and `DocMergeParser`.
## Output Directory Structure
Each benchmark run produces the following directory structure:
```
results/
    {lm_name}_{method1}-{method2}-..._{timestamp}/
        config.json    # run configuration
        log.log        # debug-level execution log
        io/
            0.json     # results for sample 0
            1.json     # results for sample 1
            ...
        cot/
            0.json
            1.json
            ...
        got/
            0.json
            1.json
            ...
```
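A hedged sketch of walking this layout after a run, mapping each method subdirectory to its number of per-sample JSON files (`count_results` is a hypothetical helper, not part of the repository):

```python
import os


def count_results(results_folder):
    """Map each method subdirectory to its number of per-sample JSON files."""
    counts = {}
    for entry in sorted(os.listdir(results_folder)):
        subdir = os.path.join(results_folder, entry)
        if os.path.isdir(subdir):
            counts[entry] = sum(f.endswith(".json") for f in os.listdir(subdir))
    return counts
```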
## Usage Example
```python
if __name__ == "__main__":
    budget = 30  # dollars
    samples = list(range(0, 100))
    approaches = [io, cot, tot, tot2, got]
    spent = run(samples, approaches, budget, "chatgpt")
    logging.info(f"Spent {spent} out of {budget} budget.")
```