
Implementation:Spcl Graph of thoughts Run Benchmark

From Leeroopedia
  • Pattern Name: Run_Benchmark
  • Type: Implementation (Pattern Doc)
  • Repository: spcl/graph-of-thoughts
  • Source Files: examples/sorting/sorting_032.py (L601-713), examples/keyword_counting/keyword_counting.py (L1321-1439), examples/doc_merge/doc_merge.py (L636-752)
  • Domains: Benchmarking, Evaluation
  • Related Principle: Principle:Spcl_Graph_of_thoughts_Benchmark_Execution_Pattern

Overview

This is a Pattern Doc that documents the standardized run() function pattern used across all three GoT examples (sorting, keyword counting, and document merging). Each example independently implements a run() function that follows the same benchmark execution flow. This is not a library API; it is a user-defined convention.

Source References

  • Sorting (32 elements): examples/sorting/sorting_032.py, L601-713; methods benchmarked: io, cot, tot, tot2, got
  • Keyword Counting: examples/keyword_counting/keyword_counting.py, L1321-1439; methods benchmarked: io, cot, tot, tot2, got4, got8, gotx
  • Document Merging: examples/doc_merge/doc_merge.py, L636-752; methods benchmarked: io, cot, tot, got, got2

Common Signature

All three run() functions share the same signature:

def run(
    data_ids: List[int],
    methods: List[Callable[[], operations.GraphOfOperations]],
    budget: float,
    lm_name: str,
) -> float

Parameters

  • data_ids (List[int]): Indices of the dataset samples to run. If empty or None, all samples are used.
  • methods (List[Callable]): Topology builder functions; each returns a GraphOfOperations.
  • budget (float): Maximum LLM spending limit in dollars. Execution halts when the budget is depleted.
  • lm_name (str): Name of the language model to use (e.g., "chatgpt").
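A topology builder matching the methods signature can be sketched as follows. The GraphOfOperations class below is a minimal stand-in for the library's operations.GraphOfOperations, written only to keep the example self-contained:

```python
from typing import Callable, List

class GraphOfOperations:
    """Minimal stand-in for operations.GraphOfOperations (illustration only)."""
    def __init__(self) -> None:
        self.operations: list = []

    def append_operation(self, op) -> None:
        self.operations.append(op)

def io() -> GraphOfOperations:
    # Simplest topology: a single direct-prompt step, no branching.
    graph = GraphOfOperations()
    graph.append_operation("generate")
    return graph

# Each call builds a fresh graph; the function's __name__ later names
# the per-method results subdirectory.
methods: List[Callable[[], GraphOfOperations]] = [io]
```

Passing builder functions rather than pre-built graphs is what lets the runner construct a fresh topology for every sample/method pair.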

Returns

  • float: Total amount spent in dollars across all method/sample executions.

I/O

  • Input: data_ids (list of sample indices), methods (list of topology builder callables), budget (float, dollars), lm_name (string).
  • Output: Per-sample JSON files (one per sample per method) in a timestamped results directory, plus a config.json and a log.log. Returns total cost as a float.

Canonical Pattern

The following pseudocode captures the common structure across all three examples:

def run(data_ids, methods, budget, lm_name):
    orig_budget = budget

    # 1. Load dataset from CSV
    data = load_csv(dataset_path)
    selected_data = [data[i] for i in data_ids]

    # 2. Create timestamped results directory
    results_folder = create_results_folder(lm_name, methods)
    save_config(results_folder, selected_data, methods, lm_name, budget)
    setup_logging(results_folder)

    # 3. Create per-method subdirectories
    for method in methods:
        os.makedirs(os.path.join(results_folder, method.__name__))

    # 4. Core benchmark loop
    for data in selected_data:
        if budget <= 0.0:
            break
        for method in methods:
            if budget <= 0.0:
                break

            # Fresh LM instance per execution
            lm = language_models.ChatGPT(config_path, model_name=lm_name, cache=True)

            # Build the topology
            operations_graph = method()

            # Create controller with domain-specific prompter/parser
            executor = controller.Controller(
                lm,
                operations_graph,
                DomainPrompter(),
                DomainParser(),
                {  # initial problem state
                    "original": data[1],
                    "current": "",
                    "phase": 0,
                    "method": method.__name__,
                },
            )

            # Execute
            try:
                executor.run()
            except Exception as e:
                logging.error(f"Exception: {e}")

            # Serialize results
            path = os.path.join(results_folder, method.__name__, f"{data[0]}.json")
            executor.output_graph(path)

            # Track cost
            budget -= lm.cost

    return orig_budget - budget
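The budget-guarded double loop can be exercised in isolation with a stub LM. StubLM and its fixed per-run cost are assumptions for illustration, standing in for language_models.ChatGPT:

```python
class StubLM:
    """Stand-in for language_models.ChatGPT with a fixed per-run cost."""
    def __init__(self, cost_per_run: float = 0.5) -> None:
        self.cost = cost_per_run

def run_stub(data_ids, methods, budget):
    orig_budget = budget
    for _ in data_ids:
        if budget <= 0.0:
            break                      # stop starting new samples
        for method in methods:
            if budget <= 0.0:
                break                  # stop starting new methods
            lm = StubLM()
            method()                   # build (and nominally execute) the topology
            budget -= lm.cost          # charge this execution against the budget
    return orig_budget - budget        # total spent

# 10 samples x 2 methods at $0.50 each would cost $10.00,
# but a $3.00 budget cuts the run off after 3 full samples.
spent = run_stub(list(range(10)), [lambda: None, lambda: None], 3.0)
```

Note that the check happens before each execution starts, so a run already in flight when the budget hits zero still completes; the budget bounds when new work begins, not the exact total spend.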

Example-Specific Differences

While all three examples follow the same core pattern, there are minor domain-specific differences:

Sorting

# Initial problem state for sorting
{
    "original": data[1],   # unsorted list as string, e.g. "[3, 7, 0, 2, ...]"
    "current": "",
    "phase": 0,
    "method": method.__name__,
}

The sorting example uses SortingPrompter and SortingParser. The dataset CSV contains: [id, unsorted_list, sorted_list].
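The row layout and the resulting initial state can be sketched with an in-memory CSV. The two rows below are made-up data in the stated [id, unsorted_list, sorted_list] shape, not entries from the real dataset:

```python
import csv
import io as io_module  # aliased so the name doesn't clash with an io() method

# Hypothetical rows in the [id, unsorted_list, sorted_list] layout.
csv_text = '0,"[3, 1, 2]","[1, 2, 3]"\n1,"[9, 7]","[7, 9]"\n'
rows = list(csv.reader(io_module.StringIO(csv_text)))

row = rows[0]
initial_state = {
    "original": row[1],   # unsorted list as a string
    "current": "",
    "phase": 0,
    "method": "io",
}
```

The id in column 0 is what the canonical pattern uses for the per-sample output filename ({data[0]}.json).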

Keyword Counting

# Initial problem state for keyword counting
{
    "original": data[1],       # input text
    "ground_truth": data[2],   # correct frequency list
    "current": "",
    "phase": 0,
    "method": method.__name__,
}

The keyword counting example additionally computes all_potential_countries from the dataset and passes it to each method function as a parameter. It uses KeywordCountingPrompter and KeywordCountingParser.
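Since the common signature expects zero-argument callables while these methods take all_potential_countries, one way to reconcile the two is a small binding wrapper. The bind helper, the got4 stub, and the country list below are hypothetical, written only to illustrate the adaptation:

```python
from functools import wraps

all_potential_countries = ["Canada", "Peru", "Japan"]  # made-up example list

def got4(countries):
    """Stub topology builder specialized to a country list."""
    return {"topology": "got4", "countries": countries}

def bind(method, countries):
    # functools.wraps keeps __name__ intact, which the runner needs
    # when creating the per-method results subdirectory.
    @wraps(method)
    def wrapper():
        return method(countries)
    return wrapper

methods = [bind(got4, all_potential_countries)]
```

A plain functools.partial would not work here, because partial objects have no __name__ attribute for the runner to read.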

Document Merging

# Initial problem state for document merging
{
    "documents": [data[2], data[3], data[4], data[5]],  # 4 NDA documents
    "parts": set(),
    "current": "",
    "method": method.__name__,
}

The document merging example passes four NDA documents as input. It also post-processes the thought states to convert set objects to lists before JSON serialization, since Python sets are not JSON-serializable:

for operation in operations_graph.operations:
    for thought in operation.thoughts:
        thought.state["parts"] = list(thought.state["parts"])
executor.output_graph(path)

It uses DocMergePrompter and DocMergeParser.
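The need for this conversion is easy to demonstrate: json.dumps raises TypeError on a set, so "parts" must become a list first. The state dict below is a toy example, not real benchmark output:

```python
import json

state = {"parts": {"intro", "liability"}, "current": "merged text"}

try:
    json.dumps(state)
except TypeError:
    # Sets are not JSON-serializable; sorting also gives a deterministic order.
    state["parts"] = sorted(state["parts"])

serialized = json.dumps(state)
```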

Output Directory Structure

Each benchmark run produces the following directory structure:

results/
  {lm_name}_{method1}-{method2}-..._{timestamp}/
    config.json          # run configuration
    log.log              # debug-level execution log
    io/
      0.json             # results for sample 0
      1.json             # results for sample 1
      ...
    cot/
      0.json
      1.json
      ...
    got/
      0.json
      1.json
      ...
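The directory naming can be sketched as follows. create_results_folder is a hypothetical helper mirroring the {lm_name}_{method1}-{method2}-..._{timestamp} convention; the real examples inline this logic in run():

```python
import datetime
import os
import tempfile

def create_results_folder(base_dir, lm_name, methods):
    # Folder name: {lm_name}_{method1}-{method2}-..._{timestamp}
    timestamp = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    names = "-".join(m.__name__ for m in methods)
    folder = os.path.join(base_dir, f"{lm_name}_{names}_{timestamp}")
    for m in methods:
        os.makedirs(os.path.join(folder, m.__name__))  # per-method subdirs
    return folder

def io(): ...
def cot(): ...

base = tempfile.mkdtemp()
folder = create_results_folder(base, "chatgpt", [io, cot])
```

The timestamp keeps repeated runs from overwriting each other, and embedding the method names makes a results folder self-describing.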

Usage Example

if __name__ == "__main__":
    budget = 30  # dollars
    samples = list(range(0, 100))
    approaches = [io, cot, tot, tot2, got]

    spent = run(samples, approaches, budget, "chatgpt")
    logging.info(f"Spent {spent} out of {budget} budget.")

Related

  • Principle:Spcl_Graph_of_thoughts_Benchmark_Execution_Pattern