Heuristic: SPCL Graph of Thoughts Budget-Gated Benchmark Execution
| Knowledge Sources | |
|---|---|
| Domains | LLM_Reasoning, Optimization |
| Last Updated | 2026-02-14 03:30 GMT |
Overview
A cost-control pattern that checks the remaining API budget before each sample and each method execution, stopping early once the budget is depleted.
Description
The benchmark execution pattern (`run` function) implements a budget gate: before processing each data sample and each method within that sample, it checks whether the remaining dollar budget is positive. If the budget is depleted, execution stops with an error log, preventing runaway API costs. The budget is tracked by accumulating the `lm.cost` property from each ChatGPT instance, which computes cost from prompt and completion token counts multiplied by configured per-thousand-token prices.
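A minimal, self-contained sketch of this loop structure (`FakeLM`, `run`, and `io_method` below are hypothetical stand-ins, not the repository's actual objects; the real implementation is the `run` function in `examples/sorting/sorting_032.py`):

```python
# Sketch of a budget-gated benchmark loop: gate before each sample and
# before each method, deduct actual cost after each run.

class FakeLM:
    """Stub LM that charges a fixed dollar cost per query."""
    def __init__(self, cost_per_run):
        self.cost = 0.0
        self._cost_per_run = cost_per_run

    def query(self, prompt):
        self.cost += self._cost_per_run  # pretend each query costs money
        return "response"

def io_method(sample, lm):
    """Stand-in for a benchmark method (IO, CoT, ToT, GoT...)."""
    lm.query(f"sort {sample}")

def run(data, methods, budget, cost_per_run=1.0):
    """Run each method on each sample until the dollar budget is depleted."""
    completed = []
    for sample in data:
        if budget <= 0.0:
            break  # gate before starting a new sample
        for method in methods:
            if budget <= 0.0:
                break  # gate before each method within the sample
            lm = FakeLM(cost_per_run)  # fresh LM: per-run cost starts at 0
            method(sample, lm)
            budget -= lm.cost          # deduct the actual cost of this run
            completed.append((sample, method.__name__))
    return completed, budget
```

Note that the gate checks *before* each unit but deducts *after*, so the last run may push the budget slightly negative; the limit is a soft cap, not a hard ceiling.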
Usage
Use this heuristic when running multi-sample benchmark experiments with paid LLM APIs. It is essential for:
- Running 100+ samples across multiple methods (IO, CoT, ToT, GoT)
- Experiments where total cost is uncertain ahead of time
- Preventing accidental overspending during development and testing
The Insight (Rule of Thumb)
- Action: Set a dollar `budget` limit and check it before each execution unit. Deduct actual cost after each run.
- Value: Default budget is $30 for the sorting benchmark (100 samples x 5 methods).
- Trade-off: Some samples/methods may not be executed if the budget runs out. Results will be incomplete but costs are controlled.
- Pattern: Instantiate a fresh LM per method-sample pair to get accurate per-run cost tracking.
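The fresh-instance pattern matters because ChatGPT-style wrappers accumulate token counts, making `lm.cost` a running total rather than a per-run figure. A sketch of the difference (`CumulativeLM` is a hypothetical stand-in for such a wrapper):

```python
# Reusing one LM instance would deduct the cumulative total after every
# run, over-charging the budget; a fresh instance isolates each run's cost.

class CumulativeLM:
    """Stub wrapper whose .cost accumulates across queries, like ChatGPT's."""
    def __init__(self, cost_per_query=0.5):
        self.cost = 0.0
        self._per_query = cost_per_query

    def query(self, prompt):
        self.cost += self._per_query
        return "response"

# Wrong: one shared instance, so each deduction includes all prior runs.
shared = CumulativeLM()
wrong_deductions = []
for _ in range(3):
    shared.query("...")
    wrong_deductions.append(shared.cost)   # grows: 0.5, 1.0, 1.5

# Right: fresh instance per run, so each deduction is that run's cost only.
right_deductions = []
for _ in range(3):
    lm = CumulativeLM()
    lm.query("...")
    right_deductions.append(lm.cost)       # flat: 0.5 each time
```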
Reasoning
LLM API costs scale linearly with the number of tokens processed. GoT methods use significantly more tokens than IO/CoT approaches (multiple generations, scores, aggregations per sample). Without budget gating:
- A GoT benchmark on 100 samples could cost $50-100+
- A bug in prompt design could cause a tight loop generating unlimited tokens
- Debugging runs during development would accumulate unexpected costs
The budget gate ensures experiments are self-limiting and costs are predictable.
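A back-of-envelope estimate illustrates the scale gap. The per-1K prices and token counts below are illustrative assumptions, not values taken from the repository's configuration:

```python
# Rough per-run cost model: dollars = tokens/1000 * price-per-1K-tokens.

def run_cost(prompt_tokens, completion_tokens,
             prompt_price_per_1k, completion_price_per_1k):
    return (prompt_tokens / 1000.0) * prompt_price_per_1k \
         + (completion_tokens / 1000.0) * completion_price_per_1k

# One IO run: a single prompt/response pair (assumed token counts).
io_run = run_cost(500, 200, 0.03, 0.06)

# One GoT run: many generate/score/aggregate calls per sample, so
# an order of magnitude more tokens (assumed counts).
got_run = run_cost(20_000, 8_000, 0.03, 0.06)

print(f"IO per sample:      ${io_run:.4f}")
print(f"GoT per sample:     ${got_run:.4f}")
print(f"GoT x 100 samples:  ${got_run * 100:.2f}")
```

Under these assumed numbers a 100-sample GoT run lands around the $50-100+ range cited above, well past a $30 budget, which is exactly why the gate exists.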
Code Evidence
Budget check before each sample from `examples/sorting/sorting_032.py:667-670`:
```python
if budget <= 0.0:
    logging.error(
        f"Budget has been depleted, stopping. Data {data[0]} has not been run."
    )
    break
```
Budget check before each method from `examples/sorting/sorting_032.py:675-679`:
```python
if budget <= 0.0:
    logging.error(
        f"Budget has been depleted, stopping. Method {method.__name__} has not been run."
    )
    break
```
Cost deduction after each run from `examples/sorting/sorting_032.py:711`:
```python
budget -= lm.cost
```
Cost tracking in ChatGPT from `graph_of_thoughts/language_models/chatgpt.py:126-133`:
```python
self.prompt_tokens += response.usage.prompt_tokens
self.completion_tokens += response.usage.completion_tokens
prompt_tokens_k = float(self.prompt_tokens) / 1000.0
completion_tokens_k = float(self.completion_tokens) / 1000.0
self.cost = (
    self.prompt_token_cost * prompt_tokens_k
    + self.response_token_cost * completion_tokens_k
)
```
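That accumulation logic can be exercised in isolation with a mock response object standing in for the OpenAI client (class and prices below are illustrative assumptions, not the repository's code):

```python
# Replica of the cost-accumulation mechanics: token counts accumulate
# across responses, and .cost is recomputed as a cumulative total.

from types import SimpleNamespace

class CostTracker:
    def __init__(self, prompt_token_cost, response_token_cost):
        self.prompt_tokens = 0
        self.completion_tokens = 0
        self.prompt_token_cost = prompt_token_cost      # $ per 1K prompt tokens
        self.response_token_cost = response_token_cost  # $ per 1K completion tokens
        self.cost = 0.0

    def record(self, response):
        self.prompt_tokens += response.usage.prompt_tokens
        self.completion_tokens += response.usage.completion_tokens
        self.cost = (
            self.prompt_token_cost * self.prompt_tokens / 1000.0
            + self.response_token_cost * self.completion_tokens / 1000.0
        )

tracker = CostTracker(prompt_token_cost=0.03, response_token_cost=0.06)
fake = SimpleNamespace(usage=SimpleNamespace(prompt_tokens=1000,
                                             completion_tokens=500))
tracker.record(fake)   # cumulative total now ~ $0.06
tracker.record(fake)   # doubles to ~ $0.12: .cost is cumulative, not per-call
```

The cumulative behavior is what makes the fresh-instance-per-run pattern above necessary for accurate per-run deductions.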
Default budget from `examples/sorting/sorting_032.py:726`:
```python
budget = 30
```