# Heuristic: Princeton NLP Tree of Thoughts LLM Value Caching
| Knowledge Sources | |
|---|---|
| Domains | Optimization, LLM_Reasoning |
| Last Updated | 2026-02-14 04:00 GMT |
## Overview
Caching technique that stores LLM value evaluations in `task.value_cache` to avoid redundant API calls for previously evaluated prompts.
## Description
When evaluating candidate thoughts in the ToT BFS loop, the same value prompt may be generated multiple times across different search steps (e.g., when the same partial solution appears in multiple branches). The `value_cache` dictionary, stored on each Task object, maps value prompt strings to their computed scores. Before making an LLM call, `get_value()` checks this cache and returns the stored result if available, completely skipping the API call.
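The lookup pattern described above can be sketched in a few lines. This is a minimal illustration, not the repo's actual API: `fake_llm` and its dummy score are stand-ins for the real model call.

```python
calls = []  # records prompts actually "sent" to the model


def fake_llm(prompt):
    """Stand-in for an LLM call; logs each invocation."""
    calls.append(prompt)
    return len(prompt)  # dummy score


class Task:
    def __init__(self):
        self.value_cache = {}  # maps value prompt string -> cached score


def get_value(task, prompt, cache_value=True):
    # Return the cached score if this exact prompt was evaluated before.
    if cache_value and prompt in task.value_cache:
        return task.value_cache[prompt]
    value = fake_llm(prompt)
    if cache_value:
        task.value_cache[prompt] = value
    return value


task = Task()
get_value(task, "evaluate: 4 6 -> 24")
get_value(task, "evaluate: 4 6 -> 24")  # cache hit: no second LLM call
assert len(calls) == 1
```

Because the cache key is the full prompt string, any difference in the prompt (even whitespace) produces a fresh evaluation.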
## Usage
Use this heuristic whenever running ToT BFS experiments with value-based evaluation (`--method_evaluate value`). The cache is enabled by default (`cache_value=True`) and is especially beneficial on tasks with many search steps like Game of 24 (4 steps) or Crosswords (10 steps), where the same intermediate states may recur across different candidate branches.
## The Insight (Rule of Thumb)
- Action: Initialize `self.value_cache = {}` on the Task object; pass `cache_value=True` (the default) to `get_value()`.
- Value: Eliminates duplicate LLM calls entirely. Cost savings proportional to the overlap in explored states.
- Trade-off: Assumes LLM responses are deterministic for the same prompt at a given temperature. This is approximately true but not guaranteed. Memory usage grows with the number of unique prompts.
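A back-of-envelope estimate of the cost savings, using the GPT-4 pricing cited in the Reasoning section below. The token counts, call count, and hit rate here are illustrative assumptions, not measurements:

```python
# Rough cost model for value-evaluation calls. Assumptions: 800 prompt
# tokens and 100 completion tokens per call, GPT-4 pricing of
# $0.03 / $0.06 per 1K prompt / completion tokens.
def eval_cost(n_calls, prompt_tok=800, completion_tok=100):
    per_call = prompt_tok / 1000 * 0.03 + completion_tok / 1000 * 0.06
    return n_calls * per_call


total_evals = 1000  # value prompts generated during a BFS run (assumed)
hit_rate = 0.3      # fraction that repeat an earlier prompt (assumed)
saved = eval_cost(total_evals * hit_rate)
print(f"saved ~${saved:.2f} of ${eval_cost(total_evals):.2f}")
# prints: saved ~$9.00 of $30.00
```

The savings scale linearly with the hit rate, which is why wide searches (large `n_select_sample`) with many recurring states benefit most.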
## Reasoning
Each LLM API call costs money (GPT-4 at $0.03/$0.06 per 1K prompt/completion tokens) and adds latency. In tree search, the same partial solution can appear at the same step across different branches, especially when `n_select_sample` is large. By caching at the prompt level (keying on the full value prompt string), the framework guarantees that identical evaluation queries are never sent twice. Note that value evaluation samples at temperature 0.7 by default, so individual responses are stochastic; what gets cached is not any single sample but the aggregated value (e.g., a weighted count of "sure"/"likely"/"impossible" over n samples), which is the quantity the search actually consumes.
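As a concrete illustration of caching the aggregate rather than individual samples, the sketch below maps each sampled judgment to a weight and sums them. The label-to-weight mapping is an assumption for illustration, not necessarily the repo's exact values:

```python
from collections import Counter

# Assumed weights: confident positives dominate, impossible is near-zero.
WEIGHTS = {"sure": 20.0, "likely": 1.0, "impossible": 0.001}


def aggregate(value_outputs):
    """Collapse n sampled judgments into one scalar value.

    Each output is assumed to end with its judgment on the last line.
    """
    names = [out.split("\n")[-1].strip() for out in value_outputs]
    return sum(WEIGHTS.get(name, 0.0) * cnt for name, cnt in Counter(names).items())


samples = ["...\nsure", "...\nlikely", "...\nlikely"]
aggregate(samples)  # 20 + 1 + 1 = 22.0
```

It is this single scalar (here 22.0) that lands in `value_cache`, so a cache hit replays the whole n-sample evaluation, not one draw.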
## Code Evidence
Value cache check from `src/tot/methods/bfs.py:6-14`:
```python
def get_value(task, x, y, n_evaluate_sample, cache_value=True):
    value_prompt = task.value_prompt_wrap(x, y)
    if cache_value and value_prompt in task.value_cache:
        return task.value_cache[value_prompt]
    value_outputs = gpt(value_prompt, n=n_evaluate_sample, stop=None)
    value = task.value_outputs_unwrap(x, y, value_outputs)
    if cache_value:
        task.value_cache[value_prompt] = value
    return value
```
Cache initialization from `src/tot/tasks/game24.py:34`:
```python
self.value_cache = {}
```