# Heuristic: Princeton NLP Tree of Thoughts LLM Value Caching
| Knowledge Sources | |
|---|---|
| Domains | Optimization, LLM_Reasoning |
| Last Updated | 2026-02-14 04:00 GMT |
## Overview
Caching technique that stores LLM value evaluations in `task.value_cache` to avoid redundant API calls for previously evaluated prompts.
## Description
When evaluating candidate thoughts in the ToT BFS loop, the same value prompt may be generated multiple times across different search steps (e.g., when the same partial solution appears in multiple branches). The `value_cache` dictionary, stored on each Task object, maps value prompt strings to their computed scores. Before making an LLM call, `get_value()` checks this cache and returns the stored result if available, completely skipping the API call.
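The lookup pattern described above can be sketched in a few lines. This is a minimal illustration, not the repo's actual API: `fake_llm` and its dummy score are stand-ins for the real model call.

```python
calls = []  # records prompts actually "sent" to the model


def fake_llm(prompt):
    """Stand-in for an LLM call; logs each invocation."""
    calls.append(prompt)
    return len(prompt)  # dummy score


class Task:
    def __init__(self):
        self.value_cache = {}  # maps value prompt string -> cached score


def get_value(task, prompt, cache_value=True):
    # Return the cached score if this exact prompt was evaluated before.
    if cache_value and prompt in task.value_cache:
        return task.value_cache[prompt]
    value = fake_llm(prompt)
    if cache_value:
        task.value_cache[prompt] = value
    return value


task = Task()
get_value(task, "evaluate: 4 6 -> 24")
get_value(task, "evaluate: 4 6 -> 24")  # cache hit: no second LLM call
assert len(calls) == 1
```

Because the cache key is the full prompt string, any difference in the prompt (even whitespace) produces a fresh evaluation.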
## Usage
Use this heuristic whenever running ToT BFS experiments with value-based evaluation (`--method_evaluate value`). The cache is enabled by default (`cache_value=True`) and is especially beneficial on tasks with many search steps like Game of 24 (4 steps) or Crosswords (10 steps), where the same intermediate states may recur across different candidate branches.
## The Insight (Rule of Thumb)
- Action: Initialize `self.value_cache = {}` on the Task object; pass `cache_value=True` (the default) to `get_value()`.
- Value: Eliminates duplicate LLM calls entirely. Cost savings proportional to the overlap in explored states.
- Trade-off: Assumes LLM responses are deterministic for the same prompt at a given temperature. This is approximately true but not guaranteed. Memory usage grows with the number of unique prompts.
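A back-of-envelope estimate of the cost savings, using the GPT-4 pricing cited in the Reasoning section below. The token counts, call count, and hit rate here are illustrative assumptions, not measurements:

```python
# Rough cost model for value-evaluation calls. Assumptions: 800 prompt
# tokens and 100 completion tokens per call, GPT-4 pricing of
# $0.03 / $0.06 per 1K prompt / completion tokens.
def eval_cost(n_calls, prompt_tok=800, completion_tok=100):
    per_call = prompt_tok / 1000 * 0.03 + completion_tok / 1000 * 0.06
    return n_calls * per_call


total_evals = 1000  # value prompts generated during a BFS run (assumed)
hit_rate = 0.3      # fraction that repeat an earlier prompt (assumed)
saved = eval_cost(total_evals * hit_rate)
print(f"saved ~${saved:.2f} of ${eval_cost(total_evals):.2f}")
# prints: saved ~$9.00 of $30.00
```

The savings scale linearly with the hit rate, which is why wide searches (large `n_select_sample`) with many recurring states benefit most.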
## Reasoning
Each LLM API call costs money (GPT-4 at $0.03/$0.06 per 1K prompt/completion tokens) and adds latency. In tree search, the same partial solution can appear at the same step across different branches, especially when `n_select_sample` is large. By caching at the prompt level (keying on the full value prompt string), the framework guarantees that identical evaluation queries are never sent twice. Note that value evaluation samples at temperature 0.7 by default, so individual responses are stochastic; what gets cached is not any single sample but the aggregated value (e.g., a weighted count of "sure"/"likely"/"impossible" over n samples), which is the quantity the search actually consumes.
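As a concrete illustration of caching the aggregate rather than individual samples, the sketch below maps each sampled judgment to a weight and sums them. The label-to-weight mapping is an assumption for illustration, not necessarily the repo's exact values:

```python
from collections import Counter

# Assumed weights: confident positives dominate, impossible is near-zero.
WEIGHTS = {"sure": 20.0, "likely": 1.0, "impossible": 0.001}


def aggregate(value_outputs):
    """Collapse n sampled judgments into one scalar value.

    Each output is assumed to end with its judgment on the last line.
    """
    names = [out.split("\n")[-1].strip() for out in value_outputs]
    return sum(WEIGHTS.get(name, 0.0) * cnt for name, cnt in Counter(names).items())


samples = ["...\nsure", "...\nlikely", "...\nlikely"]
aggregate(samples)  # 20 + 1 + 1 = 22.0
```

It is this single scalar (here 22.0) that lands in `value_cache`, so a cache hit replays the whole n-sample evaluation, not one draw.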
## Code Evidence
Value cache check from `src/tot/methods/bfs.py:6-14`:
```python
def get_value(task, x, y, n_evaluate_sample, cache_value=True):
    value_prompt = task.value_prompt_wrap(x, y)
    if cache_value and value_prompt in task.value_cache:
        return task.value_cache[value_prompt]
    value_outputs = gpt(value_prompt, n=n_evaluate_sample, stop=None)
    value = task.value_outputs_unwrap(x, y, value_outputs)
    if cache_value:
        task.value_cache[value_prompt] = value
    return value
```
Cache initialization from `src/tot/tasks/game24.py:34`:
```python
self.value_cache = {}
```