Heuristic: Princeton NLP Tree of Thoughts LLM API Request Batching
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Optimization |
| Last Updated | 2026-02-14 04:00 GMT |
Overview
Splits large OpenAI API requests into sequential batches of at most 20 completions to stay within practical per-request limits while fulfilling arbitrarily large sample counts.
Description
The OpenAI ChatCompletion API has a practical limit on the n parameter (number of completions per request). The chatgpt() function in the framework handles this transparently by splitting any request with n > 20 into sequential batches of at most 20. The results are accumulated and returned as a single flat list, making the batching invisible to callers.
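Because the batching happens inside `chatgpt()`, a caller never observes it. A minimal illustration of the caller-side view (the prompt text and sample count are hypothetical; the call signature follows the code evidence below):

```python
from tot.models import chatgpt  # assumes the package layout under src/tot/

messages = [{"role": "user", "content": "Propose one candidate next step."}]

# Requesting 50 completions triggers three underlying API calls (20 + 20 + 10),
# but the caller receives a single flat list of 50 strings.
outputs = chatgpt(messages, n=50)
assert len(outputs) == 50
```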
Usage
This heuristic is automatically applied whenever gpt() or chatgpt() is called with n > 20. This occurs in experiments that require many samples per step, such as baseline IO/CoT sampling with --n_generate_sample 100 or evaluation with high --n_evaluate_sample.
The Insight (Rule of Thumb)
- Action: In the LLM call loop, cap each individual API request at 20 completions with `cnt = min(n, 20)` (see the sketch after this list).
- Value: 20 is a safe batch size that avoids API errors and timeouts for most OpenAI models.
- Trade-off: Sequential batching increases total wall-clock time compared to a single large request (if it were allowed). For n=100, this means 5 sequential API calls instead of 1.
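The same cap-and-loop pattern applies to any sampler with a per-request limit. A minimal, provider-agnostic sketch (the `sample_batch` callable and the default cap of 20 are illustrative, not part of the repository):

```python
from typing import Callable, List

def sample_in_batches(sample_batch: Callable[[int], List[str]], n: int, cap: int = 20) -> List[str]:
    """Collect n samples by repeatedly requesting at most `cap` at a time."""
    outputs: List[str] = []
    while n > 0:
        cnt = min(n, cap)   # cap each individual request
        n -= cnt
        outputs.extend(sample_batch(cnt))
    return outputs

# For n=100 and cap=20 this issues 5 sequential calls of 20 samples each.
```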
Reasoning
The OpenAI API can return errors or timeouts for very large n values in a single request. The batch size of 20 was chosen as a practical ceiling that balances throughput against reliability. Combined with the backoff retry decorator, this ensures that even large sampling experiments complete without manual intervention. The batching is transparent — callers simply pass the desired n and receive a flat list.
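The `completions_with_backoff` helper referenced here and in the code evidence is the retry layer. Its implementation is not quoted in this entry; a plausible sketch using the `backoff` library and the openai>=1.0 client (an assumption, the repository may pin an older SDK) looks like this:

```python
import backoff
import openai

client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment

# Retry with exponential backoff on any OpenAI error (rate limits, timeouts, transient failures).
@backoff.on_exception(backoff.expo, openai.OpenAIError)
def completions_with_backoff(**kwargs):
    return client.chat.completions.create(**kwargs)
```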
Code Evidence
Batching logic from `src/tot/models.py:26-37`:
```python
def chatgpt(messages, model="gpt-4", temperature=0.7, max_tokens=1000, n=1, stop=None) -> list:
    global completion_tokens, prompt_tokens
    outputs = []
    while n > 0:
        cnt = min(n, 20)
        n -= cnt
        res = completions_with_backoff(model=model, messages=messages, temperature=temperature, max_tokens=max_tokens, n=cnt, stop=stop)
        outputs.extend([choice.message.content for choice in res.choices])
        # log completion tokens
        completion_tokens += res.usage.completion_tokens
        prompt_tokens += res.usage.prompt_tokens
    return outputs
```
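For context, the `gpt()` wrapper mentioned under Usage wraps the prompt into a single user message and delegates to `chatgpt()`, so it inherits the batching. A sketch of that wrapper, reconstructed from the same module (exact defaults and signature may differ):

```python
def gpt(prompt, model="gpt-4", temperature=0.7, max_tokens=1000, n=1, stop=None) -> list:
    # Wrap the plain prompt as a single user message and reuse the batched chatgpt() path.
    messages = [{"role": "user", "content": prompt}]
    return chatgpt(messages, model=model, temperature=temperature, max_tokens=max_tokens, n=n, stop=stop)
```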