# Heuristic: BerriAI LiteLLM Token Counting Buffer
| Knowledge Sources | |
|---|---|
| Domains | LLM_Gateway, Optimization |
| Last Updated | 2026-02-15 16:00 GMT |
## Overview
A token-counting safety buffer of `max(0.1 × input_tokens, 10)` tokens that compensates for imprecise client-side token estimation across different providers and tokenizers.
## Description
When LiteLLM automatically calculates `max_tokens` for a request (to fit within a model's context window), it adds a safety buffer to the estimated input token count. This buffer compensates for the inherent imprecision of client-side token counting, which can differ from provider-side counting due to different tokenizer versions, special token handling, and multimodal content. The buffer uses a max-of-two strategy: either 10% of the input tokens or a flat 10-token minimum, whichever is larger.
## Usage
This heuristic is automatically applied when `litellm.trim_messages()` or automatic `max_tokens` calculation is used. Understanding it helps when debugging cases where the actual token count differs from the estimated count, or when fine-tuning the `buffer_perc` and `buffer_num` parameters.
## The Insight (Rule of Thumb)
- Action: When calculating available output tokens, add a buffer to the estimated input token count.
- Value: `buffer = max(0.1 * input_tokens, 10)`
  - For a 1000-token input: `buffer = max(100, 10) = 100` tokens
  - For a 50-token input: `buffer = max(5, 10) = 10` tokens
- Trade-off: The buffer slightly reduces the available output token budget but prevents context window overflow errors. A 10% buffer is conservative enough to avoid wasting tokens while large enough to cover tokenizer discrepancies.
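The rule above can be sketched as a standalone function (the name `token_buffer` is illustrative, not LiteLLM's internal API):

```python
def token_buffer(input_tokens: int, buffer_perc: float = 0.1, buffer_num: int = 10) -> int:
    """Safety buffer: the larger of a percentage of input tokens and a flat floor."""
    return int(max(buffer_perc * input_tokens, buffer_num))

print(token_buffer(1000))  # 100 -- the 10% term dominates
print(token_buffer(50))    # 10  -- the flat floor dominates
```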
## Reasoning
Token counting is imprecise for several reasons:
- Tokenizer mismatch: The tiktoken tokenizer used client-side may differ from the provider's tokenizer (especially for non-OpenAI models).
- Special tokens: System prompts, function calling schemas, and tool definitions add tokens that are hard to count precisely.
- Image tokens: Image content in multimodal messages uses a tile-based estimation that can vary from actual usage.
- Provider overhead: Some providers add formatting tokens (e.g., Anthropic's `\n\nHuman:` and `\n\nAssistant:` prefixes).
The max-of-two approach ensures small messages still get at least 10 tokens of buffer (where 10% would be negligible), while large messages get a proportional buffer.
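To see how the buffer feeds into the auto `max_tokens` calculation, here is a minimal sketch (function name and structure are illustrative assumptions, not LiteLLM's actual code path): the buffered input count is subtracted from the context window to get the output budget.

```python
def available_output_tokens(input_tokens: int, context_window: int,
                            buffer_perc: float = 0.1, buffer_num: int = 10) -> int:
    """Output token budget after reserving the buffered input count."""
    buffered = input_tokens + int(max(buffer_perc * input_tokens, buffer_num))
    return max(context_window - buffered, 0)

# 8192-token window, 1000-token prompt: 8192 - (1000 + 100) = 7092 tokens for output
print(available_output_tokens(1000, 8192))  # 7092
```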
## Code Evidence
Token buffer calculation from `litellm/litellm_core_utils/token_counter.py:78-87`:
```python
# token buffer
if buffer_perc is None:
    buffer_perc = 0.1
if buffer_num is None:
    buffer_num = 10
token_buffer = max(
    buffer_perc * input_tokens, buffer_num
)  # give at least a 10 token buffer. token counting can be imprecise.
input_tokens += int(token_buffer)
```
Image token estimation from `litellm/constants.py:50,59-60`:
```python
DEFAULT_IMAGE_TOKEN_COUNT = int(os.getenv("DEFAULT_IMAGE_TOKEN_COUNT", 250))
DEFAULT_IMAGE_WIDTH = int(os.getenv("DEFAULT_IMAGE_WIDTH", 300))
DEFAULT_IMAGE_HEIGHT = int(os.getenv("DEFAULT_IMAGE_HEIGHT", 300))
```
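The constants above give a flat fallback for images. A tile-based estimate of the kind mentioned under Reasoning might look like the following sketch; the 512-px tile size and the 85 base / 170 per-tile token figures are assumptions borrowed from OpenAI's published vision pricing, not values confirmed in LiteLLM's source, and other providers count differently.

```python
import math

def estimate_image_tokens(width: int = 300, height: int = 300,
                          base: int = 85, per_tile: int = 170,
                          tile: int = 512) -> int:
    """Tile-based image token estimate (OpenAI-style figures, assumed)."""
    # Number of tile x tile squares needed to cover the image.
    tiles = math.ceil(width / tile) * math.ceil(height / tile)
    return base + per_tile * tiles

# A 300x300 image fits in one tile: 85 + 170 = 255 tokens,
# close to the 250-token DEFAULT_IMAGE_TOKEN_COUNT fallback above.
print(estimate_image_tokens(300, 300))  # 255
```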