# Heuristic: BerriAI LiteLLM Token Counting Buffer
| Knowledge Sources | |
|---|---|
| Domains | LLM_Gateway, Optimization |
| Last Updated | 2026-02-15 16:00 GMT |
## Overview
A token-counting safety buffer of `max(0.1 × input_tokens, 10)` tokens that compensates for imprecise client-side token estimation across different providers and tokenizers.
## Description
When LiteLLM automatically calculates `max_tokens` for a request (to fit within a model's context window), it adds a safety buffer to the estimated input token count. This buffer compensates for the inherent imprecision of client-side token counting, which can differ from provider-side counting due to different tokenizer versions, special token handling, and multimodal content. The buffer uses a max-of-two strategy: either 10% of the input tokens or a flat 10-token minimum, whichever is larger.
## Usage
This heuristic is automatically applied when `litellm.trim_messages()` or automatic `max_tokens` calculation is used. Understanding it helps when debugging cases where the actual token count differs from the estimated count, or when fine-tuning the `buffer_perc` and `buffer_num` parameters.
## The Insight (Rule of Thumb)
- Action: When calculating available output tokens, add a buffer to the estimated input token count.
- Value: `buffer = max(0.1 * input_tokens, 10)`
  - For a 1000-token input: `buffer = max(100, 10) = 100` tokens
  - For a 50-token input: `buffer = max(5, 10) = 10` tokens
- Trade-off: The buffer slightly reduces the available output token budget but prevents context window overflow errors. A 10% buffer is conservative enough to avoid wasting tokens while large enough to cover tokenizer discrepancies.
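The rule above can be sketched as a standalone function (the name `token_buffer` is illustrative, not LiteLLM's internal API):

```python
def token_buffer(input_tokens: int, buffer_perc: float = 0.1, buffer_num: int = 10) -> int:
    """Safety buffer: the larger of a percentage of input tokens and a flat floor."""
    return int(max(buffer_perc * input_tokens, buffer_num))

print(token_buffer(1000))  # 100 -- the 10% term dominates
print(token_buffer(50))    # 10  -- the flat floor dominates
```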
## Reasoning
Token counting is imprecise for several reasons:
- Tokenizer mismatch: The tiktoken tokenizer used client-side may differ from the provider's tokenizer (especially for non-OpenAI models).
- Special tokens: System prompts, function calling schemas, and tool definitions add tokens that are hard to count precisely.
- Image tokens: Image content in multimodal messages uses a tile-based estimation that can vary from actual usage.
- Provider overhead: Some providers add formatting tokens (e.g., Anthropic's `\n\nHuman:` and `\n\nAssistant:` prefixes).
The max-of-two approach ensures small messages still get at least 10 tokens of buffer (where 10% would be negligible), while large messages get a proportional buffer.
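To see how the buffer feeds into the auto `max_tokens` calculation, here is a minimal sketch (function name and structure are illustrative assumptions, not LiteLLM's actual code path): the buffered input count is subtracted from the context window to get the output budget.

```python
def available_output_tokens(input_tokens: int, context_window: int,
                            buffer_perc: float = 0.1, buffer_num: int = 10) -> int:
    """Output token budget after reserving the buffered input count."""
    buffered = input_tokens + int(max(buffer_perc * input_tokens, buffer_num))
    return max(context_window - buffered, 0)

# 8192-token window, 1000-token prompt: 8192 - (1000 + 100) = 7092 tokens for output
print(available_output_tokens(1000, 8192))  # 7092
```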
## Code Evidence
Token buffer calculation from `litellm/litellm_core_utils/token_counter.py:78-87`:
```python
# token buffer
if buffer_perc is None:
    buffer_perc = 0.1
if buffer_num is None:
    buffer_num = 10
token_buffer = max(
    buffer_perc * input_tokens, buffer_num
)  # give at least a 10 token buffer. token counting can be imprecise.
input_tokens += int(token_buffer)
```
Image token estimation from `litellm/constants.py:50,59-60`:
```python
DEFAULT_IMAGE_TOKEN_COUNT = int(os.getenv("DEFAULT_IMAGE_TOKEN_COUNT", 250))
DEFAULT_IMAGE_WIDTH = int(os.getenv("DEFAULT_IMAGE_WIDTH", 300))
DEFAULT_IMAGE_HEIGHT = int(os.getenv("DEFAULT_IMAGE_HEIGHT", 300))
```
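The constants above give a flat fallback for images. A tile-based estimate of the kind mentioned under Reasoning might look like the following sketch; the 512-px tile size and the 85 base / 170 per-tile token figures are assumptions borrowed from OpenAI's published vision pricing, not values confirmed in LiteLLM's source, and other providers count differently.

```python
import math

def estimate_image_tokens(width: int = 300, height: int = 300,
                          base: int = 85, per_tile: int = 170,
                          tile: int = 512) -> int:
    """Tile-based image token estimate (OpenAI-style figures, assumed)."""
    # Number of tile x tile squares needed to cover the image.
    tiles = math.ceil(width / tile) * math.ceil(height / tile)
    return base + per_tile * tiles

# A 300x300 image fits in one tile: 85 + 170 = 255 tokens,
# close to the 250-token DEFAULT_IMAGE_TOKEN_COUNT fallback above.
print(estimate_image_tokens(300, 300))  # 255
```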