# Heuristic: ucbepic/docetl Token Counting and Truncation
| Knowledge Sources | |
|---|---|
| Domains | LLM_Pipelines, Optimization |
| Last Updated | 2026-02-08 01:00 GMT |
## Overview
A token management strategy that combines a 4-characters-per-token approximation, a 100-token safety margin, and middle truncation to stay within context window limits.
## Description
DocETL manages LLM context windows through a multi-layered token management strategy:
- Fast approximation: 4 characters per token for quick estimates without calling a tokenizer
- Precise counting: tiktoken-based counting for actual truncation decisions
- Safety margins: 100-token buffer before context limit, 200-token excess buffer when truncating
- Middle truncation: When content exceeds context, the longest message is truncated from the middle, preserving both the beginning (instructions, context) and end (recent content, questions) of the text
This is critical because exceeding an LLM's context window causes API errors, and naive truncation (from the end) loses important content.
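The strategy above can be sketched end to end. This is a character-level stand-in for a real tokenizer, and the function names (`approx_tokens`, `fit_to_context`) are illustrative, not DocETL's API:

```python
def approx_tokens(text: str) -> int:
    # Fast pre-check: roughly 4 characters per token for English text.
    return len(text) // 4

def fit_to_context(text: str, context_limit: int) -> str:
    """Keep text within context_limit tokens, truncating from the middle."""
    # Safety margin: stay 100 tokens below the hard limit.
    if approx_tokens(text) <= context_limit - 100:
        return text
    # Truncation buffer: remove 200 extra tokens to absorb counting error.
    excess = approx_tokens(text) - context_limit + 200
    chars_to_remove = excess * 4  # convert the token estimate back to characters
    mid = len(text) // 2
    # Cut from the center so both the beginning and the end survive.
    return text[: mid - chars_to_remove // 2] + text[mid + chars_to_remove // 2 :]
```

Note how a truncated result keeps both the start (instructions) and the end (questions) of the original text, which is exactly what naive end-truncation loses.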
## Usage
Use this heuristic when designing prompts for long documents or debugging truncation-related quality issues. If your pipeline processes long documents and output quality is poor, the content may be getting middle-truncated. Consider using the Split-Gather workflow instead.
## The Insight (Rule of Thumb)
- Action 1: For quick estimates, use 4 characters = 1 token approximation.
- Action 2: Reserve 100 tokens below model context limit as safety margin.
- Action 3: When truncating, add 200 extra tokens of buffer to account for token counting inaccuracies.
- Action 4: Truncate from the middle of the longest message, not the end.
- Value: Default context fallback for unknown models: 32,768 tokens. For embedding/blocking operations: 8,192 tokens.
- Trade-off: Middle truncation preserves both document start and end, but may cut important content in the center of long documents.
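As a worked example of Actions 2 and 3 (the numbers are illustrative): with an 8,192-token model limit and messages measuring 8,500 tokens, the margin check fails (8,500 > 8,192 - 100 = 8,092), so the excess to remove is 8,500 - 8,192 + 200 = 508 tokens:

```python
limit, total = 8192, 8500
within_margin = total <= limit - 100  # False: truncation is required
excess = total - limit + 200          # 508 tokens to remove from the middle
```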
## Reasoning
- 4 chars/token: This is a well-known approximation for English text with GPT-family tokenizers. It is inaccurate for non-English text or code, but sufficient for quick pre-checks.
- 100-token margin: Token counting is not perfectly accurate across different tokenizer implementations. The margin prevents edge cases where a message is just barely over the limit.
- 200-token truncation buffer: When we know truncation is needed, an extra buffer ensures the truncated message is safely within bounds.
- Middle truncation: Document beginnings typically contain headers, metadata, and context. Document endings contain conclusions and recent information. The middle is usually the least critical: it can often be dropped without losing the instructions at the start or the conclusions at the end.
- 32K fallback: A safe default that works with most modern models (GPT-4, Claude, etc.).
## Code Evidence
Approximate token counting from `docetl/operations/utils/llm.py:70-72`:

```python
def approx_count_tokens(messages: list[dict[str, str]]) -> int:
    """Approximately 4 characters per token."""
    return int(sum(len(msg["content"]) for msg in messages) / 4)
```
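For instance, applying `approx_count_tokens` to a two-message conversation (the definition is repeated here so the snippet is self-contained; the message contents are made up):

```python
def approx_count_tokens(messages: list[dict[str, str]]) -> int:
    """Approximately 4 characters per token."""
    return int(sum(len(msg["content"]) for msg in messages) / 4)

messages = [
    {"role": "system", "content": "x" * 100},  # 100 chars -> ~25 tokens
    {"role": "user", "content": "y" * 300},    # 300 chars -> ~75 tokens
]
print(approx_count_tokens(messages))  # 100
```

Because it never touches a tokenizer, this estimate is cheap enough to run on every request before deciding whether a precise tiktoken count is needed.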
Safety margin and truncation from `docetl/operations/utils/llm.py:94-104`:

```python
model_input_context_length = model_cost_info.get("max_input_tokens", 32768)
total_tokens = sum(count_tokens(json.dumps(msg), model) for msg in messages)

if total_tokens <= model_input_context_length - 100:
    return messages

truncated_messages = messages.copy()
longest_message = max(truncated_messages, key=lambda x: len(x["content"]))
content = longest_message["content"]
excess_tokens = total_tokens - model_input_context_length + 200
```
Middle truncation from `docetl/operations/utils/llm.py:111-114`:

```python
tokens_to_remove = min(len(encoded_content), excess_tokens)
mid_point = len(encoded_content) // 2
truncated_encoded = (
    encoded_content[: mid_point - tokens_to_remove // 2]
    + encoded_content[mid_point + tokens_to_remove // 2 :]
)
```
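The same slicing can be exercised at the character level (a sketch: DocETL encodes with tiktoken, while here each character stands in for one token, and `truncate_middle` is an illustrative wrapper, not DocETL's API):

```python
def truncate_middle(encoded_content: list, excess_tokens: int) -> list:
    # Mirror of the slicing above: drop the excess from the center,
    # keeping the head (instructions) and tail (recent content).
    tokens_to_remove = min(len(encoded_content), excess_tokens)
    mid_point = len(encoded_content) // 2
    return (
        encoded_content[: mid_point - tokens_to_remove // 2]
        + encoded_content[mid_point + tokens_to_remove // 2 :]
    )

encoded = list("HEAD" + "." * 92 + "TAIL")  # 100 stand-in "tokens"
out = truncate_middle(encoded, 40)          # remove 40 from the middle
```

After truncation, both `HEAD` and `TAIL` survive while 40 center tokens are gone, which is the property the heuristic is designed to preserve.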