Implementation:Ucbepic Docetl Count Tokens
| Knowledge Sources | |
|---|---|
| Domains | NLP, Token_Management |
| Last Updated | 2026-02-08 01:40 GMT |
Overview
A concrete utility, provided by DocETL's search utilities, for estimating token counts in text content.
Description
The count_tokens() function estimates token counts with a simple character-based heuristic: total characters divided by 4. Its companion, ConfigGenerator._generate_chunk_sizes() in the map optimizer, generates a range of recommended chunk sizes from the document length distribution and the model's context limit.
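The heuristic can be illustrated with a minimal sketch. The function name `estimate_tokens`, the use of integer division, and the treatment of a missing "content" key as empty are assumptions made for illustration; this is not the source of count_tokens() itself.

```python
from typing import Any


def estimate_tokens(messages: list[dict[str, Any]]) -> int:
    """Hypothetical stand-in for a characters / 4 token estimate."""
    # Sum the character length of every "content" value.
    # Assumption: a message without a "content" key contributes nothing.
    total_chars = sum(len(str(m.get("content", ""))) for m in messages)
    # Assumption: integer division; the description only says "characters / 4".
    return total_chars // 4


if __name__ == "__main__":
    sample = [{"content": "a" * 40}, {"content": "b" * 8}]
    print(estimate_tokens(sample))  # 48 characters in total
```

An estimate of this kind is cheap because it needs no tokenizer, but it can drift from the true count for code, non-English text, or unusual punctuation.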
Usage
Use count_tokens() for quick, approximate token estimates during optimization. For precise chunk size planning, use ConfigGenerator, which accounts for the full document length distribution.
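To make the idea of deriving chunk sizes from a length distribution concrete, here is a hedged sketch. It is not the algorithm used by ConfigGenerator._generate_chunk_sizes(); the function `candidate_chunk_sizes`, its parameters, and the even-spacing strategy are all hypothetical choices for illustration.

```python
def candidate_chunk_sizes(
    doc_token_lengths: list[int],
    token_limit: int,
    num_sizes: int = 8,
) -> list[int]:
    """Hypothetical: evenly spaced chunk sizes bounded by the data and the model limit."""
    if not doc_token_lengths or token_limit <= 0 or num_sizes <= 0:
        return []
    # Never propose a chunk larger than the longest document or the model limit.
    upper = min(max(doc_token_lengths), token_limit)
    # Start from the shortest document, clamped into the range [1, upper].
    lower = max(1, min(min(doc_token_lengths), upper))
    if num_sizes == 1 or lower == upper:
        return [upper]
    # Spread the candidates evenly between the two bounds.
    step = (upper - lower) / (num_sizes - 1)
    sizes = {int(round(lower + i * step)) for i in range(num_sizes)}
    return sorted(sizes)


if __name__ == "__main__":
    lengths = [500, 1200, 4000]  # estimated tokens per document
    print(candidate_chunk_sizes(lengths, token_limit=2000, num_sizes=4))
```

Bounding the candidates by both the observed lengths and the token limit keeps every proposed size usable for the model while still reflecting how long the documents actually are.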
Code Reference
Source Location
- Repository: docetl
- File: docetl/moar/search_utils.py (L35-41), docetl/optimizers/map_optimizer/config_generators.py (L412-455)
Signature
def count_tokens(messages) -> int:
    """Count estimated tokens in messages list.
    Uses characters / 4 heuristic."""

class ConfigGenerator:
    def _generate_chunk_sizes(
        self,
        split_key: str,
        input_data_sample: list[dict[str, Any]],
        token_limit: int,
        num_chunks: int = 8,
    ) -> list[int]:
        """Generate recommended chunk sizes based on document length distribution."""
Import
from docetl.moar.search_utils import count_tokens
I/O Contract
Inputs
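| Name | Function | Type | Required | Description |
|---|---|---|---|---|
| messages | count_tokens | list[dict] | Yes | Messages with "content" keys to count tokens for |
| split_key | _generate_chunk_sizes | str | Yes | Document field name to measure |
| input_data_sample | _generate_chunk_sizes | list[dict[str, Any]] | Yes | Sample of input documents whose split_key field is measured |
| token_limit | _generate_chunk_sizes | int | Yes | Model context window size |
| num_chunks | _generate_chunk_sizes | int | No | Optional parameter from the signature; defaults to 8 |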
Outputs
| Name | Type | Description |
|---|---|---|
| count_tokens returns | int | Estimated token count |
| _generate_chunk_sizes returns | list[int] | Recommended chunk sizes in tokens |
Usage Examples
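from docetl.moar.search_utils import count_tokens

# Each message is a dict with a "content" key.
messages = [{"content": "This is a sample document with some text content."}]

# The result is an approximation (characters / 4), not an exact tokenizer count.
tokens = count_tokens(messages)
print(f"Estimated tokens: {tokens}")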