Implementation:Ucbepic Docetl Count Tokens

From Leeroopedia
Revision as of 17:00, 16 February 2026 by Admin (Auto-imported from implementations/Ucbepic_Docetl_Count_Tokens.md)


Knowledge Sources
Domains NLP, Token_Management
Last Updated 2026-02-08 01:40 GMT

Overview

A concrete utility, provided by DocETL's search utilities, for estimating token counts in text content.

Description

The count_tokens() function estimates token counts using a simple character-based heuristic (total characters / 4). The companion ConfigGenerator._generate_chunk_sizes() in the map optimizer generates a range of recommended chunk sizes based on document length distribution and model context limits.
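The character-based heuristic can be illustrated with a minimal sketch. This is not DocETL's source: the name estimate_tokens_sketch is hypothetical, and the sketch assumes that each message is a dict with a string "content" value, that missing content counts as empty, and that the division is floored.

```python
def estimate_tokens_sketch(messages: list[dict]) -> int:
    """Illustrative chars/4 token estimate (hypothetical, not DocETL's code)."""
    total_chars = 0
    for msg in messages:
        # Assumption: only dict messages with a string "content" contribute.
        content = msg.get("content", "") if isinstance(msg, dict) else ""
        if isinstance(content, str):
            total_chars += len(content)
    # Assumption: floor division, so a trailing partial token is dropped.
    return total_chars // 4


sample = [{"content": "abcdefgh"}, {"content": "ijkl"}]
print(estimate_tokens_sketch(sample))  # 12 characters // 4 = 3
```

Because the estimate depends only on character counts, it is cheap to compute but can drift from what a real tokenizer reports, especially for code or non-English text.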

Usage

Use count_tokens() for quick token estimates during optimization. For precise chunk size planning, use ConfigGenerator, which accounts for the full document length distribution.
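To show why the length distribution matters for chunk size planning, the sketch below derives candidate chunk sizes from per-document token estimates and a context limit. It is not the algorithm used by _generate_chunk_sizes(), which this page does not describe. The function name candidate_chunk_sizes_sketch, the median-based upper bound, and the even spacing are all assumptions chosen for illustration.

```python
import statistics


def candidate_chunk_sizes_sketch(
    docs: list[dict],
    split_key: str,
    token_limit: int,
    num_chunks: int = 8,
) -> list[int]:
    """Hypothetical example: evenly spaced chunk sizes capped by the token limit."""
    # Estimate tokens per document with the chars/4 heuristic.
    lengths = [len(str(doc.get(split_key, ""))) // 4 for doc in docs]
    if not lengths:
        return []
    # Assumption: never propose a chunk larger than the median document
    # or the model's token limit, whichever is smaller.
    upper = min(int(statistics.median(lengths)), token_limit)
    if upper < 1:
        return []
    # Evenly spaced sizes from upper/num_chunks up to upper, deduplicated.
    sizes = {max(1, upper * i // num_chunks) for i in range(1, num_chunks + 1)}
    return sorted(sizes)


docs = [{"text": "x" * 4000}, {"text": "x" * 8000}, {"text": "x" * 16000}]
print(candidate_chunk_sizes_sketch(docs, "text", token_limit=1000, num_chunks=4))
# [250, 500, 750, 1000]
```

In this example the median document is estimated at 2000 tokens, so the 1000-token limit becomes the binding upper bound. A single-document estimate would not reveal that.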

Code Reference

Source Location

  • Repository: docetl
  • Files: docetl/moar/search_utils.py (L35-41); docetl/optimizers/map_optimizer/config_generators.py (L412-455)

Signature

from typing import Any  # needed for the annotations below

def count_tokens(messages) -> int:
    """Count estimated tokens in messages list.
    Uses characters / 4 heuristic."""

class ConfigGenerator:
    def _generate_chunk_sizes(
        self,
        split_key: str,
        input_data_sample: list[dict[str, Any]],
        token_limit: int,
        num_chunks: int = 8,
    ) -> list[int]:
        """Generate recommended chunk sizes based on document length distribution."""

Import

from docetl.moar.search_utils import count_tokens

I/O Contract

Inputs

Name Type Required Description
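messages list[dict] Yes Messages with "content" keys to count tokens for (count_tokens)
split_key str Yes Document field name to measure (_generate_chunk_sizes)
input_data_sample list[dict[str, Any]] Yes Sample of input documents (_generate_chunk_sizes)
token_limit int Yes Model context window size (_generate_chunk_sizes)
num_chunks int No Defaults to 8 (_generate_chunk_sizes)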

Outputs

Name Type Description
count_tokens returns int Estimated token count
_generate_chunk_sizes returns list[int] Recommended chunk sizes in tokens

Usage Examples

from docetl.moar.search_utils import count_tokens

messages = [{"content": "This is a sample document with some text content."}]
tokens = count_tokens(messages)
print(f"Estimated tokens: {tokens}")

Related Pages

Implements Principle
