
Heuristic: Cohere AI (cohere-python) Tokenizer Cache With TTL

From Leeroopedia
Knowledge Sources
Domains: Optimization, Caching
Last Updated: 2026-02-15 14:00 GMT

Overview

Tokenizer configurations are downloaded once per model and cached in memory with a one-hour TTL; if offline tokenization fails, the SDK silently falls back to API-based tokenization.

Description

The SDK provides offline tokenization by downloading HuggingFace tokenizer configs from Cohere's model API and caching them in a simple in-memory dictionary with lazy expiration. The cache uses a `(expiry_timestamp, value)` tuple pattern where expired entries are removed on next access. The default TTL is 3600 seconds (1 hour). If offline tokenization fails for any reason, the SDK silently falls back to the API-based tokenize/detokenize endpoints with a warning header (`sdk-api-warning-message: offline_tokenizer_failed`).
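The `(expiry_timestamp, value)` tuple pattern described above can be sketched in isolation. The class and key names below are illustrative, not the SDK's actual code; only the lazy-expiration mechanics mirror the description:

```python
import time
from typing import Any, Dict, Optional, Tuple


class TTLCache:
    """Minimal sketch of an (expiry_timestamp, value) cache with lazy expiration."""

    def __init__(self) -> None:
        self._cache: Dict[str, Tuple[Optional[float], Any]] = {}

    def get(self, key: str) -> Any:
        entry = self._cache.get(key)
        if entry is None:
            return None
        expiry, value = entry
        if expiry is None or expiry > time.time():
            return value
        del self._cache[key]  # lazy expiration: expired entry removed on next access
        return None

    def set(self, key: str, value: Any, ttl: Optional[int] = 60 * 60) -> None:
        # ttl=None means the entry never expires
        expiry = time.time() + ttl if ttl is not None else None
        self._cache[key] = (expiry, value)


cache = TTLCache()
cache.set("tokenizer:some-model", "config-blob", ttl=1)
print(cache.get("tokenizer:some-model"))  # fresh entry is returned
time.sleep(1.1)
print(cache.get("tokenizer:some-model"))  # expired entry returns None
```

Note that nothing evicts entries proactively: an expired entry occupies memory until the next `get` for its key, which is the trade-off of lazy expiration.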

Usage

This heuristic is relevant when:

  • Optimizing tokenization throughput (avoid repeated downloads for the same model)
  • Working in offline or restricted environments (first call must download, subsequent calls use cache)
  • Debugging tokenization failures (check if offline mode silently fell back to API)
  • Long-running processes (cache expires after 1 hour, triggering a re-download)

The Insight (Rule of Thumb)

  • Action: Use the default `offline=True` for tokenization; the SDK handles caching automatically.
  • Value: `ttl = 60 * 60` (1 hour), in-memory dictionary cache, lazy expiration.
  • Trade-off: In-memory cache is lost on process restart. The 1-hour TTL means tokenizer configs are re-downloaded hourly in long-running services. No size limit on cache.
  • Fallback: If offline tokenization fails (network error, missing tokenizer URL), the SDK silently falls back to API calls.
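The fallback in the last bullet is a try/except wrapper that tags the request with a warning header before retrying via the API. A self-contained sketch of that shape, where the header value matches the SDK's but the surrounding functions are stand-ins:

```python
from typing import Dict, List, Tuple


def local_tokenize(text: str) -> List[int]:
    """Stand-in for offline tokenization; raises to simulate a failure."""
    raise RuntimeError("tokenizer config download failed")


def api_tokenize(text: str, headers: Dict[str, str]) -> List[int]:
    """Stand-in for the API tokenize endpoint (dummy tokens)."""
    return [ord(c) for c in text]


def tokenize(text: str, offline: bool = True) -> Tuple[List[int], Dict[str, str]]:
    headers: Dict[str, str] = {}
    if offline:
        try:
            return local_tokenize(text), headers
        except Exception:
            # Silent fallback: attach the warning header and call the API instead.
            headers["sdk-api-warning-message"] = "offline_tokenizer_failed"
    return api_tokenize(text, headers), headers


tokens, headers = tokenize("hi")
print(headers)  # {'sdk-api-warning-message': 'offline_tokenizer_failed'}
```

From the caller's perspective the request always succeeds; only the header (and the slower latency) reveals that offline tokenization failed.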

Reasoning

Tokenizer configs can be several megabytes and downloading them for every tokenize call would be prohibitively slow. The 1-hour TTL balances freshness (tokenizer configs rarely change) against memory usage. The silent fallback ensures tokenization never fails completely; API calls are slower but always available. The warning header allows server-side monitoring of offline tokenizer failures without breaking the user's workflow.

Code Evidence

Cache implementation from `manually_maintained/cache.py:5-23`:

import time
import typing


class CacheMixin:
    # A simple in-memory cache with TTL (thread safe).
    # This is used to cache tokenizers at the moment.
    _cache: typing.Dict[str, typing.Tuple[typing.Optional[float], typing.Any]] = dict()

    def _cache_get(self, key: str) -> typing.Any:
        val = self._cache.get(key)
        if val is None:
            return None
        expiry_timestamp, value = val
        if expiry_timestamp is None or expiry_timestamp > time.time():
            return value
        del self._cache[key]  # remove expired cache entry
        return None

    def _cache_set(self, key: str, value: typing.Any, ttl: int = 60 * 60) -> None:
        expiry_timestamp = None
        if ttl is not None:
            expiry_timestamp = time.time() + ttl
        self._cache[key] = (expiry_timestamp, value)

Tokenizer caching from `manually_maintained/tokenizers.py:19-40`:

import logging

import requests
from tokenizers import Tokenizer

logger = logging.getLogger(__name__)


# tokenizer_cache_key() and _get_tokenizer_config_size() are defined
# elsewhere in the same module.
def get_hf_tokenizer(co: "Client", model: str) -> Tokenizer:
    """Returns a HF tokenizer from a given tokenizer config URL."""
    tokenizer = co._cache_get(tokenizer_cache_key(model))
    if tokenizer is not None:
        return tokenizer
    tokenizer_url = co.models.get(model).tokenizer_url
    if not tokenizer_url:
        raise ValueError(f"No tokenizer URL found for model {model}")

    try:
        size = _get_tokenizer_config_size(tokenizer_url)
        logger.info(f"Downloading tokenizer for model {model}. Size is {size} MBs.")
    except Exception as e:
        logger.warn(f"Failed to get the size of the tokenizer config: {e}")

    response = requests.get(tokenizer_url)
    tokenizer = Tokenizer.from_str(response.text)
    co._cache_set(tokenizer_cache_key(model), tokenizer)
    return tokenizer

Silent fallback to API from `client.py:271-279`:

if offline:
    try:
        tokens = local_tokenizers.local_tokenize(self, text=text, model=model)
        return TokenizeResponse(tokens=tokens, token_strings=[])
    except Exception:
        # Fallback to calling the API.
        opts["additional_headers"] = opts.get("additional_headers", {})
        opts["additional_headers"]["sdk-api-warning-message"] = "offline_tokenizer_failed"
