# Heuristic: Cohere Python SDK Tokenizer Cache With TTL
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Caching |
| Last Updated | 2026-02-15 14:00 GMT |
## Overview
Tokenizer configurations are downloaded once per model and cached in-memory with a 1-hour TTL, with silent fallback to API-based tokenization on failure.
## Description
The SDK provides offline tokenization by downloading HuggingFace tokenizer configs from Cohere's model API and caching them in a simple in-memory dictionary with lazy expiration. The cache uses a `(expiry_timestamp, value)` tuple pattern where expired entries are removed on next access. The default TTL is 3600 seconds (1 hour). If offline tokenization fails for any reason, the SDK silently falls back to the API-based tokenize/detokenize endpoints with a warning header (`sdk-api-warning-message: offline_tokenizer_failed`).
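The `(expiry_timestamp, value)` pattern with lazy expiration can be sketched in isolation. This is a minimal illustration of the technique, not the SDK's actual class (names here are hypothetical):

```python
import time
import typing


class TTLCache:
    """Minimal sketch of the (expiry_timestamp, value) tuple pattern:
    expired entries are evicted only when they are next read."""

    def __init__(self) -> None:
        self._cache: typing.Dict[str, typing.Tuple[typing.Optional[float], typing.Any]] = {}

    def get(self, key: str) -> typing.Any:
        entry = self._cache.get(key)
        if entry is None:
            return None
        expiry, value = entry
        if expiry is None or expiry > time.time():
            return value
        del self._cache[key]  # lazy eviction: expired entry removed on access
        return None

    def set(self, key: str, value: typing.Any, ttl: typing.Optional[float] = 60 * 60) -> None:
        expiry = time.time() + ttl if ttl is not None else None
        self._cache[key] = (expiry, value)
```

Lazy expiration keeps the implementation to a few lines: there is no background sweeper thread, so stale entries cost memory only until the next lookup of the same key.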
## Usage
This heuristic is relevant when:
- Optimizing tokenization throughput (avoid repeated downloads for the same model)
- Working in offline or restricted environments (first call must download, subsequent calls use cache)
- Debugging tokenization failures (check if offline mode silently fell back to API)
- Long-running processes (cache expires after 1 hour, triggering a re-download)
## The Insight (Rule of Thumb)
- Action: Use the default `offline=True` for tokenization; the SDK handles caching automatically.
- Value: `ttl = 60 * 60` (1 hour), in-memory dictionary cache, lazy expiration.
- Trade-off: In-memory cache is lost on process restart. The 1-hour TTL means tokenizer configs are re-downloaded hourly in long-running services. No size limit on cache.
- Fallback: If offline tokenization fails (network error, missing tokenizer URL), the SDK silently falls back to API calls.
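If the restart loss or hourly re-download matters in a long-running service, one workaround is to persist cache entries to disk so a restart can re-hydrate from a local file instead of the network. A minimal sketch of that idea, assuming string values and a writable cache directory; the file layout and function names are illustrative and not part of the SDK (note the SDK parses the config from raw text via `Tokenizer.from_str`, so the config string itself is what you would persist):

```python
import json
import time
import typing
from pathlib import Path


def disk_cache_set(cache_dir: Path, key: str, value: str, ttl: float = 60 * 60) -> None:
    """Write a value and its expiry timestamp to a per-key JSON file."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    entry = {"expiry": time.time() + ttl, "value": value}
    (cache_dir / f"{key}.json").write_text(json.dumps(entry))


def disk_cache_get(cache_dir: Path, key: str) -> typing.Optional[str]:
    """Return the cached value if present and unexpired, else None."""
    path = cache_dir / f"{key}.json"
    if not path.exists():
        return None
    entry = json.loads(path.read_text())
    if entry["expiry"] > time.time():
        return entry["value"]
    path.unlink()  # lazy eviction, mirroring the in-memory cache
    return None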
## Reasoning
Tokenizer configs can be several megabytes and downloading them for every tokenize call would be prohibitively slow. The 1-hour TTL balances freshness (tokenizer configs rarely change) against memory usage. The silent fallback ensures tokenization never fails completely; API calls are slower but always available. The warning header allows server-side monitoring of offline tokenizer failures without breaking the user's workflow.
## Code Evidence
Cache implementation from `manually_maintained/cache.py:5-23`:
```python
import time
import typing


class CacheMixin:
    # A simple in-memory cache with TTL (thread safe).
    # This is used to cache tokenizers at the moment.
    _cache: typing.Dict[str, typing.Tuple[typing.Optional[float], typing.Any]] = dict()

    def _cache_get(self, key: str) -> typing.Any:
        val = self._cache.get(key)
        if val is None:
            return None
        expiry_timestamp, value = val
        if expiry_timestamp is None or expiry_timestamp > time.time():
            return value
        del self._cache[key]  # remove expired cache entry

    def _cache_set(self, key: str, value: typing.Any, ttl: int = 60 * 60) -> None:
        expiry_timestamp = None
        if ttl is not None:
            expiry_timestamp = time.time() + ttl
        self._cache[key] = (expiry_timestamp, value)
```
Tokenizer caching from `manually_maintained/tokenizers.py:19-40`:
```python
def get_hf_tokenizer(co: "Client", model: str) -> Tokenizer:
    """Returns a HF tokenizer from a given tokenizer config URL."""
    tokenizer = co._cache_get(tokenizer_cache_key(model))
    if tokenizer is not None:
        return tokenizer

    tokenizer_url = co.models.get(model).tokenizer_url
    if not tokenizer_url:
        raise ValueError(f"No tokenizer URL found for model {model}")

    try:
        size = _get_tokenizer_config_size(tokenizer_url)
        logger.info(f"Downloading tokenizer for model {model}. Size is {size} MBs.")
    except Exception as e:
        logger.warn(f"Failed to get the size of the tokenizer config: {e}")

    response = requests.get(tokenizer_url)
    tokenizer = Tokenizer.from_str(response.text)
    co._cache_set(tokenizer_cache_key(model), tokenizer)
    return tokenizer
```
Silent fallback to API from `client.py:271-279`:
```python
if offline:
    try:
        tokens = local_tokenizers.local_tokenize(self, text=text, model=model)
        return TokenizeResponse(tokens=tokens, token_strings=[])
    except Exception:
        # Fallback to calling the API.
        opts["additional_headers"] = opts.get("additional_headers", {})
        opts["additional_headers"]["sdk-api-warning-message"] = "offline_tokenizer_failed"
```
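The fallback pattern itself is easy to reproduce: attempt the local path, and on any exception tag the request with a warning header before retrying via the API. A self-contained sketch of that pattern (the header name mirrors the excerpt above; the tokenize callables are hypothetical stand-ins for the SDK's local and API paths):

```python
import typing


def tokenize_with_fallback(
    text: str,
    local_tokenize: typing.Callable[[str], typing.List[int]],
    api_tokenize: typing.Callable[[str, typing.Dict[str, str]], typing.List[int]],
) -> typing.List[int]:
    """Try offline tokenization first; on any failure, fall back to the
    API with a warning header so the failure is observable server-side."""
    try:
        return local_tokenize(text)
    except Exception:
        headers = {"sdk-api-warning-message": "offline_tokenizer_failed"}
        return api_tokenize(text, headers)


def broken_local(_: str) -> typing.List[int]:
    """Simulates an offline tokenizer failure (e.g. missing config)."""
    raise RuntimeError("tokenizer config unavailable")


seen_headers: typing.Dict[str, str] = {}


def fake_api(text: str, headers: typing.Dict[str, str]) -> typing.List[int]:
    """Records the headers it received; returns stand-in token IDs."""
    seen_headers.update(headers)
    return [len(w) for w in text.split()]
```

Because the exception is swallowed, the only client-visible signals of a fallback are latency and the outgoing header, which is why the heuristic recommends checking for silent fallback when debugging tokenization issues.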