Heuristic: EvolvingLMMs-Lab lmms-eval Request Caching Strategy
| Knowledge Sources | |
|---|---|
| Domains | Caching, Optimization |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Evaluation request caching with highly specific cache keys and full-document processing to avoid stale or incomplete cached data.
Description
The lmms-eval framework supports caching built evaluation requests (the preprocessed instances sent to models) to avoid re-processing datasets on subsequent runs. Cache keys are constructed from multiple parameters to prevent cache collisions: task name, `num_fewshot`, `rank`, `world_size`, chat template usage, multiturn format, system prompt hash, and tokenizer name. A critical implementation detail is that when building a cache, the `--limit` flag is temporarily ignored so the cache contains the complete dataset. The limit is applied only after cache loading.
Additionally, `doc_to_visual` (a callable) cannot be serialized, so it must be restored manually after loading from the cache.
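The serialization constraint is easy to demonstrate. The snippet below is an illustrative sketch only: the argument tuple mimics, but is not, lmms-eval's actual `Instance.arguments` layout.

```python
import pickle

# Hypothetical request arguments: slot 2 mimics lmms-eval's
# doc_to_visual callable (the real tuple layout may differ).
doc_to_visual = lambda doc: doc["image"]
arguments = ("context", {"until": ["\n"]}, doc_to_visual)

try:
    pickle.dumps(arguments)  # fails: lambdas cannot be pickled
    serialized = True
except (pickle.PicklingError, AttributeError, TypeError):
    serialized = False

# Cache-friendly workaround: drop the callable before writing,
# then restore it from the task object after loading.
cacheable = (arguments[0], arguments[1], None)
restored = (cacheable[0], cacheable[1], doc_to_visual)

print(serialized)                    # → False
print(restored[2] is doc_to_visual)  # → True
```

This is why cached instances carry `None` in the callable slot and why the restoration pass shown later in this article is required.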
Usage
This heuristic applies when using `--cache_requests true` or `--cache_requests refresh`. It is particularly valuable when:
- Re-running the same task with different models
- Developing and debugging task configurations
- Running distributed evaluations where request building is expensive
The Insight (Rule of Thumb)
- Action: Enable request caching with `--cache_requests true` for repeated evaluations on the same tasks.
- Value: Cache keys include 8+ parameters to prevent stale data; limit is ignored during cache building.
- Trade-off: Disk space for cache files vs. significant time savings on dataset preprocessing. Cache invalidation requires explicit `--cache_requests refresh`.
- Caveat: `doc_to_visual` callables cannot be pickled and are set to `None` in the cache, requiring manual restoration.
Reasoning
Request building involves loading datasets, applying templates, constructing fewshot examples, and processing media — all computationally expensive operations. For tasks with large datasets (10,000+ instances), this can take several minutes even before any model inference begins. Caching these prepared requests eliminates this overhead on subsequent runs.
The cache key specificity prevents subtle bugs: a request cached with one chat template would be invalid for a different template. Similarly, distributed caches are rank-specific because each rank processes a different data shard.
The limit-override behavior ensures that a cache built during a `--limit 8` run is never incomplete for a subsequent full evaluation: the full dataset is always cached, and the limit is applied only after loading.
Cache key construction from `lmms_eval/api/task.py:410-434`:

```python
cache_key = f"requests-{self._config.task}-{self.config.num_fewshot}shot-rank{rank}-world_size{world_size}"
if offset:
    cache_key += f"-offset{offset}"
cache_key += "-chat_template" if apply_chat_template else ""
cache_key += "-fewshot_as_multiturn" if fewshot_as_multiturn else ""
cache_key += f"-system_prompt_hash{utils.hash_string(system_instruction)}" if system_instruction is not None else ""
cache_key += f"-tokenizer{tokenizer_name}"
```
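To see how this scheme keeps runs isolated, here is a self-contained sketch; `build_cache_key` and the SHA-256-based `hash_string` are stand-ins for illustration, not lmms-eval's actual helpers.

```python
import hashlib

def hash_string(s: str) -> str:
    # Stand-in for utils.hash_string; the real helper may differ.
    return hashlib.sha256(s.encode("utf-8")).hexdigest()

def build_cache_key(task, num_fewshot, rank, world_size, offset=0,
                    apply_chat_template=False, fewshot_as_multiturn=False,
                    system_instruction=None, tokenizer_name=""):
    key = f"requests-{task}-{num_fewshot}shot-rank{rank}-world_size{world_size}"
    if offset:
        key += f"-offset{offset}"
    key += "-chat_template" if apply_chat_template else ""
    key += "-fewshot_as_multiturn" if fewshot_as_multiturn else ""
    key += f"-system_prompt_hash{hash_string(system_instruction)}" if system_instruction is not None else ""
    key += f"-tokenizer{tokenizer_name}"
    return key

# Changing any one parameter yields a distinct key, so runs with
# different chat templates (or ranks, tokenizers, ...) never collide.
a = build_cache_key("mme", 0, 0, 2, tokenizer_name="llava")
b = build_cache_key("mme", 0, 0, 2, apply_chat_template=True, tokenizer_name="llava")
print(a)       # → requests-mme-0shot-rank0-world_size2-tokenizerllava
print(a != b)  # → True
```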
Limit override during cache building from `lmms_eval/api/task.py:432-434`:

```python
# process all documents when caching is specified for simplicity
if cache_requests and (not cached_instances or rewrite_requests_cache) and limit is not None:
    limit = None
```
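The end-to-end effect can be sketched with a toy model (this is not the actual lmms-eval code; `build_requests` and the dict-backed cache are invented for illustration): the first run with a limit still caches everything, and later runs slice the cached list.

```python
def build_requests(docs, cache, cache_key, cache_requests=True, limit=None):
    """Toy sketch of the limit-override flow."""
    cached = cache.get(cache_key)
    if cache_requests and not cached and limit is not None:
        limit = None  # build over the complete dataset, ignoring --limit
    if cached is None:
        cached = [f"request-for-{d}" for d in docs]  # expensive build
        cache[cache_key] = cached
    # The limit is applied only when serving, after cache loading.
    return cached[:limit] if limit is not None else cached

cache = {}
docs = list(range(100))
first = build_requests(docs, cache, "k", limit=8)   # build run: caches all 100
second = build_requests(docs, cache, "k", limit=8)  # cached run: limit honored
print(len(first), len(cache["k"]), len(second))     # → 100 100 8
```

Note that on the build run the limit is dropped entirely, matching the "process all documents ... for simplicity" comment in the quoted source.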
`doc_to_visual` restoration from `lmms_eval/api/task.py:503-510`:

```python
# FIXME: Bo - We need to check if the doc_to_visual if it's exists and restore it.
# If we use cache, the doc_to_visual will be None since it's not serializable
for instance in self._instances:
    if instance.arguments[2] is None:
        arguments = (instance.arguments[0], instance.arguments[1], self.doc_to_visual, *instance.arguments[3:])
    else:
        arguments = instance.arguments
    instance.arguments = arguments
```
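The restoration pass can be exercised in isolation with a simplified stand-in `Instance` class (the real class in lmms-eval is richer; only the arguments-tuple shape is assumed here):

```python
class Instance:
    """Minimal stand-in: arguments is (context, gen_kwargs, doc_to_visual, ...)."""
    def __init__(self, arguments):
        self.arguments = arguments

def doc_to_visual(doc):  # stand-in for the task's real callable
    return doc

# After loading from cache, slot 2 is None because callables were
# dropped before pickling; patch it back from the task object.
instances = [Instance(("ctx", {}, None, 0)), Instance(("ctx", {}, doc_to_visual, 1))]
for instance in instances:
    if instance.arguments[2] is None:
        instance.arguments = (instance.arguments[0], instance.arguments[1],
                              doc_to_visual, *instance.arguments[3:])

print(all(inst.arguments[2] is doc_to_visual for inst in instances))  # → True
```

Rebuilding the tuple rather than mutating it in place mirrors the quoted source: tuples are immutable, so the whole `arguments` value must be replaced.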