Heuristic: EvolvingLMMs-Lab lmms-eval Request Caching Strategy
| Knowledge Sources | |
|---|---|
| Domains | Caching, Optimization |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Evaluation request caching with highly specific cache keys and full-document processing to avoid stale or incomplete cached data.
Description
The lmms-eval framework supports caching built evaluation requests (the preprocessed instances sent to models) to avoid re-processing datasets on subsequent runs. Cache keys are constructed from multiple parameters to prevent cache collisions: task name, `num_fewshot`, `rank`, `world_size`, chat template usage, multiturn format, system prompt hash, and tokenizer name. A critical implementation detail is that when building a cache, the `--limit` flag is temporarily ignored so the cache contains the complete dataset. The limit is applied only after cache loading.
Additionally, `doc_to_visual` (a callable) cannot be serialized, so it must be restored manually after loading from the cache.
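The serialization constraint is easy to demonstrate. The snippet below is an illustrative sketch only: the argument tuple mimics, but is not, lmms-eval's actual `Instance.arguments` layout.

```python
import pickle

# Hypothetical request arguments: slot 2 mimics lmms-eval's
# doc_to_visual callable (the real tuple layout may differ).
doc_to_visual = lambda doc: doc["image"]
arguments = ("context", {"until": ["\n"]}, doc_to_visual)

try:
    pickle.dumps(arguments)  # fails: lambdas cannot be pickled
    serialized = True
except (pickle.PicklingError, AttributeError, TypeError):
    serialized = False

# Cache-friendly workaround: drop the callable before writing,
# then restore it from the task object after loading.
cacheable = (arguments[0], arguments[1], None)
restored = (cacheable[0], cacheable[1], doc_to_visual)

print(serialized)                    # → False
print(restored[2] is doc_to_visual)  # → True
```

This is why cached instances carry `None` in the callable slot and why the restoration pass shown later in this article is required.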
Usage
This heuristic applies when using `--cache_requests true` or `--cache_requests refresh`. It is particularly valuable when:
- Re-running the same task with different models
- Developing and debugging task configurations
- Running distributed evaluations where request building is expensive
The Insight (Rule of Thumb)
- Action: Enable request caching with `--cache_requests true` for repeated evaluations on the same tasks.
- Value: Cache keys include 8+ parameters to prevent stale data; limit is ignored during cache building.
- Trade-off: Disk space for cache files vs. significant time savings on dataset preprocessing. Cache invalidation requires explicit `--cache_requests refresh`.
- Caveat: `doc_to_visual` callables cannot be pickled and are set to `None` in the cache, requiring manual restoration.
Reasoning
Request building involves loading datasets, applying templates, constructing fewshot examples, and processing media — all computationally expensive operations. For tasks with large datasets (10,000+ instances), this can take several minutes even before any model inference begins. Caching these prepared requests eliminates this overhead on subsequent runs.
The cache key specificity prevents subtle bugs: a request cached with one chat template would be invalid for a different template. Similarly, distributed caches are rank-specific because each rank processes a different data shard.
The limit-override behavior ensures that a cache built during a `--limit 8` run is never incomplete for a subsequent full evaluation: the full dataset is always cached, and the limit is applied only after loading.
Cache key construction from `lmms_eval/api/task.py:410-434`:

```python
cache_key = f"requests-{self._config.task}-{self.config.num_fewshot}shot-rank{rank}-world_size{world_size}"
if offset:
    cache_key += f"-offset{offset}"
cache_key += "-chat_template" if apply_chat_template else ""
cache_key += "-fewshot_as_multiturn" if fewshot_as_multiturn else ""
cache_key += f"-system_prompt_hash{utils.hash_string(system_instruction)}" if system_instruction is not None else ""
cache_key += f"-tokenizer{tokenizer_name}"
```
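To see how this scheme keeps runs isolated, here is a self-contained sketch; `build_cache_key` and the SHA-256-based `hash_string` are stand-ins for illustration, not lmms-eval's actual helpers.

```python
import hashlib

def hash_string(s: str) -> str:
    # Stand-in for utils.hash_string; the real helper may differ.
    return hashlib.sha256(s.encode("utf-8")).hexdigest()

def build_cache_key(task, num_fewshot, rank, world_size, offset=0,
                    apply_chat_template=False, fewshot_as_multiturn=False,
                    system_instruction=None, tokenizer_name=""):
    key = f"requests-{task}-{num_fewshot}shot-rank{rank}-world_size{world_size}"
    if offset:
        key += f"-offset{offset}"
    key += "-chat_template" if apply_chat_template else ""
    key += "-fewshot_as_multiturn" if fewshot_as_multiturn else ""
    key += f"-system_prompt_hash{hash_string(system_instruction)}" if system_instruction is not None else ""
    key += f"-tokenizer{tokenizer_name}"
    return key

# Changing any one parameter yields a distinct key, so runs with
# different chat templates (or ranks, tokenizers, ...) never collide.
a = build_cache_key("mme", 0, 0, 2, tokenizer_name="llava")
b = build_cache_key("mme", 0, 0, 2, apply_chat_template=True, tokenizer_name="llava")
print(a)       # → requests-mme-0shot-rank0-world_size2-tokenizerllava
print(a != b)  # → True
```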
Limit override during cache building from `lmms_eval/api/task.py:432-434`:

```python
# process all documents when caching is specified for simplicity
if cache_requests and (not cached_instances or rewrite_requests_cache) and limit is not None:
    limit = None
```
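The end-to-end effect can be sketched with a toy model (this is not the actual lmms-eval code; `build_requests` and the dict-backed cache are invented for illustration): the first run with a limit still caches everything, and later runs slice the cached list.

```python
def build_requests(docs, cache, cache_key, cache_requests=True, limit=None):
    """Toy sketch of the limit-override flow."""
    cached = cache.get(cache_key)
    if cache_requests and not cached and limit is not None:
        limit = None  # build over the complete dataset, ignoring --limit
    if cached is None:
        cached = [f"request-for-{d}" for d in docs]  # expensive build
        cache[cache_key] = cached
    # The limit is applied only when serving, after cache loading.
    return cached[:limit] if limit is not None else cached

cache = {}
docs = list(range(100))
first = build_requests(docs, cache, "k", limit=8)   # build run: caches all 100
second = build_requests(docs, cache, "k", limit=8)  # cached run: limit honored
print(len(first), len(cache["k"]), len(second))     # → 100 100 8
```

Note that on the build run the limit is dropped entirely, matching the "process all documents ... for simplicity" comment in the quoted source.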
`doc_to_visual` restoration from `lmms_eval/api/task.py:503-510`:

```python
# FIXME: Bo - We need to check if the doc_to_visual if it's exists and restore it.
# If we use cache, the doc_to_visual will be None since it's not serializable
for instance in self._instances:
    if instance.arguments[2] is None:
        arguments = (instance.arguments[0], instance.arguments[1], self.doc_to_visual, *instance.arguments[3:])
    else:
        arguments = instance.arguments
    instance.arguments = arguments
```
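The restoration pass can be exercised in isolation with a simplified stand-in `Instance` class (the real class in lmms-eval is richer; only the arguments-tuple shape is assumed here):

```python
class Instance:
    """Minimal stand-in: arguments is (context, gen_kwargs, doc_to_visual, ...)."""
    def __init__(self, arguments):
        self.arguments = arguments

def doc_to_visual(doc):  # stand-in for the task's real callable
    return doc

# After loading from cache, slot 2 is None because callables were
# dropped before pickling; patch it back from the task object.
instances = [Instance(("ctx", {}, None, 0)), Instance(("ctx", {}, doc_to_visual, 1))]
for instance in instances:
    if instance.arguments[2] is None:
        instance.arguments = (instance.arguments[0], instance.arguments[1],
                              doc_to_visual, *instance.arguments[3:])

print(all(inst.arguments[2] is doc_to_visual for inst in instances))  # → True
```

Rebuilding the tuple rather than mutating it in place mirrors the quoted source: tuples are immutable, so the whole `arguments` value must be replaced.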