Principle:EvolvingLMMs Lab Lmms eval Response Caching

Knowledge Sources	EvolvingLMMs_Lab_Lmms_eval
Domains	Caching, Performance Optimization
Last Updated	2026-02-14 00:00 GMT

Overview

Response Caching provides disk-based caching of evaluation requests and responses to avoid redundant computation when re-running evaluations. This principle establishes how the framework serializes, stores, and retrieves cached evaluation data to improve efficiency and enable iterative development.

Theoretical Basis

Cache Purpose

Caching serves several key purposes:

Avoid Redundant Inference: Skip expensive model inference for previously seen inputs
Iterative Development: Test metric changes without re-running inference
Cost Reduction: Minimize API calls for expensive models
Reproducibility: Store exact model responses for analysis
Debugging: Inspect cached responses without re-evaluation

Cache Scope

What gets cached:

Request Context: Input text, images, videos, and other media
Request Arguments: Model generation parameters
Model Responses: Raw model outputs before filtering
Task Context: Enough information to reconstruct evaluation

What does not get cached:

Metric Computations: Computed fresh each time
Aggregations: Recalculated from cached responses
Filter Applications: Applied to cached responses
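The split between cached and recomputed data can be illustrated with a small sketch. The response strings, `extract_choice`, and `accuracy` below are hypothetical stand-ins for a task's filter and metric, not framework code; the point is only that raw responses are stored once while filtering and scoring run again on every evaluation.

```python
# Raw model outputs, as they would come back from the cache (illustrative data).
raw_responses = ["  Answer: B ", "answer: c", "Answer: A"]
references = ["B", "C", "D"]


def extract_choice(response: str) -> str:
    """Hypothetical filter: keep the text after the last colon, upper-cased."""
    return response.strip().split(":")[-1].strip().upper()


def accuracy(predictions, targets) -> float:
    """Hypothetical metric: fraction of exact matches."""
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)


# Filters and metrics run on every evaluation; only raw_responses is reused.
filtered = [extract_choice(r) for r in raw_responses]
score = accuracy(filtered, references)
```

Changing `extract_choice` or `accuracy` therefore needs no new inference, because neither result is stored.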

Cache Organization

Cache structure:

Cache Directory: Configurable location (default: module/.cache/)
File Names: Based on task/model identifiers
File Suffix: Hash-based suffix for uniqueness
Format: Pickled Python objects using dill
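The naming scheme above can be sketched as follows. This is a minimal illustration rather than the library's code: `CACHE_DIR`, `HASH_INPUT`, and `cache_file_path` are hypothetical names, and deriving the suffix from a SHA-256 digest of a fixed string is an assumption made for the example.

```python
import hashlib
import os

# Hypothetical values, chosen for illustration only.
CACHE_DIR = os.path.join("module", ".cache")
HASH_INPUT = "example-project-id"

# A fixed digest keeps this project's cache files distinct from any other
# files that happen to share the directory.
HASH_SUFFIX = hashlib.sha256(HASH_INPUT.encode("utf-8")).hexdigest()


def cache_file_path(key: str, cache_dir: str = CACHE_DIR) -> str:
    """Return '<cache_dir>/<key>.<hash>.pickle' for a task/model key."""
    return os.path.join(cache_dir, f"{key}.{HASH_SUFFIX}.pickle")


path = cache_file_path("qwen25vl_videomme")
```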

Serialization Strategy

Handling non-serializable objects:

Callable Detection: Identify functions and methods
Argument Sanitization: Replace callables with None in arguments
Fallback Handling: Convert to serializable alternatives
Logging: Debug info for cache operations
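A minimal sketch of the sanitization step, assuming request arguments arrive as a tuple. `sanitize_arguments` and `doc_to_visual` are hypothetical names used only for this example.

```python
from typing import Any, Tuple


def sanitize_arguments(arguments: Tuple[Any, ...]) -> Tuple[Any, ...]:
    """Replace callables with None so the tuple can be pickled safely.

    Note that callable() is also True for classes, so those are dropped too.
    """
    return tuple(None if callable(arg) else arg for arg in arguments)


def doc_to_visual(doc):  # stand-in for a task-provided function
    return doc["image"]


raw_args = ("Describe the image.", doc_to_visual, {"max_new_tokens": 64})
clean_args = sanitize_arguments(raw_args)
```

The positions of the remaining arguments are preserved, which keeps the cached tuple aligned with the original. The callable itself cannot be recovered from the cache.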

Design Patterns

Cache Storage

Directory Management: Create cache directory if needed
File Naming: Consistent naming with hash suffix
Pickle Protocol: Use dill for enhanced serialization
Error Handling: Graceful fallback for serialization failures
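The storage pattern might look like the sketch below. It uses the standard-library `pickle` module as a stand-in for dill, which exposes the same `dumps`/`loads` interface; `save_object` is a hypothetical helper, not the framework's `save_to_cache`.

```python
import logging
import os
import pickle  # stand-in for dill in this sketch
import tempfile

logger = logging.getLogger(__name__)


def save_object(path: str, obj) -> bool:
    """Pickle obj to path; return False instead of raising on failure."""
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)  # create the cache directory if needed
    try:
        payload = pickle.dumps(obj)  # serialize first so no partial file is left
    except Exception as exc:  # e.g. lambdas or open file handles
        logger.debug("Could not serialize %s: %s", path, exc)
        return False
    with open(path, "wb") as handle:
        handle.write(payload)
    return True


demo_dir = tempfile.mkdtemp()
good_path = os.path.join(demo_dir, ".cache", "model_task.pickle")
bad_path = os.path.join(demo_dir, ".cache", "bad.pickle")

saved = save_object(good_path, [["a"], ["b"]])
skipped = save_object(bad_path, lambda x: x)  # plain pickle cannot handle lambdas
```

Serializing to bytes before opening the file means a failed attempt never leaves a truncated cache file behind.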

Cache Retrieval

File Existence Check: Quick check before loading
Deserialization: Load pickled objects with dill
Validation: Ensure cached data is compatible
Miss Handling: Return None on cache miss
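A retrieval helper following these steps could be sketched like this. Again `pickle` stands in for dill, and `load_object` is a hypothetical name rather than the framework's `load_from_cache`.

```python
import os
import pickle  # stand-in for dill in this sketch
import tempfile
from typing import Any, Optional


def load_object(path: str) -> Optional[Any]:
    """Return the unpickled object, or None on a miss or an unreadable file."""
    if not os.path.isfile(path):  # quick existence check
        return None
    try:
        with open(path, "rb") as handle:
            return pickle.load(handle)
    except Exception:  # corrupt or incompatible file: treat it as a miss
        return None


demo_dir = tempfile.mkdtemp()
hit_path = os.path.join(demo_dir, "model_task.pickle")
with open(hit_path, "wb") as handle:
    pickle.dump([["response"]], handle)

hit = load_object(hit_path)
miss = load_object(os.path.join(demo_dir, "absent.pickle"))
```

Because a miss is signalled with None, callers should test `is not None` rather than truthiness; an empty list is a valid cached value.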

Cache Management

Selective Deletion: Remove specific cached tasks
Pattern Matching: Delete by key prefix
Cache Invalidation: Clear cache when needed
Environment Override: Custom cache path via env var
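Prefix-based deletion and the environment override can be sketched as below. `resolve_cache_dir` and `delete_by_prefix` are hypothetical helpers written for this example; only the `LM_HARNESS_CACHE_PATH` variable name comes from this page.

```python
import os
import tempfile


def resolve_cache_dir(default: str) -> str:
    """Prefer the LM_HARNESS_CACHE_PATH environment variable when it is set."""
    return os.environ.get("LM_HARNESS_CACHE_PATH") or default


def delete_by_prefix(cache_dir: str, key: str = "") -> int:
    """Delete .pickle files whose names start with key; return the count.

    An empty key matches every cache file in the directory.
    """
    removed = 0
    for name in os.listdir(cache_dir):
        if name.startswith(key) and name.endswith(".pickle"):
            os.remove(os.path.join(cache_dir, name))
            removed += 1
    return removed


demo_dir = tempfile.mkdtemp()
for name in ("videomme_a.pickle", "videomme_b.pickle", "mme.pickle"):
    open(os.path.join(demo_dir, name), "wb").close()

removed = delete_by_prefix(demo_dir, key="videomme")
remaining = sorted(os.listdir(demo_dir))
```

With prefix matching, a key only selects files whose names begin with it, so the order of components in the cache key determines what can be deleted selectively.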

Cache Invalidation

When to invalidate cache:

Model version changes
Task definition changes
Request construction changes
Random seed changes (if applicable)
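Since invalidation is manual, one possible way to avoid reusing stale entries is to fold the invalidating factors into the cache key itself. The sketch below is an assumption about how that could be done, not something the framework is stated to do; `versioned_cache_key` and its keyword arguments are hypothetical.

```python
import hashlib
import json


def versioned_cache_key(model_name: str, task_name: str, **fingerprint) -> str:
    """Append a short digest of everything that should invalidate the cache."""
    # sort_keys makes the digest independent of keyword order.
    canonical = json.dumps(fingerprint, sort_keys=True, default=str)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
    return f"{model_name}_{task_name}_{digest}"


key_v1 = versioned_cache_key("qwen25vl", "videomme", model_version="1.0", seed=0)
key_v1_again = versioned_cache_key("qwen25vl", "videomme", seed=0, model_version="1.0")
key_v2 = versioned_cache_key("qwen25vl", "videomme", model_version="1.1", seed=0)
```

A changed model version or seed then yields a different key, so the old file is simply never looked up. It still has to be cleaned up separately.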

Storage Efficiency

Use binary pickle format for space efficiency
Consider compression for large caches
Monitor cache directory size
Implement cleanup strategies
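Compression and size monitoring could be prototyped with the standard library, as in the sketch below. The current cache is described as uncompressed, so `save_compressed`, `load_compressed`, and `directory_size_bytes` are hypothetical helpers showing one option, with `pickle` standing in for dill.

```python
import gzip
import os
import pickle  # stand-in for dill in this sketch
import tempfile


def save_compressed(path: str, obj) -> None:
    """Write obj as a gzip-compressed pickle."""
    with gzip.open(path, "wb") as handle:
        pickle.dump(obj, handle)


def load_compressed(path: str):
    """Read an object written by save_compressed."""
    with gzip.open(path, "rb") as handle:
        return pickle.load(handle)


def directory_size_bytes(cache_dir: str) -> int:
    """Sum the sizes of all files under cache_dir."""
    total = 0
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


demo_dir = tempfile.mkdtemp()
responses = [["the same answer"] * 50 for _ in range(200)]
path = os.path.join(demo_dir, "model_task.pickle.gz")

save_compressed(path, responses)
restored = load_compressed(path)
size = directory_size_bytes(demo_dir)
```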

Serialization Robustness

Handle diverse object types (tensors, images, etc.)
Gracefully handle non-serializable items
Preserve enough context for reconstruction
Log serialization issues for debugging

Cache Location

Default to module directory for isolation
Support environment variable override
Consider user permissions
Document cache location clearly

Usage Examples

Basic Cache Usage

from lmms_eval.caching.cache import load_from_cache, save_to_cache

# Try loading from cache
cache_key = f"{model_name}_{task_name}"
cached_results = load_from_cache(cache_key)

if cached_results is not None:
    # Use cached results
    requests = cached_results
else:
    # Run evaluation
    requests = run_evaluation(model, task)

    # Save to cache
    save_to_cache(cache_key, requests)

Environment Variables

# Override cache location
export LM_HARNESS_CACHE_PATH=/custom/cache/path

# Run evaluation (uses custom cache)
python -m lmms_eval --model qwen25vl --tasks videomme

Cache Management

from lmms_eval.caching.cache import delete_cache

# Delete all cache files
delete_cache()

# Delete cache files whose key starts with "videomme"
delete_cache(key="videomme")

# Delete cache files whose key starts with "qwen25vl"
delete_cache(key="qwen25vl")

Development Workflow

# First run: populate cache
python -m lmms_eval --model qwen25vl --tasks videomme

# Modify metrics/filters
# edit task YAML

# Second run: uses cache, recomputes metrics only
python -m lmms_eval --model qwen25vl --tasks videomme

A/B Testing Metrics

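from lmms_eval.caching.cache import load_from_cache

# Run once to cache responses
evaluate(model, tasks)

# Load the cached responses once and reuse them for both variants
cached_responses = load_from_cache(f"{model_name}_{task_name}")

# Test metric variant A
metric_results_a = compute_metrics(cached_responses, metric_a)

# Test metric variant B (no re-inference)
metric_results_b = compute_metrics(cached_responses, metric_b)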

Debugging

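from lmms_eval.caching.cache import load_from_cache

# Load cached requests for inspection
cached = load_from_cache("model_task")

# A cache miss returns None, so fall back to an empty list
for request_group in cached or []:
    for request in request_group:
        print(f"Input: {request.arguments}")
        print(f"Response: {request.response}")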

Cache File Format

File Structure

.cache/
├── model_task.{hash}.pickle
├── another_task.{hash}.pickle
└── ...

Cached Object Structure

[
    [  # Request group 1
        Request(arguments=(...), response="...", ...),
        Request(arguments=(...), response="...", ...),
    ],
    [  # Request group 2
        Request(arguments=(...), response="...", ...),
    ],
]

Performance Considerations

Cache Hit Performance

Deserialization is typically much faster than inference
Pickle loading is I/O bound
Consider SSD for cache storage
Monitor cache file sizes

Cache Miss Performance

Negligible overhead when the cache file does not exist
Quick file existence check
Minimal impact on evaluation speed

Cache Write Performance

Serialization after batch completion
Asynchronous writing possible
Monitor for serialization bottlenecks
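Asynchronous writing could be prototyped with a background thread, as sketched below. This is an illustration of the idea only, not a description of how the framework writes its cache; `write_pickle` is a hypothetical helper and `pickle` stands in for dill.

```python
import os
import pickle  # stand-in for dill in this sketch
import tempfile
from concurrent.futures import ThreadPoolExecutor


def write_pickle(path: str, obj) -> str:
    """Serialize obj to path and return the path once the write is done."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle)
    return path


demo_dir = tempfile.mkdtemp()
path = os.path.join(demo_dir, "model_task.pickle")

with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(write_pickle, path, [["response"]])
    # The evaluation loop could continue with the next batch here.
    written_path = future.result()  # blocks until the write has finished
```

A single worker keeps writes ordered. The object handed to the background thread should not be mutated until the future completes.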

Best Practices

Use descriptive cache keys (model_task format)
Document what triggers cache invalidation
Provide cache clearing utilities
Log cache hits/misses for monitoring
Handle serialization failures gracefully
Consider cache size limits
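Clear the cache when requests or responses may be stale (see Cache Invalidation); metric-only changes are recomputed from cached responses and do not need a clear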
Include cache strategy in documentation
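Hit/miss logging can be added with a thin wrapper around whatever loader is in use. In the sketch below, `load_with_logging`, `stats`, and the dictionary-backed `fake_store` are hypothetical and exist only to make the example self-contained.

```python
import logging
from typing import Any, Callable, Dict, Optional

logger = logging.getLogger("cache_demo")
stats = {"hits": 0, "misses": 0}


def load_with_logging(key: str, loader: Callable[[str], Optional[Any]]) -> Optional[Any]:
    """Call loader(key), then record and log whether it was a hit or a miss."""
    result = loader(key)
    if result is not None:
        stats["hits"] += 1
        logger.info("cache hit: %s", key)
    else:
        stats["misses"] += 1
        logger.info("cache miss: %s", key)
    return result


# A dict's .get returns None for unknown keys, mimicking a cache miss.
fake_store: Dict[str, Any] = {"qwen25vl_videomme": [["response"]]}
load_with_logging("qwen25vl_videomme", fake_store.get)
load_with_logging("qwen25vl_mme", fake_store.get)
```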

Limitations

Current Limitations

No automatic cache invalidation
No cache size limits
No cache compression
Manual cleanup required
Callable arguments replaced with None

Future Improvements

Automatic invalidation on task changes
LRU cache eviction
Compression for large caches
Better callable serialization
Cache statistics and monitoring

Integration Points

Command-Line Interface

# Clear cache before run
python -m lmms_eval --model qwen25vl --tasks videomme --clear-cache

# Run without caching
python -m lmms_eval --model qwen25vl --tasks videomme --no-cache

Programmatic Usage

from lmms_eval.caching.cache import load_from_cache, save_to_cache

# Inside the evaluation function
if use_cache:
    cached = load_from_cache(cache_key)
    # Compare against None so an empty cached result still counts as a hit
    if cached is not None:
        return cached

results = evaluate(...)

if use_cache:
    save_to_cache(cache_key, results)

return results

Related Pages

Implementations

EvolvingLMMs_Lab_Lmms_eval_Cache_Utils — core caching utilities
Implementation:EvolvingLMMs_Lab_Lmms_eval_Cache_Utils
