

Principle:EvolvingLMMs Lab Lmms eval Response Caching

From Leeroopedia
Knowledge Sources
Domains Caching, Performance Optimization
Last Updated 2026-02-14 00:00 GMT

Overview

Response Caching provides disk-based caching of evaluation requests and responses to avoid redundant computation when re-running evaluations. This principle establishes how the framework serializes, stores, and retrieves cached evaluation data to improve efficiency and enable iterative development.

Theoretical Basis

Cache Purpose

Caching serves several key purposes:

  • Avoid Redundant Inference: Skip expensive model inference for previously seen inputs
  • Iterative Development: Test metric changes without re-running inference
  • Cost Reduction: Minimize API calls for expensive models
  • Reproducibility: Store exact model responses for analysis
  • Debugging: Inspect cached responses without re-evaluation

Cache Scope

What gets cached:

  • Request Context: Input text, images, videos, and other media
  • Request Arguments: Model generation parameters
  • Model Responses: Raw model outputs before filtering
  • Task Context: Enough information to reconstruct evaluation

What does not get cached:

  • Metric Computations: Computed fresh each time
  • Aggregations: Recalculated from cached responses
  • Filter Applications: Applied to cached responses

Cache Organization

Cache structure:

  • Cache Directory: Configurable location (default: module/.cache/)
  • File Names: Based on task/model identifiers
  • File Suffix: Hash-based suffix for uniqueness
  • Format: Pickled Python objects using dill
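The naming scheme above can be sketched as follows. This is a minimal illustration, not the framework's actual code: the function name `cache_file_path` and the string fed to the hash are assumptions made for the example, and the real suffix may be derived differently.

```python
import hashlib
import os

# Assumption: the suffix is the SHA-256 hex digest of a fixed string.
# The string below is a placeholder, not the framework's real input.
HASH_INPUT = "example-eval-framework"
FILE_SUFFIX = "." + hashlib.sha256(HASH_INPUT.encode("utf-8")).hexdigest() + ".pickle"


def cache_file_path(cache_dir: str, key: str) -> str:
    """Return the path of the cache file for `key` inside `cache_dir` (hypothetical helper)."""
    return os.path.join(cache_dir, key + FILE_SUFFIX)


path = cache_file_path(".cache", "qwen25vl_videomme")
```

Because the suffix is constant for a given hash input, files written by the cache can be told apart from unrelated files that happen to live in the same directory.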

Serialization Strategy

Handling non-serializable objects:

  • Callable Detection: Identify functions and methods
  • Argument Sanitization: Replace callables with None in arguments
  • Fallback Handling: Convert to serializable alternatives
  • Logging: Debug info for cache operations
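A minimal sketch of the sanitization step described above, assuming arguments are built from plain tuples, lists, and dicts. The helper name `sanitize` is hypothetical, and the standard-library `pickle` stands in for dill so the example has no extra dependency.

```python
import pickle
from typing import Any


def sanitize(value: Any) -> Any:
    """Recursively replace callables with None so the result can be pickled.

    Note: classes are callable too, so they are also replaced by this sketch.
    """
    if callable(value):
        return None
    if isinstance(value, dict):
        return {k: sanitize(v) for k, v in value.items()}
    if isinstance(value, list):
        return [sanitize(v) for v in value]
    if isinstance(value, tuple):
        return tuple(sanitize(v) for v in value)
    return value


# Illustrative request arguments: a prompt plus generation parameters,
# one of which is a lambda that plain pickle cannot serialize.
arguments = ("Describe the video.", {"max_new_tokens": 64, "stop_fn": lambda s: "\n" in s})
clean = sanitize(arguments)
restored = pickle.loads(pickle.dumps(clean))
```

The trade-off is that the callable is lost: code that reads the cache back has to tolerate `None` in places where a function was originally passed.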

Design Patterns

Cache Storage

  • Directory Management: Create cache directory if needed
  • File Naming: Consistent naming with hash suffix
  • Pickle Protocol: Use dill for enhanced serialization
  • Error Handling: Graceful fallback for serialization failures
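The storage pattern can be sketched roughly as below. `save_sketch` is a hypothetical name, not the framework's `save_to_cache`, and the standard-library `pickle` is used in place of dill so the example runs anywhere.

```python
import logging
import os
import pickle
import tempfile

logger = logging.getLogger(__name__)


def save_sketch(path: str, obj: object) -> bool:
    """Pickle `obj` to `path`; return False instead of raising if it cannot be serialized."""
    # Create the cache directory if needed.
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    try:
        # Serialize in memory first so a failure leaves no partial file on disk.
        payload = pickle.dumps(obj)
    except Exception as exc:  # the exception type varies with the offending object
        logger.debug("Could not serialize cache entry %s: %s", path, exc)
        return False
    with open(path, "wb") as fh:
        fh.write(payload)
    return True


demo_dir = tempfile.mkdtemp()
good_path = os.path.join(demo_dir, "nested", "model_task.pickle")
bad_path = os.path.join(demo_dir, "bad.pickle")

ok = save_sketch(good_path, [["response"]])
failed = save_sketch(bad_path, lambda x: x)  # plain pickle cannot serialize a lambda
```

Returning a boolean keeps a serialization failure from aborting an evaluation run; the caller simply proceeds without a cache entry.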

Cache Retrieval

  • File Existence Check: Quick check before loading
  • Deserialization: Load pickled objects with dill
  • Validation: Ensure cached data is compatible
  • Miss Handling: Return None on cache miss
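A rough sketch of the retrieval pattern, under the assumption that a valid entry is a list of request groups. The name `load_sketch` and the validation rule are illustrative only, and the standard-library `pickle` replaces dill here.

```python
import os
import pickle
import tempfile
from typing import Any, Optional


def load_sketch(path: str) -> Optional[Any]:
    """Return the cached object, or None on a miss or an unreadable file."""
    if not os.path.exists(path):  # quick existence check before any I/O
        return None
    try:
        with open(path, "rb") as fh:
            data = pickle.load(fh)
    except (pickle.UnpicklingError, EOFError, AttributeError, ImportError):
        return None
    # Validation assumption: entries are expected to be a list of request groups.
    return data if isinstance(data, list) else None


demo_dir = tempfile.mkdtemp()
good_path = os.path.join(demo_dir, "good.pickle")
with open(good_path, "wb") as fh:
    pickle.dump([["response A"], ["response B"]], fh)

empty_path = os.path.join(demo_dir, "empty.pickle")
open(empty_path, "wb").close()  # zero-byte file, simulating an interrupted write

hit = load_sketch(good_path)
miss = load_sketch(os.path.join(demo_dir, "absent.pickle"))
corrupt = load_sketch(empty_path)
```

Unpickling executes code embedded in the file, so cache files should only be loaded from locations the user controls.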

Cache Management

  • Selective Deletion: Remove specific cached tasks
  • Pattern Matching: Delete by key prefix
  • Cache Invalidation: Clear cache when needed
  • Environment Override: Custom cache path via env var
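Prefix-based deletion might look like the sketch below. `delete_by_prefix` is a hypothetical stand-in for the framework's `delete_cache`, and the file names in the demo are invented for illustration.

```python
import os
import tempfile
from typing import List


def delete_by_prefix(cache_dir: str, key: str = "", suffix: str = ".pickle") -> List[str]:
    """Delete cache files whose names start with `key`; an empty key matches every cache file."""
    removed = []
    for name in sorted(os.listdir(cache_dir)):
        # Only touch files that carry the cache suffix.
        if name.startswith(key) and name.endswith(suffix):
            os.remove(os.path.join(cache_dir, name))
            removed.append(name)
    return removed


demo_dir = tempfile.mkdtemp()
for name in [
    "qwen25vl_videomme.abc.pickle",
    "qwen25vl_mme.abc.pickle",
    "llava_mme.abc.pickle",
    "notes.txt",
]:
    open(os.path.join(demo_dir, name), "wb").close()

removed_model = delete_by_prefix(demo_dir, key="qwen25vl")  # one model's entries
removed_rest = delete_by_prefix(demo_dir)                   # everything that is left
remaining = os.listdir(demo_dir)
```

Since matching is by prefix, whatever identifier comes first in the key (the model name in a `model_task` scheme) is the natural unit for selective deletion.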

Cache Invalidation

When to invalidate cache:

  • Model version changes
  • Task definition changes
  • Request construction changes
  • Different random seeds (if applicable)
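One way to make these triggers explicit is to fold them into the cache key, so that a change produces a new key instead of a stale hit. The sketch below is one possible approach, not something the framework is stated to do; `cache_fingerprint` and its parameters are hypothetical.

```python
import hashlib
import json


def cache_fingerprint(model_version: str, task_config: dict, seed: int) -> str:
    """Return a short digest that changes whenever any invalidation trigger changes."""
    payload = json.dumps(
        {"model": model_version, "task": task_config, "seed": seed},
        sort_keys=True,  # key order must not affect the digest
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


base = cache_fingerprint("qwen25vl-v1", {"task": "videomme", "max_frames": 32}, seed=0)
same = cache_fingerprint("qwen25vl-v1", {"max_frames": 32, "task": "videomme"}, seed=0)
changed = cache_fingerprint("qwen25vl-v1", {"task": "videomme", "max_frames": 64}, seed=0)

# A key such as f"qwen25vl_videomme_{base}" would then miss after a config change.
```

This only works for task configurations that are JSON-serializable; anything else would need to be reduced to a stable string first.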

Storage Efficiency

  • Use binary pickle format for space efficiency
  • Consider compression for large caches
  • Monitor cache directory size
  • Implement cleanup strategies
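The compression and size-monitoring suggestions could be sketched as follows, using only the standard library. All function names are hypothetical, and `pickle` again stands in for dill.

```python
import gzip
import os
import pickle
import tempfile


def save_compressed(path: str, obj: object) -> None:
    """Pickle `obj` through a gzip stream."""
    with gzip.open(path, "wb") as fh:
        pickle.dump(obj, fh, protocol=pickle.HIGHEST_PROTOCOL)


def load_compressed(path: str) -> object:
    """Inverse of save_compressed."""
    with gzip.open(path, "rb") as fh:
        return pickle.load(fh)


def directory_size_bytes(cache_dir: str) -> int:
    """Total size of the regular files directly inside `cache_dir`."""
    paths = (os.path.join(cache_dir, name) for name in os.listdir(cache_dir))
    return sum(os.path.getsize(p) for p in paths if os.path.isfile(p))


demo_dir = tempfile.mkdtemp()
# Synthetic, repetitive responses; real model outputs will compress less predictably.
responses = [[f"Sample {i}: the answer is B. " * 20] for i in range(200)]

raw_path = os.path.join(demo_dir, "raw.pickle")
with open(raw_path, "wb") as fh:
    pickle.dump(responses, fh, protocol=pickle.HIGHEST_PROTOCOL)

gz_path = os.path.join(demo_dir, "compressed.pickle.gz")
save_compressed(gz_path, responses)

raw_size = os.path.getsize(raw_path)
gz_size = os.path.getsize(gz_path)
total = directory_size_bytes(demo_dir)
```

Compression trades CPU time on every read and write for disk space, so it tends to pay off for large text-heavy caches and matters less for small ones.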

Serialization Robustness

  • Handle diverse object types (tensors, images, etc.)
  • Gracefully handle non-serializable items
  • Preserve enough context for reconstruction
  • Log serialization issues for debugging

Cache Location

  • Default to module directory for isolation
  • Support environment variable override
  • Consider user permissions
  • Document cache location clearly
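Resolving the cache location can be sketched as below. The environment variable name is the one used elsewhere on this page; the helper `resolve_cache_dir` and the example paths are assumptions for illustration.

```python
import os
from typing import Mapping

ENV_VAR = "LM_HARNESS_CACHE_PATH"


def resolve_cache_dir(module_dir: str, environ: Mapping[str, str] = os.environ) -> str:
    """Return the env-var override if it is set and non-empty, else <module_dir>/.cache."""
    override = environ.get(ENV_VAR)
    return override if override else os.path.join(module_dir, ".cache")


# Passing a plain dict instead of os.environ keeps the demo free of side effects.
default_dir = resolve_cache_dir("/opt/pkg", environ={})
custom_dir = resolve_cache_dir("/opt/pkg", environ={ENV_VAR: "/custom/cache/path"})
```

Defaulting to the module directory can fail on read-only installs, which is where the override becomes necessary.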

Usage Examples

Basic Cache Usage

from lmms_eval.caching.cache import load_from_cache, save_to_cache

# Try loading from cache
cache_key = f"{model_name}_{task_name}"
cached_results = load_from_cache(cache_key)

if cached_results is not None:
    # Use cached results
    requests = cached_results
else:
    # Run evaluation
    requests = run_evaluation(model, task)

    # Save to cache
    save_to_cache(cache_key, requests)

Environment Variables

# Override cache location
export LM_HARNESS_CACHE_PATH=/custom/cache/path

# Run evaluation (uses custom cache)
python -m lmms_eval --model qwen25vl --tasks videomme

Cache Management

from lmms_eval.caching.cache import delete_cache

# Delete all cache files
delete_cache()

# Delete all cache files for one model (keys are matched by prefix)
delete_cache(key="qwen25vl")

# Delete the cache for one model/task pair
delete_cache(key="qwen25vl_videomme")

Development Workflow

# First run: populate cache
python -m lmms_eval --model qwen25vl --tasks videomme

# Modify metrics/filters
# edit task YAML

# Second run: uses cache, recomputes metrics only
python -m lmms_eval --model qwen25vl --tasks videomme

A/B Testing Metrics

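# evaluate() and compute_metrics() are placeholders for your own pipeline
from lmms_eval.caching.cache import load_from_cache

# Run once to cache responses
evaluate(model, tasks)

# Load the cached responses (same key that was used when saving)
cached_responses = load_from_cache(cache_key)

# Test metric variant A
metric_results_a = compute_metrics(cached_responses, metric_a)

# Test metric variant B (no re-inference)
metric_results_b = compute_metrics(cached_responses, metric_b)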

Debugging

# Load cached requests for inspection (None is returned on a cache miss)
cached = load_from_cache("model_task")

for request_group in cached or []:
    for request in request_group:
        print(f"Input: {request.arguments}")
        print(f"Response: {request.response}")

Cache File Format

File Structure

.cache/
├── model_task.{hash}.pickle
├── another_task.{hash}.pickle
└── ...

Cached Object Structure

[
    [  # Request group 1
        Request(arguments=(...), response="...", ...),
        Request(arguments=(...), response="...", ...),
    ],
    [  # Request group 2
        Request(arguments=(...), response="...", ...),
    ],
]

Performance Considerations

Cache Hit Performance

  • Deserialization typically faster than inference
  • Pickle loading is I/O bound
  • Consider SSD for cache storage
  • Monitor cache file sizes

Cache Miss Performance

  • Overhead limited to a quick file existence check when no cache file exists
  • Minimal impact on evaluation speed

Cache Write Performance

  • Serialization after batch completion
  • Asynchronous writing possible
  • Monitor for serialization bottlenecks
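Asynchronous writing, mentioned above as a possibility, could be sketched with a single background worker. This is an illustration of the idea, not a description of the framework; `write_pickle` is a hypothetical helper and `pickle` stands in for dill.

```python
import os
import pickle
import tempfile
from concurrent.futures import ThreadPoolExecutor


def write_pickle(path: str, obj: object) -> str:
    """Write `obj` to `path` and return the path."""
    with open(path, "wb") as fh:
        pickle.dump(obj, fh)
    return path


demo_dir = tempfile.mkdtemp()
batch_results = [["response 1"], ["response 2"]]

with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(write_pickle, os.path.join(demo_dir, "batch0.pickle"), batch_results)
    # The main thread is free to start the next batch here (placeholder work).
    next_batch_size = len(batch_results)
    written_path = future.result()  # re-raises any exception raised in the worker

with open(written_path, "rb") as fh:
    reloaded = pickle.load(fh)
```

The object handed to the worker must not be mutated until the write finishes, otherwise the file may capture a half-updated state.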

Best Practices

  • Use descriptive cache keys (model_task format)
  • Document what triggers cache invalidation
  • Provide cache clearing utilities
  • Log cache hits/misses for monitoring
  • Handle serialization failures gracefully
  • Consider cache size limits
  • Clear cache when debugging changes to request construction; metric changes alone do not require it
  • Include cache strategy in documentation

Limitations

Current Limitations

  • No automatic cache invalidation
  • No cache size limits
  • No cache compression
  • Manual cleanup required
  • Callable arguments replaced with None

Future Improvements

  • Automatic invalidation on task changes
  • LRU cache eviction
  • Compression for large caches
  • Better callable serialization
  • Cache statistics and monitoring

Integration Points

Command-Line Interface

# Clear cache before run
python -m lmms_eval --model qwen25vl --tasks videomme --clear-cache

# Run without caching
python -m lmms_eval --model qwen25vl --tasks videomme --no-cache

Programmatic Usage

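from lmms_eval.caching.cache import load_from_cache, save_to_cache

def evaluate_with_cache(cache_key, use_cache=True):
    # In evaluation loop: return the cached result on a hit
    if use_cache:
        cached = load_from_cache(cache_key)
        if cached is not None:  # an empty result is still a valid hit
            return cached

    results = evaluate(...)

    if use_cache:
        save_to_cache(cache_key, results)
    return results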
