Principle:EvolvingLMMs Lab Lmms eval Response Caching
| Knowledge Sources | |
|---|---|
| Domains | Caching, Performance Optimization |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Response Caching provides disk-based caching of evaluation requests and responses to avoid redundant computation when re-running evaluations. This principle establishes how the framework serializes, stores, and retrieves cached evaluation data to improve efficiency and enable iterative development.
Theoretical Basis
Cache Purpose
Caching serves several key purposes:
- Avoid Redundant Inference: Skip expensive model inference for previously seen inputs
- Iterative Development: Test metric changes without re-running inference
- Cost Reduction: Minimize API calls for expensive models
- Reproducibility: Store exact model responses for analysis
- Debugging: Inspect cached responses without re-evaluation
Cache Scope
What gets cached:
- Request Context: Input text, images, videos, and other media
- Request Arguments: Model generation parameters
- Model Responses: Raw model outputs before filtering
- Task Context: Enough information to reconstruct evaluation
What does not get cached:
- Metric Computations: Computed fresh each time
- Aggregations: Recalculated from cached responses
- Filter Applications: Applied to cached responses
Cache Organization
Cache structure:
- Cache Directory: Configurable location (default: module/.cache/)
- File Names: Based on task/model identifiers
- File Suffix: Hash-based suffix for uniqueness
- Format: Pickled Python objects using dill
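The layout above can be sketched as a small path-building helper. This is a minimal illustration, not the framework's code: the function names, the `module_dir` parameter, and the `HASH_INPUT` constant are hypothetical, and the real suffix may be derived differently. Only the `LM_HARNESS_CACHE_PATH` variable and the `.{hash}.pickle` shape come from this page.

```python
import hashlib
import os

# Assumption: the suffix is a fixed SHA-256 digest of a constant string.
# The constant below is a placeholder, not the value lmms_eval uses.
HASH_INPUT = "example-hash-input"
HASH_SUFFIX = hashlib.sha256(HASH_INPUT.encode("utf-8")).hexdigest()
FILE_SUFFIX = f".{HASH_SUFFIX}.pickle"


def cache_dir(module_dir: str) -> str:
    """Return the cache directory, preferring the environment override."""
    override = os.environ.get("LM_HARNESS_CACHE_PATH")
    if override:
        return override
    return os.path.join(module_dir, ".cache")


def cache_path(key: str, module_dir: str = ".") -> str:
    """Build the full file path for a cache key such as 'model_task'."""
    return os.path.join(cache_dir(module_dir), f"{key}{FILE_SUFFIX}")
```

Keeping the suffix constant across runs means the same key always maps to the same file, so a later run can find what an earlier run wrote.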
Serialization Strategy
Handling non-serializable objects:
- Callable Detection: Identify functions and methods
- Argument Sanitization: Replace callables with None in arguments
- Fallback Handling: Convert to serializable alternatives
- Logging: Debug info for cache operations
Design Patterns
Cache Storage
- Directory Management: Create cache directory if needed
- File Naming: Consistent naming with hash suffix
- Pickle Protocol: Use dill for enhanced serialization
- Error Handling: Graceful fallback for serialization failures
Cache Retrieval
- File Existence Check: Quick check before loading
- Deserialization: Load pickled objects with dill
- Validation: Ensure cached data is compatible
- Miss Handling: Return None on cache miss
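The storage and retrieval patterns above can be combined into one round trip. The sketch below is illustrative only: `save_to_dir` and `load_from_dir` are hypothetical helpers, `pickle` is used in place of `dill`, the suffix omits the hash, and treating an unreadable file as a miss is a design assumption rather than documented behavior.

```python
import os
import pickle
import tempfile
from typing import Any, Optional

FILE_SUFFIX = ".pickle"  # simplified; the real suffix also embeds a hash


def save_to_dir(cache_dir: str, key: str, obj: Any) -> None:
    """Create the cache directory if needed and write `obj` under `key`."""
    os.makedirs(cache_dir, exist_ok=True)
    with open(os.path.join(cache_dir, key + FILE_SUFFIX), "wb") as fh:
        pickle.dump(obj, fh)


def load_from_dir(cache_dir: str, key: str) -> Optional[Any]:
    """Return the cached object, or None on a miss or unreadable file."""
    path = os.path.join(cache_dir, key + FILE_SUFFIX)
    if not os.path.exists(path):
        return None  # cache miss: nothing was stored under this key
    try:
        with open(path, "rb") as fh:
            return pickle.load(fh)
    except (pickle.UnpicklingError, EOFError):
        return None  # assumption: a corrupt file is treated as a miss


# Usage: round trip through a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    save_to_dir(tmp, "model_task", [["response A"], ["response B"]])
    restored = load_from_dir(tmp, "model_task")
    missing = load_from_dir(tmp, "other_task")
```

Because a miss is signaled with None, callers should compare against None explicitly so that an empty cached list is not mistaken for a miss.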
Cache Management
- Selective Deletion: Remove specific cached tasks
- Pattern Matching: Delete by key prefix
- Cache Invalidation: Clear cache when needed
- Environment Override: Custom cache path via env var
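Prefix-based deletion, as described above, can be sketched in a few lines. `delete_by_prefix` is a hypothetical helper written for illustration, with a simplified `.pickle` suffix. It is not the framework's `delete_cache`.

```python
import os
import tempfile
from typing import List


def delete_by_prefix(cache_dir: str, key: str = "", suffix: str = ".pickle") -> List[str]:
    """Delete cache files whose name starts with `key`; return their names.

    An empty key matches every cache file, which clears the whole cache.
    Files without the cache suffix are never touched.
    """
    removed: List[str] = []
    if not os.path.isdir(cache_dir):
        return removed
    for name in sorted(os.listdir(cache_dir)):
        if name.startswith(key) and name.endswith(suffix):
            os.unlink(os.path.join(cache_dir, name))
            removed.append(name)
    return removed


# Usage: remove only the entries for one model.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("qwen25vl_videomme.pickle", "qwen25vl_mme.pickle", "other_mme.pickle"):
        open(os.path.join(tmp, name), "wb").close()
    removed = delete_by_prefix(tmp, key="qwen25vl")
    remaining = sorted(os.listdir(tmp))
```

Since matching is by prefix, the order of components in the key decides what can be deleted selectively: with `model_task` keys, a model name works as a prefix but a task name does not.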
Cache Invalidation
When to invalidate cache:
- Model version changes
- Task definition changes
- Request construction changes
- Different random seeds (if applicable)
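One way to make these triggers take effect without manual clearing is to fold them into the cache key, so any change produces a new key and the stale file is simply never read. The sketch below is a possible approach, not something the framework is documented to do. `fingerprinted_key` and all of its parameters are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict, Optional


def fingerprinted_key(
    model_name: str,
    task_name: str,
    model_version: str,
    task_config: Dict[str, Any],
    seed: Optional[int] = None,
) -> str:
    """Build a cache key that changes whenever an invalidation trigger changes."""
    payload = json.dumps(
        {"model_version": model_version, "task_config": task_config, "seed": seed},
        sort_keys=True,  # make the digest independent of dict ordering
        default=str,     # fall back to str() for non-JSON values
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
    return f"{model_name}_{task_name}_{digest}"


# Usage: a new model version yields a different key, hence a cache miss.
key_v1 = fingerprinted_key("qwen25vl", "videomme", "v1", {"max_new_tokens": 16})
key_v2 = fingerprinted_key("qwen25vl", "videomme", "v2", {"max_new_tokens": 16})
```

Old files are not removed by this scheme, so it still needs to be paired with periodic cleanup.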
Storage Efficiency
- Use binary pickle format for space efficiency
- Consider compression for large caches
- Monitor cache directory size
- Implement cleanup strategies
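Compression and size monitoring can both be done with the standard library. The following is a sketch of one option, assuming gzip-wrapped pickles. The helper names are hypothetical, and the Limitations section notes that the framework does not currently compress its cache.

```python
import gzip
import os
import pickle
import tempfile
from typing import Any


def save_compressed(path: str, obj: Any) -> None:
    """Pickle `obj` into a gzip-compressed file."""
    with gzip.open(path, "wb") as fh:
        pickle.dump(obj, fh, protocol=pickle.HIGHEST_PROTOCOL)


def load_compressed(path: str) -> Any:
    """Load an object written by `save_compressed`."""
    with gzip.open(path, "rb") as fh:
        return pickle.load(fh)


def directory_size_bytes(cache_dir: str) -> int:
    """Sum the sizes of all files under `cache_dir`."""
    total = 0
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


# Usage: store repetitive text responses and check the directory size.
with tempfile.TemporaryDirectory() as tmp:
    responses = [f"response {i}" for i in range(1000)]
    target = os.path.join(tmp, "model_task.pickle.gz")
    save_compressed(target, responses)
    round_trip = load_compressed(target)
    size_on_disk = directory_size_bytes(tmp)
    raw_size = len(pickle.dumps(responses, protocol=pickle.HIGHEST_PROTOCOL))
```

Text responses tend to compress well, while already-compressed media (JPEG frames, encoded video) gain little, so the benefit depends on what the requests carry.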
Serialization Robustness
- Handle diverse object types (tensors, images, etc.)
- Gracefully handle non-serializable items
- Preserve enough context for reconstruction
- Log serialization issues for debugging
Cache Location
- Default to module directory for isolation
- Support environment variable override
- Consider user permissions
- Document cache location clearly
Usage Examples
Basic Cache Usage
from lmms_eval.caching.cache import load_from_cache, save_to_cache

# Try loading from cache
cache_key = f"{model_name}_{task_name}"
cached_results = load_from_cache(cache_key)
if cached_results is not None:
    # Use cached results
    requests = cached_results
else:
    # Run evaluation
    requests = run_evaluation(model, task)
    # Save to cache
    save_to_cache(cache_key, requests)
Environment Variables
# Override cache location
export LM_HARNESS_CACHE_PATH=/custom/cache/path
# Run evaluation (uses custom cache)
python -m lmms_eval --model qwen25vl --tasks videomme
Cache Management
from lmms_eval.caching.cache import delete_cache
# Delete all cache files
delete_cache()
# Delete cache files whose key starts with "videomme"
delete_cache(key="videomme")
# Delete cache files whose key starts with "qwen25vl"
delete_cache(key="qwen25vl")
Development Workflow
# First run: populate cache
python -m lmms_eval --model qwen25vl --tasks videomme
# Modify metrics/filters
# edit task YAML
# Second run: uses cache, recomputes metrics only
python -m lmms_eval --model qwen25vl --tasks videomme
A/B Testing Metrics
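# Pseudocode: evaluate() and compute_metrics() are placeholders
# Run once to cache responses
evaluate(model, tasks)
# Load the cached responses
cached_responses = load_from_cache(cache_key)
# Test metric variant A
metric_results_a = compute_metrics(cached_responses, metric_a)
# Test metric variant B (no re-inference)
metric_results_b = compute_metrics(cached_responses, metric_b)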
Debugging
# Load cached requests for inspection
cached = load_from_cache("model_task")
for request_group in cached:
    for request in request_group:
        print(f"Input: {request.arguments}")
        print(f"Response: {request.response}")
Cache File Format
File Structure
.cache/
├── model_task.{hash}.pickle
├── another_task.{hash}.pickle
└── ...
Cached Object Structure
[
    [  # Request group 1
        Request(arguments=(...), response="...", ...),
        Request(arguments=(...), response="...", ...),
    ],
    [  # Request group 2
        Request(arguments=(...), response="...", ...),
    ],
]
Performance Considerations
Cache Hit Performance
- Deserialization typically faster than inference
- Pickle loading is I/O bound
- Consider SSD for cache storage
- Monitor cache file sizes
Cache Miss Performance
- Negligible overhead when no cache file exists
- Only a quick file existence check is performed
- Minimal impact on evaluation speed
Cache Write Performance
- Serialization after batch completion
- Asynchronous writing possible
- Monitor for serialization bottlenecks
Best Practices
- Use descriptive cache keys (model_task format)
- Document what triggers cache invalidation
- Provide cache clearing utilities
- Log cache hits/misses for monitoring
- Handle serialization failures gracefully
- Consider cache size limits
- Clear the cache when debugging changes that affect requests or responses; metric-only changes can reuse it
- Include cache strategy in documentation
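Several of these practices (explicit keys, hit/miss logging, graceful handling of misses) fit in a thin wrapper around whatever load and save functions are in use. The sketch below is hypothetical: `get_or_compute` and the `stats` dictionary are illustrative names, and an in-memory dict stands in for the on-disk cache.

```python
import logging
from typing import Any, Callable, Dict, Optional

logger = logging.getLogger("cache_sketch")


def get_or_compute(
    key: str,
    compute: Callable[[], Any],
    load: Callable[[str], Optional[Any]],
    save: Callable[[str, Any], None],
    stats: Dict[str, int],
) -> Any:
    """Return the cached value for `key`, computing and storing it on a miss."""
    cached = load(key)
    if cached is not None:  # explicit None check: an empty list is still a hit
        stats["hits"] += 1
        logger.info("cache hit: %s", key)
        return cached
    stats["misses"] += 1
    logger.info("cache miss: %s", key)
    result = compute()
    save(key, result)
    return result


# Usage with an in-memory stand-in for the disk cache.
store: Dict[str, Any] = {}
stats = {"hits": 0, "misses": 0}
first = get_or_compute("model_task", lambda: ["resp"], store.get, store.__setitem__, stats)
second = get_or_compute("model_task", lambda: ["recomputed"], store.get, store.__setitem__, stats)
```

Passing `load` and `save` in as parameters keeps the wrapper testable without touching the file system.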
Limitations
Current Limitations
- No automatic cache invalidation
- No cache size limits
- No cache compression
- Manual cleanup required
- Callable arguments replaced with None
Future Improvements
- Automatic invalidation on task changes
- LRU cache eviction
- Compression for large caches
- Better callable serialization
- Cache statistics and monitoring
Integration Points
Command-Line Interface
# Clear cache before run
python -m lmms_eval --model qwen25vl --tasks videomme --clear-cache
# Run without caching
python -m lmms_eval --model qwen25vl --tasks videomme --no-cache
Programmatic Usage
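from lmms_eval.caching.cache import load_from_cache, save_to_cache, delete_cache

# In evaluation loop (wrapped in a function so that `return` is valid)
def evaluate_with_cache(cache_key, use_cache=True):
    if use_cache:
        cached = load_from_cache(cache_key)
        # Compare against None so an empty cached result still counts as a hit
        if cached is not None:
            return cached
    results = evaluate(...)
    if use_cache:
        save_to_cache(cache_key, results)
    return results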
Related Pages
Implementations
- EvolvingLMMs_Lab_Lmms_eval_Cache_Utils — core caching utilities
See Also
- EvolvingLMMs_Lab_Lmms_eval_Model_Inference — caching occurs after model inference
- EvolvingLMMs_Lab_Lmms_eval_Request_Construction — request structure determines cache key
- EvolvingLMMs_Lab_Lmms_eval_Post_Processing_and_Metrics — metrics computed from cached responses