Heuristic:EvolvingLMMs Lab Lmms eval Memory Cleanup After Inference
| Knowledge Sources | |
|---|---|
| Domains | GPU_Computing, Optimization |
| Last Updated | 2026-02-14 00:00 GMT |
Overview
Explicit garbage collection and CUDA cache clearing after inference batches, used to prevent out-of-memory (OOM) errors during long evaluation runs.
Description
Large multimodal models can accumulate significant GPU memory fragments during inference. PyTorch's CUDA caching allocator keeps freed memory for reuse within the same process, but that cached memory is not available to other processes or to non-PyTorch allocations, and fragmented cached blocks may not satisfy a large request. The lmms-eval framework explicitly calls gc.collect() followed by torch.cuda.empty_cache() at strategic points to reclaim this memory. This is particularly important when evaluating multiple tasks sequentially or when running on GPUs with limited VRAM.
Usage
This heuristic is automatically applied by the framework after inference. It is especially relevant when:
- Evaluating multiple tasks in a single run
- Running on GPUs with limited VRAM (e.g., 16-24GB)
- Processing video/multimodal inputs that temporarily consume large buffers
The Insight (Rule of Thumb)
- Action: Call gc.collect() then torch.cuda.empty_cache() after model inference and between tasks.
- Value: Can reclaim hundreds of MB to several GB of fragmented GPU memory.
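- Trade-off: Small CPU overhead for garbage collection, and the first allocations after a cache clear are slower because PyTorch must request memory from the driver again, but this prevents catastrophic OOM errors mid-evaluation.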
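The rule of thumb can be sketched as a small helper. This is a minimal illustration, not the framework's own code: the names cleanup_memory and run_tasks are hypothetical, and the optional import of torch is an assumption made so the sketch also runs on machines without PyTorch or a GPU.

```python
import gc

try:
    import torch
except ImportError:  # assumption: allow the sketch to run without PyTorch
    torch = None


def cleanup_memory() -> int:
    """Run a garbage-collection pass, then release unused cached CUDA blocks.

    Returns the number of unreachable objects found by the collector.
    (Hypothetical helper; not part of lmms-eval.)
    """
    # Collect first so tensors kept alive only by reference cycles are freed
    # and their memory returns to PyTorch's cache.
    collected = gc.collect()
    # Then hand unused cached blocks back to the CUDA driver.
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()
    return collected


def run_tasks(tasks):
    """Run zero-argument callables in order, cleaning up after each one."""
    results = []
    for task in tasks:
        results.append(task())
        cleanup_memory()
    return results
```

The order matters: calling torch.cuda.empty_cache() before gc.collect() would leave memory held by not-yet-collected tensors in use, so it could not be released.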
Reasoning
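PyTorch's CUDA caching allocator pools freed GPU memory to avoid expensive cudaMalloc calls. While this improves throughput, it means that memory PyTorch has already freed internally still appears as used in nvidia-smi, because the allocator holds on to it. During multi-task evaluation, cached blocks left over from one task can be fragmented or poorly sized for the next task's allocations, and they are unavailable to anything outside the PyTorch allocator. Running gc.collect() first drops unreachable Python objects (including reference cycles) that still hold tensors, so their memory returns to the cache. torch.cuda.empty_cache() then releases the unused cached blocks back to the CUDA driver, making them available for new allocations.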
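The gap between memory in use by tensors and memory held by the allocator can be observed with torch.cuda.memory_allocated() and torch.cuda.memory_reserved(). The sketch below is illustrative only: cuda_memory_report and demo_cache_release are hypothetical names, the tensor size is arbitrary, and both functions return None when PyTorch or CUDA is unavailable.

```python
import gc

try:
    import torch
except ImportError:  # assumption: allow the sketch to run without PyTorch
    torch = None


def cuda_memory_report(device: int = 0):
    """Return bytes used by live tensors and bytes held by the allocator."""
    if torch is None or not torch.cuda.is_available():
        return None
    return {
        "allocated": torch.cuda.memory_allocated(device),  # live tensors
        "reserved": torch.cuda.memory_reserved(device),    # live + cached
    }


def demo_cache_release(device: int = 0, n_floats: int = 64 * 1024 * 1024):
    """Allocate a tensor, free it, and report memory before and after
    clearing the cache. (Hypothetical demo; sizes are arbitrary.)"""
    if torch is None or not torch.cuda.is_available():
        return None
    x = torch.empty(n_floats, dtype=torch.float32, device=f"cuda:{device}")
    live = cuda_memory_report(device)
    del x  # the block returns to PyTorch's cache, not to the driver
    cached = cuda_memory_report(device)
    gc.collect()
    torch.cuda.empty_cache()  # unused cached blocks go back to the driver
    released = cuda_memory_report(device)
    return {"live": live, "cached": cached, "released": released}
```

On a CUDA machine, the "cached" report would be expected to show a lower allocated figure than "live" while reserved stays roughly the same, and the "released" report a lower reserved figure. The exact numbers depend on the allocator state and are not shown here.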
Code evidence from lmms_eval/api/model.py:330-331:
gc.collect()
torch.cuda.empty_cache()
Code evidence from lmms_eval/utils.py:923-924:
gc.collect()
torch.cuda.empty_cache()