
Heuristic:EvolvingLMMs Lab Lmms eval Memory Cleanup After Inference

From Leeroopedia
Knowledge Sources
Domains GPU_Computing, Optimization
Last Updated 2026-02-14 00:00 GMT

Overview

Run explicit garbage collection and clear the CUDA cache after inference to reduce the risk of out-of-memory (OOM) errors during long evaluation runs.

Description

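Large multimodal models can leave GPU memory heavily fragmented during inference. PyTorch's CUDA caching allocator keeps freed blocks for reuse by later PyTorch allocations instead of returning them to the driver, so that memory stays unavailable to other processes and to CUDA libraries that allocate outside PyTorch, and a fragmented cache may fail to satisfy a large contiguous request. The lmms-eval framework explicitly calls gc.collect() followed by torch.cuda.empty_cache() at strategic points: the first call destroys unreachable Python objects (including tensors kept alive only by reference cycles), and the second releases the cached blocks that no live tensor is using. This is particularly important when evaluating multiple tasks sequentially or when running on GPUs with limited VRAM.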

Usage

This heuristic is automatically applied by the framework after inference. It is especially relevant when:

  • Evaluating multiple tasks in a single run
  • Running on GPUs with limited VRAM (e.g., 16 to 24 GB)
  • Processing video/multimodal inputs that temporarily consume large buffers

The Insight (Rule of Thumb)

  • Action: Call gc.collect() then torch.cuda.empty_cache() after model inference and between tasks.
  • Value: Can reclaim hundreds of MB to several GB of fragmented GPU memory.
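  • Trade-off: Small CPU overhead for garbage collection, plus slower allocations right after the cache is emptied because PyTorch must request memory from the CUDA driver again. In exchange, it helps avoid OOM errors mid-evaluation.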
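
The rule above can be sketched as a small helper. This is a minimal illustration, not the lmms-eval implementation: the names free_gpu_memory and run_tasks are hypothetical, and the optional import is only there so the sketch also runs on machines without PyTorch or without a GPU.

```python
import gc

try:
    import torch  # optional here so the sketch also runs without PyTorch installed
except ImportError:
    torch = None


def free_gpu_memory() -> int:
    """Hypothetical helper: collect garbage, then empty PyTorch's CUDA cache.

    Returns the number of unreachable objects found by the garbage collector.
    """
    # Run the garbage collector first: tensors kept alive only by reference
    # cycles must be destroyed before their blocks return to the allocator.
    collected = gc.collect()
    if torch is not None and torch.cuda.is_available():
        # Release cached blocks that no live tensor is using back to the driver.
        torch.cuda.empty_cache()
    return collected


def run_tasks(tasks):
    """Hypothetical driver loop: run each task callable and clean up in between."""
    results = []
    for task in tasks:
        results.append(task())
        free_gpu_memory()
    return results
```

Calling the helper once per task, rather than once per batch, keeps the allocation slowdown after each cache flush out of the inner inference loop.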

Reasoning

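PyTorch's CUDA memory allocator pools freed GPU memory to avoid expensive cudaMalloc calls. While this improves throughput, it means that memory PyTorch has already freed internally is still reported as in use by nvidia-smi, because the allocator holds it in its cache. During multi-task evaluation, blocks cached while running one task may not match the sizes the next task requests, so the cache fragments and a large allocation can fail even though enough memory is nominally unused. Explicit cache clearing forces PyTorch to release its unused cached blocks back to the CUDA driver, making them available for new allocations. Calling gc.collect() first matters because torch.cuda.empty_cache() cannot release blocks still occupied by tensors that are unreachable but not yet collected.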
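
The gap between memory in use by tensors and memory held by the allocator can be observed with torch.cuda.memory_allocated() and torch.cuda.memory_reserved(). The sketch below is illustrative only: cuda_memory_snapshot and demo are hypothetical names, the 256 MiB buffer size is arbitrary, and the function returns None when CUDA is unavailable.

```python
import gc

try:
    import torch  # optional here so the sketch also runs without PyTorch installed
except ImportError:
    torch = None


def cuda_memory_snapshot():
    """Return (allocated_bytes, reserved_bytes) for the current device, or None."""
    if torch is None or not torch.cuda.is_available():
        return None
    # allocated: bytes occupied by live tensors
    # reserved: bytes held by the caching allocator, including freed blocks
    return torch.cuda.memory_allocated(), torch.cuda.memory_reserved()


def demo():
    """Allocate and free a buffer, then compare snapshots around empty_cache()."""
    before = cuda_memory_snapshot()
    if before is None:
        return None  # no CUDA device, nothing to measure
    buf = torch.empty(256 * 1024 * 1024, dtype=torch.uint8, device="cuda")
    del buf  # the block goes back to PyTorch's cache, not to the driver
    after_del = cuda_memory_snapshot()
    gc.collect()
    torch.cuda.empty_cache()  # unused cached blocks are returned to the driver
    after_clear = cuda_memory_snapshot()
    return before, after_del, after_clear
```

On a GPU machine, the reserved figure after del typically stays elevated while the allocated figure drops, and the reserved figure falls only after empty_cache(). The reserved value corresponds more closely to what nvidia-smi attributes to the process than the allocated value does.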

Code evidence from lmms_eval/api/model.py:330-331:

gc.collect()
torch.cuda.empty_cache()

Code evidence from lmms_eval/utils.py:923-924:

gc.collect()
torch.cuda.empty_cache()
