Heuristic:Marker Inc Korea AutoRAG GPU Memory Cleanup Pattern
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Deep_Learning |
| Last Updated | 2026-02-08 06:00 GMT |
Overview
GPU memory management pattern using explicit `empty_cuda_cache()` calls on module deletion to prevent VRAM leaks between pipeline stages.
Description
AutoRAG evaluates multiple modules per pipeline node sequentially. Each GPU-based module (reranker, generator, compressor, or embedding model) loads a model into VRAM. Without explicit cleanup, memory left over from a previous module can cause out-of-memory (OOM) errors when the next module loads its model. AutoRAG applies a consistent pattern: every GPU module calls `empty_cuda_cache()` in its `__del__` method to release PyTorch's cached CUDA memory. The utility is safe in CPU-only environments: it silently catches `ImportError` when PyTorch is not installed and skips the call when CUDA is unavailable.
Usage
Be aware of this pattern when implementing custom modules for AutoRAG or when debugging GPU OOM errors that occur mid-pipeline. If a custom module loads GPU models, it must follow this cleanup pattern to avoid VRAM accumulation.
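When chasing a mid-pipeline OOM, it can help to log GPU memory before and after each module. The following is a minimal sketch, not part of AutoRAG: the helper names `cuda_memory_snapshot` and `log_cuda_memory` are hypothetical, and the only PyTorch calls used are `torch.cuda.memory_allocated()` and `torch.cuda.memory_reserved()`.

```python
def cuda_memory_snapshot():
    """Return allocated/reserved CUDA memory in MiB, or None without CUDA.

    Hypothetical debugging helper; not part of AutoRAG.
    """
    try:
        import torch
    except ImportError:
        return None  # PyTorch not installed: nothing to report
    if not torch.cuda.is_available():
        return None  # PyTorch installed, but no usable CUDA device
    mib = 1024 ** 2
    return {
        # Memory currently occupied by live tensors.
        "allocated_mib": torch.cuda.memory_allocated() / mib,
        # Memory held by the caching allocator (live tensors plus cached blocks).
        "reserved_mib": torch.cuda.memory_reserved() / mib,
    }


def log_cuda_memory(label):
    """Print a one-line memory report and return the snapshot."""
    snapshot = cuda_memory_snapshot()
    if snapshot is None:
        print(f"[{label}] CUDA not available")
    else:
        print(
            f"[{label}] allocated={snapshot['allocated_mib']:.1f} MiB "
            f"reserved={snapshot['reserved_mib']:.1f} MiB"
        )
    return snapshot
```

If the reserved figure keeps growing from one module to the next while the allocated figure drops back, cached memory is probably not being released, which points to a module that is missing the cleanup pattern.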
The Insight (Rule of Thumb)
- Action: In every GPU-based module, implement a `__del__` method that runs `del self.model` and then calls `empty_cuda_cache()`.
- Value: Prevents VRAM accumulation across sequential module evaluations.
- Trade-off: Adds slight overhead on module teardown; model must be re-loaded if needed again.
- Pattern: The `empty_cuda_cache()` utility is CPU-safe (a no-op when PyTorch is not installed or CUDA is unavailable).
- vLLM extra: vLLM modules additionally call `destroy_model_parallel()`, `destroy_distributed_environment()`, and `torch.cuda.synchronize()` for distributed cleanup.
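
A custom module following the rule above might look like the sketch below. Everything here is illustrative: `MyCustomReranker` and `DummyModel` are hypothetical names, `empty_cuda_cache()` is redefined locally as a stand-in for the AutoRAG utility so the example runs on its own, and the AutoRAG base class (and therefore the `super().__del__()` call) is left out.

```python
def empty_cuda_cache():
    """Local stand-in mirroring the AutoRAG utility described in this page."""
    try:
        import torch

        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass  # PyTorch not installed: nothing to clear


class DummyModel:
    """Hypothetical placeholder for a model that would live on the GPU."""

    def score(self, query, passage):
        # Toy scoring: number of query words that occur in the passage.
        return sum(word in passage for word in query.split())


class MyCustomReranker:
    """Hypothetical custom module; not part of AutoRAG."""

    def __init__(self):
        self.model = DummyModel()

    def rerank(self, query, passages):
        # Highest-scoring passages first.
        return sorted(
            passages, key=lambda p: self.model.score(query, p), reverse=True
        )

    def __del__(self):
        # Guard against a partially constructed instance (e.g. __init__ raised).
        if hasattr(self, "model"):
            del self.model  # drop the reference first...
        empty_cuda_cache()  # ...then release cached CUDA memory
```

The `hasattr` guard is an addition of this sketch; it keeps `__del__` from raising `AttributeError` if model loading failed before `self.model` was assigned.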
Reasoning
During AutoRAG optimization, multiple modules are evaluated for each node. For example, a reranker node might evaluate ColBERT, MonoT5, and SentenceTransformer rerankers sequentially, each loading a different model into VRAM. Without cleanup, all three models would accumulate in memory. PyTorch's CUDA caching allocator keeps freed blocks reserved for reuse rather than returning them to the device, so `torch.cuda.empty_cache()` is needed to actually release that memory. It only releases unused cached blocks and cannot free tensors that are still referenced, which is why `del self.model` comes first.
The `empty_cuda_cache()` utility wraps this in a try/except to handle CPU-only environments gracefully, making the cleanup pattern universally applicable.
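The sequential evaluation described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not AutoRAG's actual evaluation loop: `FakeGpuModule` and `evaluate_sequentially` are hypothetical, and a class-level counter stands in for VRAM usage so the example runs without a GPU.

```python
import gc


class FakeGpuModule:
    """Hypothetical module that counts how many 'loaded models' are alive."""

    live_instances = 0

    def __init__(self, name):
        self.name = name
        self.model = object()  # stands in for a model loaded into VRAM
        FakeGpuModule.live_instances += 1

    def run(self):
        return f"{self.name}: ok"

    def __del__(self):
        if hasattr(self, "model"):
            del self.model
            FakeGpuModule.live_instances -= 1
        # A real GPU module would call empty_cuda_cache() here.


def evaluate_sequentially(module_names):
    """Run each module in turn and report the peak number alive at once."""
    results, peak_live = [], 0
    for name in module_names:
        module = FakeGpuModule(name)
        peak_live = max(peak_live, FakeGpuModule.live_instances)
        results.append(module.run())
        # Drop the only reference so __del__ runs before the next model loads.
        del module
        gc.collect()
    return results, peak_live


results, peak = evaluate_sequentially(["colbert", "monot5", "sentence_transformer"])
print(results, peak)
```

In CPython, removing the explicit `del module` would let the loop construct the next instance before the previous one is released, so two models would briefly overlap. That overlap is the window in which a real pipeline could run out of VRAM.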
Code Evidence
Central cleanup utility in `autorag/utils/util.py:679-686`:
```python
def empty_cuda_cache():
    try:
        import torch

        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass
```
Reranker cleanup pattern in `autorag/nodes/passagereranker/sentence_transformer.py:44-48`:
```python
def __del__(self):
    del self.model
    empty_cuda_cache()
    super().__del__()
```
vLLM distributed cleanup in `autorag/nodes/generator/vllm.py:38-58`:
```python
def __del__(self):
    try:
        import torch

        if torch.cuda.is_available():
            from vllm.distributed.parallel_state import (
                destroy_model_parallel,
                destroy_distributed_environment,
            )

            destroy_model_parallel()
            destroy_distributed_environment()
            torch.cuda.empty_cache()
            torch.cuda.synchronize()
    except ImportError:
        del self.vllm_model
```
Modules using this pattern:
- `autorag/nodes/passagereranker/colbert.py`
- `autorag/nodes/passagereranker/sentence_transformer.py`
- `autorag/nodes/passagereranker/monot5.py`
- `autorag/nodes/passagereranker/koreranker.py`
- `autorag/nodes/passagereranker/flashrank.py`
- `autorag/nodes/passagereranker/flag_embedding.py`
- `autorag/nodes/passagereranker/flag_embedding_llm.py`
- `autorag/nodes/passagereranker/rankgpt.py`
- `autorag/nodes/passagereranker/upr.py`
- `autorag/nodes/passagereranker/tart/tart.py`
- `autorag/nodes/passagecompressor/longllmlingua.py`
- `autorag/nodes/generator/vllm.py`
- `autorag/embedding/vllm.py`