Heuristic:Marker Inc Korea AutoRAG GPU Memory Cleanup Pattern

Knowledge Sources	AutoRAG
Domains	Optimization, Deep_Learning
Last Updated	2026-02-08 06:00 GMT

Overview

A GPU memory management pattern: each GPU-backed module calls `empty_cuda_cache()` explicitly when it is deleted, so that VRAM does not leak between pipeline stages.

Description

AutoRAG evaluates multiple modules per pipeline node sequentially. Each GPU-based module (reranker, generator, compressor, or embedding model) loads a model into GPU VRAM. Without explicit cleanup, memory left over from a previous module can cause out-of-memory (OOM) errors when the next module loads its model. AutoRAG applies a consistent pattern: each GPU module calls `empty_cuda_cache()` in its `__del__` method, which releases PyTorch's cached CUDA memory. The utility also works where PyTorch is not installed (CPU-only environments): it catches `ImportError` and does nothing.

Usage

Be aware of this pattern when implementing custom modules for AutoRAG or when debugging GPU OOM errors that occur mid-pipeline. If a custom module loads GPU models, it must follow this cleanup pattern to avoid VRAM accumulation.
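When debugging a mid-pipeline OOM, it can help to record how much memory PyTorch holds before and after each module runs. The helper below is a minimal sketch, not part of AutoRAG: the name `cuda_memory_snapshot` is hypothetical, and it assumes that PyTorch's `torch.cuda.memory_allocated()` and `torch.cuda.memory_reserved()` counters are the figures of interest. Like `empty_cuda_cache()`, it degrades to a no-op on CPU-only machines.

```python
def cuda_memory_snapshot():
    """Return allocated/reserved CUDA memory in MiB, or None when CUDA is unusable.

    Hypothetical debugging helper; not an AutoRAG function.
    """
    try:
        import torch
    except ImportError:
        return None  # CPU-only environment without PyTorch
    if not torch.cuda.is_available():
        return None
    mib = 1024 ** 2
    return {
        # memory currently occupied by live tensors
        "allocated_mib": torch.cuda.memory_allocated() / mib,
        # memory held by PyTorch's caching allocator (live tensors plus cache)
        "reserved_mib": torch.cuda.memory_reserved() / mib,
    }


snapshot = cuda_memory_snapshot()
print(snapshot if snapshot is not None else "CUDA not available")
```

A large gap between the reserved and allocated figures after a module has been deleted suggests that cached memory has not yet been released.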

The Insight (Rule of Thumb)

Action: Implement `__del__` in every GPU-based module that calls `del self.model` followed by `empty_cuda_cache()`.
Value: Prevents VRAM accumulation across sequential module evaluations.
Trade-off: Adds slight overhead on module teardown; model must be re-loaded if needed again.
Pattern: The `empty_cuda_cache()` utility is CPU-safe (a no-op when torch is not installed or CUDA is unavailable).
vLLM extra: vLLM modules additionally call `destroy_model_parallel()`, `destroy_distributed_environment()`, and `torch.cuda.synchronize()` for distributed cleanup.
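The rule above can be sketched for a custom module as follows. This is an illustration under stated assumptions rather than AutoRAG code: `MyGpuModule` and its `model_loader` argument are hypothetical, and `empty_cuda_cache()` is repeated here only to keep the example self-contained. The sketch guards the `del` with `hasattr`, because `__del__` can run on an object whose `__init__` failed before `self.model` was assigned.

```python
def empty_cuda_cache():
    # Same shape as the AutoRAG utility quoted under Code Evidence.
    try:
        import torch

        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass  # CPU-only environment: nothing to clean up


class MyGpuModule:
    """Hypothetical custom module; not part of AutoRAG."""

    def __init__(self, model_loader):
        # model_loader is a hypothetical callable that returns a loaded model
        self.model = model_loader()

    def __del__(self):
        # Drop the model reference first so its tensors become free blocks,
        # then ask PyTorch to release those cached blocks.
        if hasattr(self, "model"):
            del self.model
        empty_cuda_cache()


module = MyGpuModule(lambda: object())  # object() stands in for a real model
del module  # in CPython the last reference is gone, so __del__ runs here
```

In a real AutoRAG module the base class defines its own `__del__`, which is why the excerpts under Code Evidence end with `super().__del__()`; the sketch omits that call because plain `object` has no `__del__`.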

Reasoning

During AutoRAG optimization, multiple modules are evaluated for each node. For example, a reranker node might evaluate ColBERT, MonoT5, and SentenceTransformer rerankers sequentially. Each loads a different model into GPU VRAM. Without cleanup, memory from all three models could remain resident at once. Deleting the model alone is not enough to lower the process's VRAM footprint, because PyTorch's CUDA caching allocator keeps freed blocks reserved for reuse; `torch.cuda.empty_cache()` releases those unused cached blocks back to the GPU.

The `empty_cuda_cache()` utility wraps this in a try/except to handle CPU-only environments gracefully, making the cleanup pattern universally applicable.
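The ordering that this reasoning relies on (free the previous module before loading the next) can be shown without a GPU. The sketch below is not AutoRAG's evaluation loop: `FakeReranker`, `evaluate_sequentially`, and the `events` log are hypothetical stand-ins that record lifecycle events instead of allocating VRAM.

```python
import gc

events = []  # lifecycle log, used only for illustration


class FakeReranker:
    """Stand-in for a GPU module; records events instead of using a GPU."""

    def __init__(self, name):
        self.name = name
        self.model = bytearray(1024)  # placeholder for model weights
        events.append(f"load {name}")

    def run(self):
        return len(self.model)

    def __del__(self):
        # pop with a default so a partially built object does not raise here
        self.__dict__.pop("model", None)
        # a real module would call empty_cuda_cache() at this point
        events.append(f"free {self.name}")


def evaluate_sequentially(names):
    results = {}
    for name in names:
        module = FakeReranker(name)
        results[name] = module.run()
        del module    # drop the only reference so __del__ runs now (CPython)
        gc.collect()  # also collect anything caught in a reference cycle
    return results


evaluate_sequentially(["colbert", "monot5", "sentence_transformer"])
print(events)
```

The explicit `del module` matters: if the loop only rebound `module` on the next iteration, the new object would be constructed before the old one was released, so both models would briefly coexist in memory.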

Code Evidence

Central cleanup utility in `autorag/utils/util.py:679-686`:

def empty_cuda_cache():
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass

Reranker cleanup pattern in `autorag/nodes/passagereranker/sentence_transformer.py:44-48`:

def __del__(self):
    del self.model
    empty_cuda_cache()
    super().__del__()

vLLM distributed cleanup in `autorag/nodes/generator/vllm.py:38-58`:

def __del__(self):
    try:
        import torch
        if torch.cuda.is_available():
            from vllm.distributed.parallel_state import (
                destroy_model_parallel,
                destroy_distributed_environment,
            )
            destroy_model_parallel()
            destroy_distributed_environment()
            torch.cuda.empty_cache()
            torch.cuda.synchronize()
    except ImportError:
        del self.vllm_model

Modules using this pattern:

`autorag/nodes/passagereranker/colbert.py`
`autorag/nodes/passagereranker/sentence_transformer.py`
`autorag/nodes/passagereranker/monot5.py`
`autorag/nodes/passagereranker/koreranker.py`
`autorag/nodes/passagereranker/flashrank.py`
`autorag/nodes/passagereranker/flag_embedding.py`
`autorag/nodes/passagereranker/flag_embedding_llm.py`
`autorag/nodes/passagereranker/rankgpt.py`
`autorag/nodes/passagereranker/upr.py`
`autorag/nodes/passagereranker/tart/tart.py`
`autorag/nodes/passagecompressor/longllmlingua.py`
`autorag/nodes/generator/vllm.py`
`autorag/embedding/vllm.py`
