Environment: Marker Inc Korea AutoRAG GPU PyTorch Environment
| Knowledge Sources | Details |
|---|---|
| Domains | Infrastructure, Deep_Learning, RAG |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
GPU-accelerated environment with PyTorch, Transformers, vLLM, and local model inference libraries for running AutoRAG's rerankers, embeddings, and local generators.
Description
This environment extends the base Python runtime with the `AutoRAG[gpu]` optional extra. It provides PyTorch for CUDA-based inference, HuggingFace Transformers for model loading, vLLM for high-throughput local LLM serving, sentence-transformers for cross-encoder reranking, FlagEmbedding for BAAI rerankers, ONNX Runtime for FlashRank inference, and LLMLingua for passage compression. All local reranker modules (ColBERT, MonoT5, KoReranker, TART, UPR, SentenceTransformer, FlagEmbedding, OpenVINO, FlashRank) require this environment. Device selection is automatic: CUDA if available, otherwise CPU fallback.
Usage
Use this environment when running local model inference for reranking, embedding, or text generation. It is required for any pipeline that uses non-API-based modules such as ColBERT reranker, MonoT5, KoReranker, TART, FlashRank, FlagEmbedding, SentenceTransformer reranker, UPR, LongLLMLingua compressor, or vLLM generator.
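Before launching such a pipeline, it can help to confirm that the GPU extras are actually importable. The sketch below is not part of AutoRAG's API; it is a stdlib-only preflight check whose package names are taken from the dependency list in this document:

```python
import importlib.util

# Optional packages pulled in by `AutoRAG[gpu]` (see Dependencies below).
GPU_PACKAGES = ["torch", "transformers", "sentence_transformers",
                "FlagEmbedding", "onnxruntime", "vllm", "llmlingua"]

def missing_gpu_packages():
    """Return the subset of GPU-extra packages that cannot be imported."""
    return [pkg for pkg in GPU_PACKAGES if importlib.util.find_spec(pkg) is None]

if __name__ == "__main__":
    missing = missing_gpu_packages()
    if missing:
        print(f"Missing GPU extras: {missing} - run: pip install 'AutoRAG[gpu]'")
    else:
        print("All GPU-extra packages are importable.")
```

Running this before a pipeline avoids discovering a missing package mid-run, when a local reranker or generator module first tries to import it.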
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (recommended) | CUDA support best on Linux; macOS CPU-only |
| Hardware | NVIDIA GPU (recommended) | CUDA-capable GPU for acceleration; CPU fallback available |
| VRAM | 4GB+ minimum | Depends on model size; rerankers need 2-8GB, vLLM generators need 16GB+ |
| Python | >= 3.10 | Same as base environment |
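Whether the VRAM requirement is met can be checked programmatically. The helper below is an illustrative sketch (not AutoRAG code) that mirrors the library's own device-selection logic and degrades gracefully when `torch` is absent:

```python
def describe_device():
    """Report the device a local module would select, plus VRAM when on CUDA."""
    try:
        import torch
    except ImportError:
        return "cpu (torch not installed - install AutoRAG[gpu])"
    if not torch.cuda.is_available():
        return "cpu (no CUDA device detected)"
    # Query the first GPU; multi-GPU setups would iterate over device indices.
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    return f"cuda ({props.name}, {vram_gb:.1f} GB VRAM)"
```

Comparing the reported VRAM against the table above (2-8GB for rerankers, 16GB+ for vLLM generators) tells you which module classes the machine can host.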
Dependencies
GPU Extra Packages
- `torch` >= 2.7.1
- `sentencepiece` >= 0.2.0
- `bert_score` >= 0.3.13
- `peft` >= 0.15.2
- `llmlingua` >= 0.2.2
- `FlagEmbedding` >= 1.2.11
- `sentence-transformers` >= 4.1.0
- `transformers` >= 4.51.3
- `onnxruntime` >= 1.22.0
- `vllm` >= 0.11.0
Additional LlamaIndex Integrations
- `llama-index-llms-ollama` >= 0.6.0
- `llama-index-embeddings-huggingface` >= 0.5.4
- `llama-index-llms-huggingface` >= 0.5.0
Credentials
No additional credentials required beyond the base environment. Local models are loaded from HuggingFace Hub (public models) or local paths.
Quick Install
```shell
# Install AutoRAG with GPU support
pip install "AutoRAG[gpu]"

# Or install everything
pip install "AutoRAG[all]"
```
Code Evidence
Device auto-detection from `autorag/nodes/passagereranker/colbert.py:42`:
```python
self.device = "cuda" if torch.cuda.is_available() else "cpu"
```
PyTorch import guard from `autorag/nodes/passagereranker/colbert.py:35-41`:
```python
try:
    import torch
    from transformers import AutoModel, AutoTokenizer
except ImportError:
    raise ImportError(
        "Pytorch is not installed. Please install pytorch to use Colbert reranker."
    )
```
GPU module gating from `autorag/__init__.py:61-72`:
```python
try:
    from llama_index.llms.huggingface import HuggingFaceLLM
    from llama_index.llms.ollama import Ollama

    generator_models["huggingfacellm"] = HuggingFaceLLM
    generator_models["ollama"] = Ollama
except ImportError:
    logger.info(
        "You are using API version of AutoRAG."
        "To use local version, run pip install 'AutoRAG[gpu]'"
    )
```
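The same gating can be written data-driven. A hedged sketch (not AutoRAG's actual code) that registers whichever optional backends import cleanly, using the same module paths as the evidence above:

```python
import importlib

def register_optional_generators(registry):
    """Add local-generator classes to `registry` only if their packages import."""
    # (module path, class name, registry key) - names mirror the gating above
    candidates = [
        ("llama_index.llms.huggingface", "HuggingFaceLLM", "huggingfacellm"),
        ("llama_index.llms.ollama", "Ollama", "ollama"),
    ]
    for module_path, attr, key in candidates:
        try:
            module = importlib.import_module(module_path)
            registry[key] = getattr(module, attr)
        except ImportError:
            pass  # API-only install; this backend stays unregistered
    return registry
```

This per-candidate loop registers each backend independently, whereas the quoted block skips both generators if either import fails.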
CUDA cache cleanup from `autorag/utils/util.py:679-686`:
```python
def empty_cuda_cache():
    try:
        import torch

        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass
```
vLLM CUDA cleanup from `autorag/embedding/vllm.py:108`:
```python
if torch.cuda.is_available():
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.synchronize()
```
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `ImportError: Pytorch is not installed` | torch not in environment | `pip install "AutoRAG[gpu]"` |
| `ImportError: FlagEmbeddingReranker requires the 'FlagEmbedding' package` | FlagEmbedding missing | `pip install "FlagEmbedding>=1.2.11"` |
| `You have to install AutoRAG[gpu] to use SentenceTransformerReranker` | sentence-transformers missing | `pip install "AutoRAG[gpu]"` |
| `Please install vllm library` | vLLM not installed | `pip install "vllm>=0.11.0"` |
| `CUDA out of memory` | Insufficient GPU VRAM | Use smaller model or reduce batch size |
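For the `CUDA out of memory` case, "reduce batch size" can be automated. The helper below is purely illustrative (not an AutoRAG API): it retries any batched inference callable with progressively smaller batches, relying on the fact that PyTorch surfaces CUDA OOM as a `RuntimeError` whose message contains "out of memory":

```python
def run_with_batch_backoff(infer_fn, inputs, batch_size=32, min_batch=1):
    """Retry `infer_fn` over `inputs` with halved batches on CUDA OOM.

    `infer_fn` is any callable that maps a list of inputs to a list of
    outputs; this helper is a sketch, not part of AutoRAG.
    """
    while batch_size >= min_batch:
        try:
            return [out
                    for i in range(0, len(inputs), batch_size)
                    for out in infer_fn(inputs[i:i + batch_size])]
        except RuntimeError as err:  # torch raises RuntimeError on CUDA OOM
            if "out of memory" not in str(err).lower() or batch_size == min_batch:
                raise  # not an OOM, or nothing left to shrink
            batch_size //= 2  # halve the batch and retry
    raise RuntimeError("could not fit even the minimum batch size")
```

Non-OOM `RuntimeError`s are re-raised untouched, so genuine bugs are not silently retried.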
Compatibility Notes
- CPU fallback: All modules that check `torch.cuda.is_available()` automatically fall back to CPU if no GPU is detected. Performance will be significantly slower.
- vLLM: Requires CUDA-capable GPU; does not support CPU-only mode. Linux only.
- OpenVINO reranker: Alternative to CUDA for Intel hardware; runs models through Intel's OpenVINO runtime rather than PyTorch.
- FlashRank: Uses ONNX Runtime, not PyTorch directly. Works on CPU efficiently.
- vLLM version compatibility: Code handles both vLLM >= 0.11 (`vllm.logprobs.SampleLogprobs`) and older versions (`vllm.sequence.SampleLogprobs`).
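The vLLM version split above can be handled with a layered import fallback. A sketch of the pattern (not AutoRAG's exact code), which also tolerates vLLM being absent entirely:

```python
def load_sample_logprobs():
    """Resolve `SampleLogprobs` across vLLM versions; None if vLLM is absent."""
    try:
        from vllm.logprobs import SampleLogprobs  # vLLM >= 0.11
    except ImportError:
        try:
            from vllm.sequence import SampleLogprobs  # older vLLM
        except ImportError:
            return None  # vLLM not installed at all
    return SampleLogprobs
```

Resolving the symbol through a function like this keeps the version branching in one place instead of scattering dual imports across call sites.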