Environment: Marker Inc Korea AutoRAG GPU PyTorch Environment
| Knowledge Sources | Details |
|---|---|
| Domains | Infrastructure, Deep_Learning, RAG |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
GPU-accelerated environment with PyTorch, Transformers, vLLM, and local model inference libraries for running AutoRAG's rerankers, embeddings, and local generators.
Description
This environment extends the base Python runtime with the `AutoRAG[gpu]` optional extra. It provides PyTorch for CUDA-based inference, HuggingFace Transformers for model loading, vLLM for high-throughput local LLM serving, sentence-transformers for cross-encoder reranking, FlagEmbedding for BAAI rerankers, ONNX Runtime for FlashRank inference, and LLMLingua for passage compression. All local reranker modules (ColBERT, MonoT5, KoReranker, TART, UPR, SentenceTransformer, FlagEmbedding, OpenVINO, FlashRank) require this environment. Device selection is automatic: CUDA if available, otherwise CPU fallback.
Usage
Use this environment when running local model inference for reranking, embedding, or text generation. It is required for any pipeline that uses non-API-based modules such as ColBERT reranker, MonoT5, KoReranker, TART, FlashRank, FlagEmbedding, SentenceTransformer reranker, UPR, LongLLMLingua compressor, or vLLM generator.
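Before launching such a pipeline, it can help to confirm that the GPU extras are actually importable. The sketch below is not part of AutoRAG's API; it is a stdlib-only preflight check whose package names are taken from the dependency list in this document:

```python
import importlib.util

# Optional packages pulled in by `AutoRAG[gpu]` (see Dependencies below).
GPU_PACKAGES = ["torch", "transformers", "sentence_transformers",
                "FlagEmbedding", "onnxruntime", "vllm", "llmlingua"]

def missing_gpu_packages():
    """Return the subset of GPU-extra packages that cannot be imported."""
    return [pkg for pkg in GPU_PACKAGES if importlib.util.find_spec(pkg) is None]

if __name__ == "__main__":
    missing = missing_gpu_packages()
    if missing:
        print(f"Missing GPU extras: {missing} - run: pip install 'AutoRAG[gpu]'")
    else:
        print("All GPU-extra packages are importable.")
```

Running this before a pipeline avoids discovering a missing package mid-run, when a local reranker or generator module first tries to import it.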
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (recommended) | CUDA support best on Linux; macOS CPU-only |
| Hardware | NVIDIA GPU (recommended) | CUDA-capable GPU for acceleration; CPU fallback available |
| VRAM | 4GB+ minimum | Depends on model size; rerankers need 2-8GB, vLLM generators need 16GB+ |
| Python | >= 3.10 | Same as base environment |
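Whether the VRAM requirement is met can be checked programmatically. The helper below is an illustrative sketch (not AutoRAG code) that mirrors the library's own device-selection logic and degrades gracefully when `torch` is absent:

```python
def describe_device():
    """Report the device a local module would select, plus VRAM when on CUDA."""
    try:
        import torch
    except ImportError:
        return "cpu (torch not installed - install AutoRAG[gpu])"
    if not torch.cuda.is_available():
        return "cpu (no CUDA device detected)"
    # Query the first GPU; multi-GPU setups would iterate over device indices.
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    return f"cuda ({props.name}, {vram_gb:.1f} GB VRAM)"
```

Comparing the reported VRAM against the table above (2-8GB for rerankers, 16GB+ for vLLM generators) tells you which module classes the machine can host.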
Dependencies
GPU Extra Packages
- `torch` >= 2.7.1
- `sentencepiece` >= 0.2.0
- `bert_score` >= 0.3.13
- `peft` >= 0.15.2
- `llmlingua` >= 0.2.2
- `FlagEmbedding` >= 1.2.11
- `sentence-transformers` >= 4.1.0
- `transformers` >= 4.51.3
- `onnxruntime` >= 1.22.0
- `vllm` >= 0.11.0
Additional LlamaIndex Integrations
- `llama-index-llms-ollama` >= 0.6.0
- `llama-index-embeddings-huggingface` >= 0.5.4
- `llama-index-llms-huggingface` >= 0.5.0
Credentials
No additional credentials required beyond the base environment. Local models are loaded from HuggingFace Hub (public models) or local paths.
Quick Install
```shell
# Install AutoRAG with GPU support
pip install "AutoRAG[gpu]"

# Or install everything
pip install "AutoRAG[all]"
```
Code Evidence
Device auto-detection from `autorag/nodes/passagereranker/colbert.py:42`:
```python
self.device = "cuda" if torch.cuda.is_available() else "cpu"
```
PyTorch import guard from `autorag/nodes/passagereranker/colbert.py:35-41`:
```python
try:
    import torch
    from transformers import AutoModel, AutoTokenizer
except ImportError:
    raise ImportError(
        "Pytorch is not installed. Please install pytorch to use Colbert reranker."
    )
```
GPU module gating from `autorag/__init__.py:61-72`:
```python
try:
    from llama_index.llms.huggingface import HuggingFaceLLM
    from llama_index.llms.ollama import Ollama

    generator_models["huggingfacellm"] = HuggingFaceLLM
    generator_models["ollama"] = Ollama
except ImportError:
    logger.info(
        "You are using API version of AutoRAG."
        "To use local version, run pip install 'AutoRAG[gpu]'"
    )
```
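The same gating can be written data-driven. A hedged sketch (not AutoRAG's actual code) that registers whichever optional backends import cleanly, using the same module paths as the evidence above:

```python
import importlib

def register_optional_generators(registry):
    """Add local-generator classes to `registry` only if their packages import."""
    # (module path, class name, registry key) - names mirror the gating above
    candidates = [
        ("llama_index.llms.huggingface", "HuggingFaceLLM", "huggingfacellm"),
        ("llama_index.llms.ollama", "Ollama", "ollama"),
    ]
    for module_path, attr, key in candidates:
        try:
            module = importlib.import_module(module_path)
            registry[key] = getattr(module, attr)
        except ImportError:
            pass  # API-only install; this backend stays unregistered
    return registry
```

This per-candidate loop registers each backend independently, whereas the quoted block skips both generators if either import fails.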
CUDA cache cleanup from `autorag/utils/util.py:679-686`:
```python
def empty_cuda_cache():
    try:
        import torch

        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass
```
vLLM CUDA cleanup from `autorag/embedding/vllm.py:108`:
```python
if torch.cuda.is_available():
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.synchronize()
```
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `ImportError: Pytorch is not installed` | torch not in environment | `pip install "AutoRAG[gpu]"` |
| `ImportError: FlagEmbeddingReranker requires the 'FlagEmbedding' package` | FlagEmbedding missing | `pip install "FlagEmbedding>=1.2.11"` |
| `You have to install AutoRAG[gpu] to use SentenceTransformerReranker` | sentence-transformers missing | `pip install "AutoRAG[gpu]"` |
| `Please install vllm library` | vLLM not installed | `pip install "vllm>=0.11.0"` |
| `CUDA out of memory` | Insufficient GPU VRAM | Use smaller model or reduce batch size |
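For the `CUDA out of memory` case, "reduce batch size" can be automated. The helper below is purely illustrative (not an AutoRAG API): it retries any batched inference callable with progressively smaller batches, relying on the fact that PyTorch surfaces CUDA OOM as a `RuntimeError` whose message contains "out of memory":

```python
def run_with_batch_backoff(infer_fn, inputs, batch_size=32, min_batch=1):
    """Retry `infer_fn` over `inputs` with halved batches on CUDA OOM.

    `infer_fn` is any callable that maps a list of inputs to a list of
    outputs; this helper is a sketch, not part of AutoRAG.
    """
    while batch_size >= min_batch:
        try:
            return [out
                    for i in range(0, len(inputs), batch_size)
                    for out in infer_fn(inputs[i:i + batch_size])]
        except RuntimeError as err:  # torch raises RuntimeError on CUDA OOM
            if "out of memory" not in str(err).lower() or batch_size == min_batch:
                raise  # not an OOM, or nothing left to shrink
            batch_size //= 2  # halve the batch and retry
    raise RuntimeError("could not fit even the minimum batch size")
```

Non-OOM `RuntimeError`s are re-raised untouched, so genuine bugs are not silently retried.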
Compatibility Notes
- CPU fallback: All modules that check `torch.cuda.is_available()` automatically fall back to CPU if no GPU is detected. Performance will be significantly slower.
- vLLM: Requires CUDA-capable GPU; does not support CPU-only mode. Linux only.
- OpenVINO reranker: Alternative to CUDA for Intel hardware; runs models through Intel's OpenVINO runtime rather than PyTorch.
- FlashRank: Uses ONNX Runtime, not PyTorch directly. Works on CPU efficiently.
- vLLM version compatibility: Code handles both vLLM >= 0.11 (`vllm.logprobs.SampleLogprobs`) and older versions (`vllm.sequence.SampleLogprobs`).
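The vLLM version split above can be handled with a layered import fallback. A sketch of the pattern (not AutoRAG's exact code), which also tolerates vLLM being absent entirely:

```python
def load_sample_logprobs():
    """Resolve `SampleLogprobs` across vLLM versions; None if vLLM is absent."""
    try:
        from vllm.logprobs import SampleLogprobs  # vLLM >= 0.11
    except ImportError:
        try:
            from vllm.sequence import SampleLogprobs  # older vLLM
        except ImportError:
            return None  # vLLM not installed at all
    return SampleLogprobs
```

Resolving the symbol through a function like this keeps the version branching in one place instead of scattering dual imports across call sites.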