
Environment:Marker Inc Korea AutoRAG GPU PyTorch Environment

From Leeroopedia Knowledge Sources
Domains: Infrastructure, Deep_Learning, RAG
Last Updated: 2026-02-12 00:00 GMT

Overview

GPU-accelerated environment with PyTorch, Transformers, vLLM, and local model inference libraries for running AutoRAG's rerankers, embeddings, and local generators.

Description

This environment extends the base Python runtime with the `AutoRAG[gpu]` optional extra. It provides PyTorch for CUDA-based inference, HuggingFace Transformers for model loading, vLLM for high-throughput local LLM serving, sentence-transformers for cross-encoder reranking, FlagEmbedding for BAAI rerankers, ONNX Runtime for FlashRank inference, and LLMLingua for passage compression. All local reranker modules (ColBERT, MonoT5, KoReranker, TART, UPR, SentenceTransformer, FlagEmbedding, OpenVINO, FlashRank) require this environment. Device selection is automatic: CUDA if available, otherwise CPU fallback.
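The automatic device selection described above can be sketched as a small probe. This is an illustrative helper, not AutoRAG's own code; the library performs the same `torch.cuda.is_available()` check inline in each local reranker module (see Code Evidence below).

```python
import importlib.util

def pick_device() -> str:
    """Mirror AutoRAG's device auto-detection: CUDA when available, else CPU.

    Illustrative sketch: also tolerates torch being absent entirely,
    which corresponds to the API-only (non-GPU) install of AutoRAG.
    """
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # torch not installed: API-only environment
    import torch
    return "cuda" if torch.cuda.is_available() else "cpu"

print(pick_device())
```

On a machine without an NVIDIA GPU this prints `cpu`, matching the CPU-fallback behavior noted in the Compatibility Notes.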

Usage

Use this environment when running local model inference for reranking, embedding, or text generation. It is required for any pipeline that uses non-API-based modules such as ColBERT reranker, MonoT5, KoReranker, TART, FlashRank, FlagEmbedding, SentenceTransformer reranker, UPR, LongLLMLingua compressor, or vLLM generator.

System Requirements

Category | Requirement | Notes
OS | Linux (recommended) | CUDA support best on Linux; macOS CPU-only
Hardware | NVIDIA GPU (recommended) | CUDA-capable GPU for acceleration; CPU fallback available
VRAM | 4 GB+ minimum | Depends on model size; rerankers need 2-8 GB, vLLM generators need 16 GB+
Python | >= 3.10 | Same as base environment

Dependencies

GPU Extra Packages

  • `torch` >= 2.7.1
  • `sentencepiece` >= 0.2.0
  • `bert_score` >= 0.3.13
  • `peft` >= 0.15.2
  • `llmlingua` >= 0.2.2
  • `FlagEmbedding` >= 1.2.11
  • `sentence-transformers` >= 4.1.0
  • `transformers` >= 4.51.3
  • `onnxruntime` >= 1.22.0
  • `vllm` >= 0.11.0

Additional LlamaIndex Integrations

  • `llama-index-llms-ollama` >= 0.6.0
  • `llama-index-embeddings-huggingface` >= 0.5.4
  • `llama-index-llms-huggingface` >= 0.5.0

Credentials

No additional credentials required beyond the base environment. Local models are loaded from HuggingFace Hub (public models) or local paths.
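The Hub-or-local-path distinction above can be illustrated with a small resolver. This helper is hypothetical (not part of AutoRAG); it only shows the rule that public Hub models download without credentials, while an existing filesystem path is loaded directly.

```python
from pathlib import Path

def resolve_model_source(name_or_path: str) -> str:
    """Return 'local' for an existing filesystem path, else 'hub'.

    Illustrative sketch: public Hub models need no token; gated models
    would need HF_TOKEN, which this environment does not require.
    """
    return "local" if Path(name_or_path).exists() else "hub"
```

In practice the distinction is handled for you: `transformers.AutoModel.from_pretrained` accepts either a Hub model ID or a local directory path.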

Quick Install

# Install AutoRAG with GPU support
pip install "AutoRAG[gpu]"

# Or install everything
pip install "AutoRAG[all]"
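After installing, a quick sanity check can confirm the GPU extras are importable. This is a hedged sketch (the package names checked are assumptions drawn from the dependency list above, not an official AutoRAG verification command).

```python
def check_gpu_extras() -> list:
    """Return the subset of key GPU-extra packages that fail to import.

    Illustrative sketch: an empty list suggests the AutoRAG[gpu] extras
    installed correctly; any names returned need reinstalling.
    """
    missing = []
    for pkg in ("torch", "transformers", "sentence_transformers"):
        try:
            __import__(pkg)
        except ImportError:
            missing.append(pkg)
    return missing

print(check_gpu_extras())
```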

Code Evidence

Device auto-detection from `autorag/nodes/passagereranker/colbert.py:42`:

self.device = "cuda" if torch.cuda.is_available() else "cpu"

PyTorch import guard from `autorag/nodes/passagereranker/colbert.py:35-41`:

try:
    import torch
    from transformers import AutoModel, AutoTokenizer
except ImportError:
    raise ImportError(
        "Pytorch is not installed. Please install pytorch to use Colbert reranker."
    )

GPU module gating from `autorag/__init__.py:61-72`:

try:
    from llama_index.llms.huggingface import HuggingFaceLLM
    from llama_index.llms.ollama import Ollama
    generator_models["huggingfacellm"] = HuggingFaceLLM
    generator_models["ollama"] = Ollama
except ImportError:
    logger.info(
        "You are using API version of AutoRAG."
        "To use local version, run pip install 'AutoRAG[gpu]'"
    )

CUDA cache cleanup from `autorag/utils/util.py:679-686`:

def empty_cuda_cache():
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass

vLLM CUDA cleanup from `autorag/embedding/vllm.py:108`:

if torch.cuda.is_available():
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.synchronize()

Common Errors

Error Message | Cause | Solution
`ImportError: Pytorch is not installed` | torch not in environment | `pip install "AutoRAG[gpu]"`
`ImportError: FlagEmbeddingReranker requires the 'FlagEmbedding' package` | FlagEmbedding missing | `pip install FlagEmbedding>=1.2.11`
`You have to install AutoRAG[gpu] to use SentenceTransformerReranker` | sentence-transformers missing | `pip install "AutoRAG[gpu]"`
`Please install vllm library` | vLLM not installed | `pip install vllm>=0.11.0`
`CUDA out of memory` | Insufficient GPU VRAM | Use smaller model or reduce batch size
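The "reduce batch size" remedy for CUDA OOM can be automated with a halving retry loop. This is an illustrative pattern, not an AutoRAG utility; `run_with_backoff` and its parameters are hypothetical names.

```python
def run_with_backoff(fn, batch, min_batch=1):
    """Run fn over batch, halving the chunk size on CUDA OOM.

    Illustrative sketch: catches only RuntimeErrors whose message
    mentions 'out of memory', re-raising anything else unchanged.
    """
    size = len(batch)
    while size >= min_batch:
        try:
            # Process the batch in chunks of the current size.
            return [fn(batch[i:i + size]) for i in range(0, len(batch), size)]
        except RuntimeError as e:
            if "out of memory" not in str(e).lower():
                raise  # unrelated error: do not mask it
            size //= 2  # halve and retry
    raise RuntimeError("batch size reduced below minimum; still OOM")
```

Pairing this with `empty_cuda_cache()` (shown in Code Evidence) between retries frees cached allocations before the smaller attempt.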

Compatibility Notes

  • CPU fallback: All modules that check `torch.cuda.is_available()` automatically fall back to CPU if no GPU is detected. Performance will be significantly slower.
  • vLLM: Requires CUDA-capable GPU; does not support CPU-only mode. Linux only.
  • OpenVINO reranker: Alternative to CUDA for Intel hardware; uses ONNX Runtime backend.
  • FlashRank: Uses ONNX Runtime, not PyTorch directly. Works on CPU efficiently.
  • vLLM version compatibility: Code handles both vLLM >= 0.11 (`vllm.logprobs.SampleLogprobs`) and older versions (`vllm.sequence.SampleLogprobs`).
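The vLLM version-compatibility note above corresponds to a try/except import pattern. The sketch below is illustrative (the wrapper function is hypothetical); only the two import paths are taken from the note itself.

```python
def locate_sample_logprobs():
    """Return the module path that provides SampleLogprobs, or None.

    Illustrative sketch of version-tolerant importing:
    vLLM >= 0.11 moved SampleLogprobs to vllm.logprobs.
    """
    try:
        from vllm.logprobs import SampleLogprobs  # vLLM >= 0.11
        return "vllm.logprobs"
    except ImportError:
        pass
    try:
        from vllm.sequence import SampleLogprobs  # older vLLM
        return "vllm.sequence"
    except ImportError:
        return None  # vllm not installed: API-only environment
```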
