Environment:Marker Inc Korea AutoRAG VLLM Environment

From Leeroopedia


Knowledge Sources
Domains Infrastructure, Deep_Learning, RAG
Last Updated 2026-02-08 06:00 GMT

Overview

vLLM-based environment for high-throughput local model serving for both generation and embedding in AutoRAG.

Description

This environment provides vLLM serving capabilities for AutoRAG. vLLM enables efficient local model inference with features such as PagedAttention, continuous batching, and tensor parallelism for multi-GPU setups. AutoRAG uses vLLM for two distinct purposes: (1) local text generation via the `vllm` generator module, and (2) local embedding via the `VLLMEmbedding` class. Both require NVIDIA GPUs with sufficient VRAM.

Usage

Use this environment when running local LLM generation or local embedding without relying on external APIs. Local serving can provide higher throughput and lower latency than API-based alternatives, depending on hardware and model size. The environment is required when using `module_type: vllm` or `module_type: vllm_api` in the YAML configuration; for `vllm_api`, the requirement applies to the machine that hosts the vLLM server.
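
As a rough illustration, a generator node that selects this environment might look like the following once its YAML has been parsed into Python. Only the `module_type` values `vllm` and `vllm_api` come from this page; the other keys, the model name, and the helper `needs_vllm_environment` are assumptions made for the sketch.

```python
# Hypothetical parsed form of a generator node from an AutoRAG YAML config.
# Only the module_type values "vllm" and "vllm_api" come from this page;
# every other key and value here is an illustrative assumption.
generator_node = {
    "node_type": "generator",
    "modules": [
        {"module_type": "vllm", "llm": "example-org/example-7b-model"},
        {"module_type": "openai_llm"},
    ],
}

VLLM_MODULE_TYPES = {"vllm", "vllm_api"}


def needs_vllm_environment(node: dict) -> bool:
    """Return True if any module in the node uses a vLLM-backed module type."""
    return any(
        module.get("module_type") in VLLM_MODULE_TYPES
        for module in node.get("modules", [])
    )


print(needs_vllm_environment(generator_node))  # True
```

A check like this could run before a trial starts, so that a missing GPU environment is reported up front rather than at model load time.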

System Requirements

| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu 20.04+) | vLLM does not support Windows or macOS natively |
| Hardware | NVIDIA GPU | Minimum 16 GB VRAM for 7B models; A100/H100 recommended |
| CUDA | Compatible with vLLM >= 0.11.0 | Check vLLM CUDA compatibility |
| RAM | 32 GB+ | For model weight loading |
| Disk | 20 GB+ SSD | For model downloads |
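
The 16 GB figure for 7B models can be sanity-checked with simple arithmetic: at 16-bit precision each parameter takes 2 bytes, so the weights alone need about 14 GB, which leaves little headroom for the KV cache and activations. The sketch below is a back-of-the-envelope estimate only. The function name and the 2-bytes-per-parameter default are assumptions, and real usage also depends on context length, batch size, and quantization.

```python
def estimate_weight_memory_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Estimate memory for model weights only, in GB (10^9 bytes).

    params_billions * 1e9 parameters * bytes_per_param bytes, divided by
    1e9 bytes per GB, simplifies to params_billions * bytes_per_param.
    """
    return params_billions * bytes_per_param


print(estimate_weight_memory_gb(7))       # 14.0 (16-bit weights)
print(estimate_weight_memory_gb(7, 0.5))  # 3.5  (4-bit quantized weights)
```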

Dependencies

System Packages

  • NVIDIA GPU drivers (latest stable)
  • CUDA toolkit (vLLM-compatible version)
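
A quick preflight check can catch the two most basic requirements before any model is loaded. The sketch below uses only the standard library and treats the presence of `nvidia-smi` on `PATH` as a proxy for an installed NVIDIA driver; that proxy and the helper names are assumptions, and the check does not verify CUDA version compatibility.

```python
import platform
import shutil


def preflight_problems(system: str, has_nvidia_smi: bool) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if system != "Linux":
        problems.append(f"unsupported OS for native vLLM: {system}")
    if not has_nvidia_smi:
        problems.append("nvidia-smi not found on PATH (NVIDIA driver may be missing)")
    return problems


def probe_host() -> list[str]:
    """Run the preflight check against the current machine."""
    return preflight_problems(
        system=platform.system(),
        has_nvidia_smi=shutil.which("nvidia-smi") is not None,
    )


for problem in probe_host():
    print("preflight:", problem)
```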

Python Packages

  • `vllm` >= 0.11.0 (included when installing `AutoRAG[gpu]`)

Credentials

  • `HF_TOKEN`: Optional. Required only for gated HuggingFace models (e.g., Llama, Mistral).

Quick Install

# Install AutoRAG with GPU support (includes vLLM)
pip install "AutoRAG[gpu]"

# Or install vLLM separately
pip install "vllm>=0.11.0"
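
After installation, the installed version can be checked against the 0.11.0 minimum from Python. This is a minimal sketch using `importlib.metadata`; the helper names are hypothetical, and the comparison deliberately looks only at the leading numeric part of each version component, so a pre-release such as `0.11.0rc1` is treated as `0.11.0`.

```python
from importlib import metadata
from itertools import takewhile


def installed_version(package: str):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def meets_minimum(version: str, minimum: tuple = (0, 11, 0)) -> bool:
    """Compare the leading numeric release components against a minimum."""
    parts = []
    for piece in version.split(".")[: len(minimum)]:
        leading = "".join(takewhile(str.isdigit, piece))
        parts.append(int(leading) if leading else 0)
    # Pad short versions such as "1.0" so the tuples compare element-wise.
    parts += [0] * (len(minimum) - len(parts))
    return tuple(parts) >= tuple(minimum)


version = installed_version("vllm")
if version is None:
    print("vllm is not installed")
else:
    print(f"vllm {version}, meets minimum: {meets_minimum(version)}")
```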

Code Evidence

vLLM import guard in `autorag/nodes/generator/vllm.py:15-20`:

try:
    from vllm import SamplingParams, LLM
except ImportError:
    raise ImportError(
        "Please install vllm library. "
        "You can install it by running `pip install vllm`."
    )

vLLM embedding import guard in `autorag/embedding/vllm.py:75-81`:

try:
    from vllm import LLM as VLLModel
except ImportError:
    raise ImportError(
        "Could not import vllm python package. "
        "Please install it with `pip install vllm`."
    )

Tensor parallelism configuration in `autorag/embedding/vllm.py:24-27`:

tensor_parallel_size: Optional[int] = Field(
    default=1,
    description="The number of GPUs to use for distributed execution "
                "with tensor parallelism.",
)
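
Before setting `tensor_parallel_size` above its default of 1, it can help to validate the value against the hardware and the model. The sketch below assumes two constraints: the value cannot exceed the number of available GPUs, and the model's attention head count should be divisible by it (vLLM generally rejects configurations where it is not). The function name and its parameters are hypothetical and not part of AutoRAG.

```python
def check_tensor_parallel_size(
    tensor_parallel_size: int, gpu_count: int, num_attention_heads: int
) -> list[str]:
    """Return a list of problems with the requested tensor parallel size."""
    if tensor_parallel_size < 1:
        return ["tensor_parallel_size must be at least 1"]
    problems = []
    if tensor_parallel_size > gpu_count:
        problems.append(
            f"requested {tensor_parallel_size} GPUs but only {gpu_count} available"
        )
    if num_attention_heads % tensor_parallel_size != 0:
        problems.append(
            f"{num_attention_heads} attention heads not divisible by "
            f"{tensor_parallel_size}"
        )
    return problems


print(check_tensor_parallel_size(2, gpu_count=4, num_attention_heads=32))  # []
print(len(check_tensor_parallel_size(3, gpu_count=2, num_attention_heads=32)))  # 2
```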

GPU cleanup on deletion in `autorag/nodes/generator/vllm.py:38-58`:

def __del__(self):
    try:
        import torch
        if torch.cuda.is_available():
            from vllm.distributed.parallel_state import (
                destroy_model_parallel,
                destroy_distributed_environment,
            )
            destroy_model_parallel()
            destroy_distributed_environment()
            torch.cuda.empty_cache()
            torch.cuda.synchronize()
    except ImportError:
        del self.vllm_model
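
Python does not guarantee when `__del__` runs, so callers that load several models in sequence may want to release cached GPU memory explicitly as well. The sketch below is an assumption about how that could be done, not AutoRAG code: `release_cuda_cache` is a hypothetical helper that relies only on `gc.collect()` and `torch.cuda.empty_cache()`, and it does nothing when `torch` or a GPU is unavailable.

```python
import gc


def release_cuda_cache() -> bool:
    """Collect garbage, then empty the CUDA cache if torch and a GPU exist.

    Returns True if the CUDA cache was emptied, False otherwise.
    """
    gc.collect()  # drop unreachable objects that may still hold GPU tensors
    try:
        import torch
    except ImportError:
        return False  # torch is not installed, so there is nothing to release
    if not torch.cuda.is_available():
        return False
    torch.cuda.empty_cache()
    return True


print("CUDA cache emptied:", release_cuda_cache())
```

A typical call site would delete the generator object first (`del generator`) and then call the helper, so that the garbage collection pass can reclaim the model before the cache is emptied.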

Common Errors

| Error Message | Cause | Solution |
|---|---|---|
| `ImportError: Please install vllm library` | vLLM not installed | `pip install vllm` |
| `CUDA out of memory` | Model too large for GPU VRAM | Increase `tensor_parallel_size` to shard the model across more GPUs, or use a quantized model |
| `parameter logprob does not effective` | vLLM API forces logprobs=True | Informational warning only; no action needed |
| `parameter n does not effective` | vLLM API forces n=1 | Informational warning only; no action needed |

Compatibility Notes

  • Linux only: vLLM requires Linux. Windows and macOS are not natively supported.
  • Multi-GPU: Set `tensor_parallel_size` > 1 for distributed inference across multiple GPUs.
  • vLLM API mode: The `vllm_api` generator connects to a running vLLM server rather than loading the model in-process. This avoids GPU memory contention with other modules.
  • Cleanup: AutoRAG explicitly destroys distributed environments and empties CUDA cache on generator deletion to prevent memory leaks.
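
To illustrate the vLLM API mode described above, the sketch below builds (but does not send) an HTTP request for a vLLM server's OpenAI-compatible completions endpoint. The base URL, port, model name, and helper name are assumptions for illustration, and the payload that AutoRAG's `vllm_api` module actually sends may differ.

```python
import json
import urllib.request


def build_completion_request(
    base_url: str, model: str, prompt: str, max_tokens: int = 64
) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical server address and model name.
request = build_completion_request(
    "http://localhost:8000", "example-org/example-7b-model", "Hello"
)
print(request.get_method(), request.full_url)
```

Once a server is running, the request could be sent with `urllib.request.urlopen(request)`. Because the model lives in the server process, the client process holds no GPU memory of its own.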
