Environment:Marker Inc Korea AutoRAG VLLM Environment

Knowledge Sources	AutoRAG vLLM
Domains	Infrastructure, Deep_Learning, RAG
Last Updated	2026-02-08 06:00 GMT

Overview

vLLM-based environment for high-throughput local model serving in AutoRAG, covering both generation and embedding.

Description

This environment provides vLLM serving capabilities for AutoRAG. vLLM enables efficient local model inference with features such as PagedAttention, continuous batching, and tensor parallelism for multi-GPU setups. AutoRAG uses vLLM for two distinct purposes: (1) local text generation via the `vllm` generator module, and (2) local embedding via the `VLLMEmbedding` class. Both require NVIDIA GPUs with sufficient VRAM.

Usage

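Use this environment when running local LLM generation or local embedding without relying on external APIs. Depending on hardware and model size, it can provide higher throughput and lower latency than external API calls in production deployments. It is required when using `module_type: vllm` in YAML configuration; with `module_type: vllm_api`, AutoRAG connects to an already running vLLM server, so this environment must be available wherever that server runs.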
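As a rough illustration of the two module types named above, the sketch below builds generator module entries as Python dictionaries that mirror the shape of a YAML configuration. Only the `module_type` values come from this page. The `node_type`, `llm`, and `uri` keys, the model name, and the server address are assumptions for illustration and should be checked against the AutoRAG documentation.

```python
from typing import Any, Dict, List

# Hypothetical placeholders, not values taken from AutoRAG.
EXAMPLE_MODEL = "example-org/example-7b-instruct"
EXAMPLE_SERVER_URI = "http://localhost:8000"


def in_process_module(model: str) -> Dict[str, Any]:
    """Module entry that loads the model inside the AutoRAG process."""
    # "llm" as the key for the model name is an assumption.
    return {"module_type": "vllm", "llm": model}


def server_module(model: str, uri: str) -> Dict[str, Any]:
    """Module entry that talks to an already running vLLM server."""
    # "uri" as the key for the server address is an assumption.
    return {"module_type": "vllm_api", "llm": model, "uri": uri}


def generator_node(modules: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Wrap module entries in a generator node (node layout is assumed)."""
    return {"node_type": "generator", "modules": modules}


node = generator_node(
    [
        in_process_module(EXAMPLE_MODEL),
        server_module(EXAMPLE_MODEL, EXAMPLE_SERVER_URI),
    ]
)
```

Listing both entries under one node is only meant to show the difference in shape: the first entry needs a local GPU, the second needs a reachable server.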

System Requirements

Category	Requirement	Notes
OS	Linux (Ubuntu 20.04+)	vLLM does not support Windows or macOS natively
Hardware	NVIDIA GPU	Minimum 16GB VRAM for 7B models; A100/H100 recommended
CUDA	Compatible with vLLM >= 0.11.0	Check vLLM CUDA compatibility
RAM	32GB+	For model weight loading
Disk	20GB+ SSD	For model downloads
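The table above can be turned into a small pre-flight check. The following is a minimal sketch, not part of AutoRAG: the function names are hypothetical, the thresholds are copied from the table, and the RAM row is left out for brevity. GPU detection relies on `torch` and degrades to "no GPU found" when `torch` or CUDA is unavailable.

```python
import platform
import shutil
from typing import List

MIN_VRAM_GB = 16  # table: minimum for 7B models
MIN_DISK_GB = 20  # table: space for model downloads
GB = 1e9          # decimal gigabytes, so a nominal "16GB" card passes


def find_problems(system: str, vram_gb: List[float], free_disk_gb: float) -> List[str]:
    """Compare observed values with the requirements table and list mismatches."""
    problems = []
    if system != "Linux":
        problems.append(f"OS is {system}; vLLM targets Linux")
    if not vram_gb:
        problems.append("no NVIDIA GPU detected")
    elif max(vram_gb) < MIN_VRAM_GB:
        problems.append(f"largest GPU has {max(vram_gb):.1f} GB VRAM, below {MIN_VRAM_GB} GB")
    if free_disk_gb < MIN_DISK_GB:
        problems.append(f"only {free_disk_gb:.1f} GB free disk, below {MIN_DISK_GB} GB")
    return problems


def detect_vram_gb() -> List[float]:
    """Return total memory per visible CUDA device, or [] if torch/CUDA is missing."""
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [
        torch.cuda.get_device_properties(i).total_memory / GB
        for i in range(torch.cuda.device_count())
    ]


def check_host(download_dir: str = ".") -> List[str]:
    """Run the checks against the current machine."""
    free_gb = shutil.disk_usage(download_dir).free / GB
    return find_problems(platform.system(), detect_vram_gb(), free_gb)
```

An empty list from `check_host()` means none of the checked rows failed; it does not guarantee that a particular model will fit.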

Dependencies

System Packages

NVIDIA GPU drivers (latest stable)
CUDA toolkit (vLLM-compatible version)

Python Packages

`vllm` >= 0.11.0
`torch` >= 2.7.1
All packages from Environment:Marker_Inc_Korea_AutoRAG_GPU_PyTorch_Environment

Credentials

`HF_TOKEN`: Optional. Required only for gated Hugging Face models (e.g., Llama, Mistral).

Quick Install

# Install AutoRAG with GPU support (includes vLLM)
pip install "AutoRAG[gpu]"

# Or install vLLM separately
# Quote the requirement so the shell does not treat ">" as output redirection
pip install "vllm>=0.11.0"
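After installation, the minimum versions listed under Python Packages can be checked from Python. This is a minimal sketch using only the standard library; the helper names are hypothetical, and the parser reads only the leading numeric release, so a pre-release such as `0.11.0rc1` is treated like `0.11.0`.

```python
import re
from importlib import metadata
from typing import Dict, Optional, Tuple

# Minimum versions as listed on this page.
MINIMUMS = {"vllm": (0, 11, 0), "torch": (2, 7, 1)}


def parse_version(text: str) -> Tuple[int, ...]:
    """Read the leading numeric release, ignoring suffixes such as '+cu126'."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed: str, minimum: Tuple[int, ...]) -> bool:
    """Compare versions after padding both to the same length with zeros."""
    found = parse_version(installed)
    width = max(len(found), len(minimum))
    padded_found = found + (0,) * (width - len(found))
    padded_minimum = minimum + (0,) * (width - len(minimum))
    return padded_found >= padded_minimum


def report() -> Dict[str, Optional[bool]]:
    """Map each package to True/False, or None when it is not installed."""
    result: Dict[str, Optional[bool]] = {}
    for name, minimum in MINIMUMS.items():
        try:
            result[name] = meets_minimum(metadata.version(name), minimum)
        except metadata.PackageNotFoundError:
            result[name] = None
    return result
```

Calling `report()` in the target environment shows at a glance which package is missing or too old.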

Code Evidence

vLLM import guard in `autorag/nodes/generator/vllm.py:15-20`:

try:
    from vllm import SamplingParams, LLM
except ImportError:
    raise ImportError(
        "Please install vllm library. "
        "You can install it by running `pip install vllm`."
    )

vLLM embedding import guard in `autorag/embedding/vllm.py:75-81`:

try:
    from vllm import LLM as VLLModel
except ImportError:
    raise ImportError(
        "Could not import vllm python package. "
        "Please install it with `pip install vllm`."
    )

Tensor parallelism configuration in `autorag/embedding/vllm.py:24-27`:

tensor_parallel_size: Optional[int] = Field(
    default=1,
    description="The number of GPUs to use for distributed execution "
                "with tensor parallelism.",
)
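When picking a value for `tensor_parallel_size`, one constraint to keep in mind is that vLLM expects the model's attention head count to be divisible by it. The helper below is a hypothetical sketch that assumes this is the only constraint: it returns the largest size that fits the available GPUs and divides the head count. Other model-specific limits may apply.

```python
def choose_tensor_parallel_size(num_gpus: int, num_attention_heads: int) -> int:
    """Largest size <= num_gpus that evenly divides the attention head count."""
    if num_gpus < 1 or num_attention_heads < 1:
        raise ValueError("num_gpus and num_attention_heads must be positive")
    # Walk downward so the first match uses as many GPUs as possible.
    for size in range(num_gpus, 0, -1):
        if num_attention_heads % size == 0:
            return size
    return 1  # not reached: 1 divides every head count
```

For example, with 3 GPUs and a 32-head model this sketch settles on 2, leaving one GPU unused rather than choosing an invalid split.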

GPU cleanup on deletion in `autorag/nodes/generator/vllm.py:38-58`:

def __del__(self):
    try:
        import torch
        if torch.cuda.is_available():
            from vllm.distributed.parallel_state import (
                destroy_model_parallel,
                destroy_distributed_environment,
            )
            destroy_model_parallel()
            destroy_distributed_environment()
            torch.cuda.empty_cache()
            torch.cuda.synchronize()
    except ImportError:
        del self.vllm_model
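Because Python does not guarantee when `__del__` runs, it can be useful to free cached GPU memory explicitly between pipeline stages. The function below is a hypothetical helper, not part of AutoRAG. It covers only the cache-release part of the excerpt above and does nothing when `torch` or CUDA is unavailable.

```python
import gc


def release_cuda_cache() -> bool:
    """Free cached CUDA memory; return True only if the CUDA calls ran."""
    # Drop unreachable Python objects first so their tensors can be freed.
    gc.collect()
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    torch.cuda.empty_cache()   # return cached blocks to the driver
    torch.cuda.synchronize()   # wait for pending GPU work to finish
    return True
```

This does not tear down vLLM's distributed state; it only releases memory that PyTorch is caching after the model object itself has been deleted.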

Common Errors

Error Message	Cause	Solution
`ImportError: Please install vllm library`	vLLM not installed	`pip install vllm`
`CUDA out of memory`	Model too large for GPU VRAM	Increase `tensor_parallel_size` to shard the model across more GPUs, or use a quantized model
`parameter logprob does not effective`	vLLM API forces logprobs=True	Informational warning only; no action needed
`parameter n does not effective`	vLLM API forces n=1	Informational warning only; no action needed
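For the `CUDA out of memory` case, a back-of-the-envelope estimate of weight memory helps decide whether a model can fit at all. The sketch below is a rough approximation under stated assumptions: it counts model weights only (no KV cache, activations, or framework overhead), uses typical bytes-per-parameter values, and assumes tensor parallelism splits the weights evenly across GPUs.

```python
# Typical storage cost per parameter for common precisions.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(
    params_billion: float, dtype: str = "fp16", tensor_parallel_size: int = 1
) -> float:
    """Approximate per-GPU weight memory in GB (1 GB = 1e9 bytes)."""
    if tensor_parallel_size < 1:
        raise ValueError("tensor_parallel_size must be at least 1")
    # Billions of parameters times bytes per parameter gives gigabytes directly.
    total_gb = params_billion * BYTES_PER_PARAM[dtype]
    return total_gb / tensor_parallel_size
```

Under these assumptions a 7B model in fp16 needs about 14 GB for weights alone, which is consistent with the 16GB minimum listed under System Requirements. Sharding across more GPUs or quantizing lowers the per-GPU figure.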

Compatibility Notes

Linux only: vLLM requires Linux. Windows and macOS are not natively supported.
Multi-GPU: Set `tensor_parallel_size` > 1 for distributed inference across multiple GPUs.
vLLM API mode: The `vllm_api` generator connects to a running vLLM server rather than loading the model in-process. This avoids GPU memory contention with other modules.
Cleanup: AutoRAG explicitly destroys distributed environments and empties CUDA cache on generator deletion to prevent memory leaks.
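To make the `vllm_api` mode more concrete, the sketch below sends a prompt to a running vLLM server through its OpenAI-compatible `/v1/completions` endpoint. This is not AutoRAG's client code. The function names, default values, and payload layout are assumptions, with `n` and `logprobs` pinned to mirror the forced settings described under Common Errors.

```python
import json
import urllib.request
from typing import Any, Dict


def build_completion_request(
    model: str, prompt: str, max_tokens: int = 256, temperature: float = 0.0
) -> Dict[str, Any]:
    """Build a completions payload with n and logprobs fixed."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "n": 1,         # one completion per prompt
        "logprobs": 1,  # request log probabilities for sampled tokens
    }


def post_completion(base_uri: str, payload: Dict[str, Any]) -> Dict[str, Any]:
    """POST the payload to the server and return the decoded JSON response."""
    request = urllib.request.Request(
        url=base_uri.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))
```

Because the model lives in the server process, the client side of this exchange needs no GPU, which is what lets other modules use the GPU without contention.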
