Environment:Marker Inc Korea AutoRAG VLLM Environment
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Deep_Learning, RAG |
| Last Updated | 2026-02-08 06:00 GMT |
Overview
vLLM-based environment for high-throughput local model serving for both generation and embedding in AutoRAG.
Description
This environment provides vLLM serving capabilities for AutoRAG. vLLM enables efficient local model inference through PagedAttention, continuous batching, and tensor parallelism for multi-GPU setups. AutoRAG uses vLLM for two distinct purposes: (1) local text generation via the `vllm` generator module, and (2) local embedding via the `VLLMEmbedding` class. Both require NVIDIA GPUs with sufficient VRAM.
Usage
Use this environment when running local LLM generation or local embedding without relying on external APIs. Depending on hardware and model size, local serving can provide higher throughput and lower latency than API-based alternatives in production deployments. Required when using `module_type: vllm` or `module_type: vllm_api` in YAML configuration.
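As a rough illustration of where `module_type` appears, the sketch below builds a generator node containing one in-process `vllm` module and one `vllm_api` module, then prints it as JSON, which YAML parsers generally accept as well. The model name, the `uri` value, the `llm` and `max_tokens` keys, the metric names, and the surrounding `node_lines` layout are assumptions for illustration; check them against the AutoRAG documentation for your version.

```python
import json

# Hypothetical model name and server address; replace with your own.
MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"
VLLM_SERVER_URI = "http://localhost:8000"

config = {
    "node_lines": [
        {
            "node_line_name": "generate_node_line",  # illustrative name
            "nodes": [
                {
                    "node_type": "generator",
                    "strategy": {"metrics": ["bleu", "rouge"]},
                    "modules": [
                        # In-process vLLM: loads the model onto local GPUs.
                        {
                            "module_type": "vllm",
                            "llm": MODEL_NAME,
                            "max_tokens": 512,
                        },
                        # vLLM API mode: talks to an already running server.
                        {
                            "module_type": "vllm_api",
                            "llm": MODEL_NAME,
                            "uri": VLLM_SERVER_URI,
                            "max_tokens": 512,
                        },
                    ],
                }
            ],
        }
    ]
}

# Serialize the structure so it can be saved as a config file.
config_text = json.dumps(config, indent=2)
print(config_text)
```

Listing both module types in one node is only meant to show the two spellings side by side; a real configuration would typically pick whichever matches the deployment.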
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu 20.04+) | vLLM does not support Windows or macOS natively |
| Hardware | NVIDIA GPU | Minimum 16GB VRAM for 7B models; A100/H100 recommended |
| CUDA | Compatible with vLLM >= 0.11.0 | Check vLLM CUDA compatibility |
| RAM | 32GB+ | For model weight loading |
| Disk | 20GB+ SSD | For model downloads |
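The 16GB figure for 7B models can be sanity-checked with simple arithmetic: weight memory is roughly the parameter count times the bytes per parameter. The helper below is a hypothetical back-of-the-envelope estimate, not part of AutoRAG or vLLM, and it counts weights only.

```python
def estimate_weight_memory_gb(num_params_billion: float,
                              bytes_per_param: float = 2.0) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes).

    One billion parameters at one byte each occupy one GB, so the estimate
    is simply billions of parameters times bytes per parameter.
    """
    return num_params_billion * bytes_per_param


fp16_gb = estimate_weight_memory_gb(7)       # 16-bit weights: 2 bytes each
int4_gb = estimate_weight_memory_gb(7, 0.5)  # 4-bit weights: 0.5 bytes each

print(f"7B model, 16-bit weights: ~{fp16_gb:.1f} GB")  # ~14.0 GB
print(f"7B model, 4-bit weights:  ~{int4_gb:.1f} GB")  # ~3.5 GB
```

The KV cache and activations need memory on top of the weights, which is why 16GB leaves little headroom for a 7B model at 16-bit precision.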
Dependencies
System Packages
- NVIDIA GPU drivers (latest stable)
- CUDA toolkit (vLLM-compatible version)
Python Packages
- `vllm` >= 0.11.0
- `torch` >= 2.7.1
- All packages from Environment:Marker_Inc_Korea_AutoRAG_GPU_PyTorch_Environment
Credentials
- `HF_TOKEN`: Optional. Required only for gated Hugging Face models (e.g., Llama, Mistral).
Quick Install
# Install AutoRAG with GPU support (includes vLLM)
pip install "AutoRAG[gpu]"
# Or install vLLM separately
pip install "vllm>=0.11.0"
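After installation, a quick check along the following lines can confirm that the installed vLLM meets the minimum version. This is a minimal sketch using only the standard library; `parse_version` and `vllm_meets_minimum` are hypothetical helper names, and the parser deliberately ignores pre-release and local suffixes, so a release candidate of 0.11.0 would count as 0.11.0.

```python
from importlib import metadata

MIN_VLLM = (0, 11, 0)


def parse_version(text: str) -> tuple:
    """Turn '0.11.0' or '0.11.0.post1' into a tuple of three integers."""
    parts = []
    # Drop any local suffix such as '+cu118', then read leading digits.
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    # Pad short versions such as '0.11' so comparisons behave as expected.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts[:3])


def vllm_meets_minimum(minimum: tuple = MIN_VLLM) -> bool:
    """Return True if an installed vllm distribution is at least `minimum`."""
    try:
        installed = metadata.version("vllm")
    except metadata.PackageNotFoundError:
        return False
    return parse_version(installed) >= minimum


print("vllm >= 0.11.0 installed:", vllm_meets_minimum())
```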
Code Evidence
vLLM import guard in `autorag/nodes/generator/vllm.py:15-20`:
try:
    from vllm import SamplingParams, LLM
except ImportError:
    raise ImportError(
        "Please install vllm library. "
        "You can install it by running `pip install vllm`."
    )
vLLM embedding import guard in `autorag/embedding/vllm.py:75-81`:
try:
    from vllm import LLM as VLLModel
except ImportError:
    raise ImportError(
        "Could not import vllm python package. "
        "Please install it with `pip install vllm`."
    )
Tensor parallelism configuration in `autorag/embedding/vllm.py:24-27`:
tensor_parallel_size: Optional[int] = Field(
    default=1,
    description="The number of GPUs to use for distributed execution "
    "with tensor parallelism.",
)
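When choosing a value for `tensor_parallel_size`, it helps to know that vLLM expects the model's attention head count to be divisible by the tensor parallel size. The sketch below picks the largest valid size for a given GPU count. `pick_tensor_parallel_size` and `load_llm` are hypothetical helpers, not AutoRAG functions, and the head count has to come from the model's own configuration.

```python
def pick_tensor_parallel_size(num_gpus: int, num_attention_heads: int) -> int:
    """Largest size <= num_gpus that evenly divides the attention head count."""
    if num_gpus < 1 or num_attention_heads < 1:
        raise ValueError("num_gpus and num_attention_heads must be positive")
    size = min(num_gpus, num_attention_heads)
    # Walk downwards; size 1 always divides, so the loop always terminates.
    while num_attention_heads % size != 0:
        size -= 1
    return size


def load_llm(model: str, num_gpus: int, num_attention_heads: int):
    """Load a vLLM engine sharded across GPUs (needs vllm and NVIDIA GPUs)."""
    # Imported lazily so the helper above can be used without vllm installed.
    from vllm import LLM

    size = pick_tensor_parallel_size(num_gpus, num_attention_heads)
    return LLM(model=model, tensor_parallel_size=size)


print(pick_tensor_parallel_size(num_gpus=4, num_attention_heads=32))  # 4
print(pick_tensor_parallel_size(num_gpus=3, num_attention_heads=32))  # 2
```

With three GPUs and 32 heads the helper falls back to 2, leaving one GPU unused rather than passing a size that the engine would reject.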
GPU cleanup on deletion in `autorag/nodes/generator/vllm.py:38-58`:
def __del__(self):
    try:
        import torch
        if torch.cuda.is_available():
            from vllm.distributed.parallel_state import (
                destroy_model_parallel,
                destroy_distributed_environment,
            )
            destroy_model_parallel()
            destroy_distributed_environment()
            torch.cuda.empty_cache()
            torch.cuda.synchronize()
    except ImportError:
        del self.vllm_model
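The same cleanup idea can be applied outside the generator class, for example between experiments in a notebook. The sketch below is a hypothetical standalone helper, not AutoRAG code. It assumes the caller has already dropped every reference to the model, because `torch.cuda.empty_cache()` can only return cached blocks that no live tensor still occupies.

```python
import gc


def release_gpu_memory() -> bool:
    """Best-effort release of cached CUDA memory.

    Returns True if the CUDA cache was flushed, False if torch is not
    installed or no CUDA device is available.
    """
    # Collect unreachable Python objects first so their tensors are freed.
    gc.collect()
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    # Return cached, unused blocks to the driver, then wait for pending work.
    torch.cuda.empty_cache()
    torch.cuda.synchronize()
    return True


flushed = release_gpu_memory()
print("CUDA cache flushed:", flushed)
```

Call it after `del model` (or after the owning object goes out of scope); calling it while the model is still referenced frees little or nothing.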
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `ImportError: Please install vllm library` | vLLM not installed | `pip install vllm` |
| `CUDA out of memory` | Model too large for GPU VRAM | Increase `tensor_parallel_size` to shard the model across more GPUs, or use a quantized model |
| `parameter logprob does not effective` | vLLM API forces logprobs=True | Informational warning only; no action needed |
| `parameter n does not effective` | vLLM API forces n=1 | Informational warning only; no action needed |
Compatibility Notes
- Linux only: vLLM requires Linux. Windows and macOS are not supported natively.
- Multi-GPU: Set `tensor_parallel_size` > 1 for distributed inference across multiple GPUs.
- vLLM API mode: The `vllm_api` generator connects to a running vLLM server rather than loading the model in-process. This avoids GPU memory contention with other modules.
- Cleanup: AutoRAG explicitly destroys distributed environments and empties CUDA cache on generator deletion to prevent memory leaks.
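To make the vLLM API mode concrete, the sketch below sends a prompt to a vLLM server over its OpenAI-compatible HTTP interface using only the standard library. It assumes a server was started separately (for example with `vllm serve <model>`) and listens on `http://localhost:8000`. The helper names, the model name `my-model`, and the port are illustrative, and this is not AutoRAG's own client code.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 64) -> urllib.request.Request:
    """Build a POST request for the server's /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url: str, model: str, prompt: str) -> str:
    """Send the request and return the first generated completion text."""
    request = build_completion_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["choices"][0]["text"]


# Building the request does not contact the server; complete() does.
request = build_completion_request(
    "http://localhost:8000", "my-model", "What is RAG?"
)
print(request.get_method(), request.full_url)
```

Because the model lives in the server process, the client side needs no GPU, which is what keeps it from competing for VRAM with other modules in the same pipeline.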