# Environment: TorchServe vLLM Engine Environment
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Infrastructure |
| Last Updated | 2026-02-13 00:00 GMT |
## Overview
vLLM engine environment with AsyncLLMEngine, OpenAI-compatible API, and GPU inference for serving large language models via TorchServe.
## Description
This environment extends the CUDA GPU environment with the vLLM inference engine for high-throughput LLM serving. vLLM provides PagedAttention for efficient KV cache management, continuous batching, and an OpenAI-compatible API. The TorchServe VLLMHandler wraps vLLM's `AsyncLLMEngine` and exposes chat completions and text completions endpoints. The handler forces the `spawn` multiprocessing method for clean worker process isolation.
## Usage
Use this environment when deploying large language models via TorchServe with the vLLM backend. Required for the LLM Deployment (vLLM) workflow and any model configuration that specifies `engine: vllm` in the LLM launcher.
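As an illustration, a launcher-style model configuration selecting the vLLM engine might carry fields like the following. The key names and model ID here are illustrative assumptions, not the launcher's exact schema; consult the TorchServe LLM launcher documentation for the real keys.

```python
# Hypothetical configuration sketch; key names and values are illustrative.
config = {
    "model_id": "meta-llama/Meta-Llama-3-8B-Instruct",  # assumed example HF Hub model
    "engine": "vllm",    # selects the vLLM backend, as described above
    "batchSize": 1,      # TorchServe-level batching must stay at 1
    "max_num_seqs": 16,  # vLLM's internal continuous-batching limit
}
```

Note that batching happens inside vLLM (`max_num_seqs`), not at the TorchServe frontend (`batchSize`).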
## System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu 20.04+) | vLLM has limited Windows/macOS support |
| Hardware | NVIDIA GPU | Minimum 16GB VRAM recommended for 7B models |
| VRAM | 16-80GB | Depends on model size and `max_num_seqs` |
| Disk | 50GB+ | For model weights download and caching |
## Dependencies
### System Packages
- NVIDIA GPU driver >= 525
- CUDA Toolkit >= 11.8
### Python Packages
- `vllm` (provides AsyncEngineArgs, AsyncLLMEngine, OpenAI protocol classes)
- `torch` with CUDA support
- `torchserve`
### Credentials & Environment Variables
- `VLLM_WORKER_MULTIPROC_METHOD`: Automatically set to `spawn` by the handler. Do not override.
- `HF_TOKEN` or `HUGGING_FACE_HUB_TOKEN`: Required for gated models on HuggingFace Hub.
## Quick Install

```bash
# Install vLLM
pip install vllm

# Install TorchServe and the model archiver
pip install torchserve torch-model-archiver
```
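After installing, a quick import check confirms the packages are visible to the Python interpreter that will run TorchServe. A small sketch (`ts` is the import name the torchserve package installs):

```python
import importlib.util

def check_dependencies(modules=("vllm", "torch", "ts")) -> dict[str, bool]:
    """Map each required top-level package name to whether it is importable
    in the current environment, without actually importing it."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}
```

Any `False` entry means the corresponding `pip install` step above did not land in this environment.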
## Code Evidence
vLLM imports from `ts/torch_handler/vllm_handler.py:8-16`:
```python
from vllm import AsyncEngineArgs, AsyncLLMEngine
from vllm.entrypoints.openai.protocol import (
    ChatCompletionRequest,
    CompletionRequest,
    ErrorResponse,
)
from vllm.entrypoints.openai.serving_chat import OpenAIServingChat
from vllm.entrypoints.openai.serving_completion import OpenAIServingCompletion
from vllm.entrypoints.openai.serving_engine import LoRAModulePath
```
Multiprocessing method enforcement from `ts/torch_handler/vllm_handler.py:45`:
```python
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
```
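A custom entry point that constructs `AsyncLLMEngine` directly (outside the stock handler) would need the same guard before the engine is created, since the handler only sets the variable for its own process. A sketch of that guard as a helper:

```python
import os

def ensure_spawn_workers() -> None:
    """Force vLLM worker processes to start via spawn, mirroring the stock
    handler; fork-started workers can deadlock once CUDA has been
    initialized in the parent process."""
    os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
```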
Tensor parallel size auto-detection from `ts/llm_launcher.py:92-94`:
"tensor_parallel_size": torch.cuda.device_count()
if torch.cuda.is_available
else 1,
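The same device-count selection can be mirrored as a pure function for clarity. This is a sketch only; the launcher reads both values from `torch.cuda` at startup:

```python
def tensor_parallel_size(cuda_available: bool, device_count: int) -> int:
    """Use one tensor-parallel rank per visible GPU, falling back to a
    single rank when CUDA is unavailable (CPU-only or misconfigured host)."""
    return device_count if cuda_available else 1
```

With `CUDA_VISIBLE_DEVICES=0,1`, the device count is 2 and the model is sharded across both GPUs.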
Batch size assertion from `ts/torch_handler/vllm_handler.py:109`:
```python
assert len(requests) == 1, "Expecting batch_size = 1"
```
## Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `ImportError: No module named 'vllm'` | vLLM not installed | `pip install vllm` |
| `Expecting batch_size = 1` | TorchServe batch_size > 1 | Set `batchSize: 1` in model config; vLLM handles internal batching via `max_num_seqs` |
| `Unknown API endpoint: ...` | Invalid URL path | Use `v1/completions`, `v1/chat/completions`, or `v1/models` endpoints |
| CUDA out of memory during vLLM init | Model too large for GPU | Reduce `max_num_seqs`, increase `tensor_parallel_size`, or use larger GPU |
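For the out-of-memory row above, a back-of-the-envelope check of whether the weights alone fit is often enough to pick a GPU. A rough sketch, covering weights only; the KV cache that vLLM reserves via `gpu_memory_utilization` comes on top:

```python
def estimate_weight_vram_gb(num_params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate VRAM needed for the model weights alone
    (2 bytes per parameter for fp16/bf16 checkpoints)."""
    return num_params_billion * 1e9 * bytes_per_param / 1024**3
```

A 7B model at bf16 needs roughly 13 GB for weights alone, which is consistent with the 16 GB minimum recommended in the requirements table: the remaining headroom goes to KV cache, activations, and CUDA overhead.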
## Compatibility Notes
- Tensor Parallelism: Automatically set to `torch.cuda.device_count()` by the LLM launcher. Multi-GPU setups split the model across all visible GPUs.
- LoRA Adapters: Supported via `adapters` config in handler YAML. LoRA IDs are passed to vLLM engine.
- Streaming: Supported via `send_intermediate_predict_response()` for token-by-token generation.
- OpenAI Compatibility: Exposes `v1/chat/completions` and `v1/completions` endpoints matching OpenAI API format.
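Because the endpoints follow the OpenAI schema, request bodies can be built exactly as for the OpenAI API. A sketch; the model name is deployment-specific, and the TorchServe URL the body is POSTed to depends on your server configuration:

```python
import json

def chat_completion_payload(model: str, user_message: str, stream: bool = False) -> str:
    """Build an OpenAI-style request body for the v1/chat/completions endpoint."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "stream": stream,  # True streams tokens via intermediate responses
    })
```

POST the result with `Content-Type: application/json` to your TorchServe host's `v1/chat/completions` route.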