Environment: KServe vLLM Runtime
| Knowledge Sources | Details |
|---|---|
| Domains | LLM_Serving, GPU_Computing |
| Last Updated | 2026-02-13 14:00 GMT |
Overview
vLLM 0.11.2 inference engine for high-throughput LLM serving with PagedAttention, prefix caching, and disaggregated prefill-decode.
Description
vLLM is the default inference engine for LLMInferenceService deployments. It provides efficient GPU memory management through PagedAttention, automatic prefix caching for repeated prompt patterns, and disaggregated prefill/decode inference, using NixlConnector for KV-cache transfer. The HuggingFace server in KServe wraps vLLM for model serving.
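Once a HuggingFace server pod is up, the OpenAI-compatible completions route that vLLM exposes can be exercised with the standard library alone. A minimal sketch, assuming the `/openai/v1/completions` route path and using placeholder host and model names:

```python
import json
import urllib.request


def build_completion_request(model: str, prompt: str, max_tokens: int = 64) -> dict:
    """Build an OpenAI-style completion payload for the vLLM endpoint."""
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens}


def complete(base_url: str, model: str, prompt: str) -> dict:
    """POST to the OpenAI-compatible completions route served by vLLM."""
    body = json.dumps(build_completion_request(model, prompt)).encode()
    req = urllib.request.Request(
        f"{base_url}/openai/v1/completions",  # assumed route for the KServe HuggingFace server
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

In practice `base_url` is the predictor's service URL, e.g. the address KServe reports in the InferenceService status.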
Usage
Use this environment for LLM inference serving with GPU acceleration. Required for LLMInferenceService deployments and GPU-based HuggingFace model serving.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Python | >= 3.10, < 3.14 | For HuggingFace server |
| CUDA | Compatible with vLLM build | vLLM includes CUDA kernels |
| GPU | NVIDIA with sufficient VRAM | Model-dependent |
| vLLM | 0.11.2 | From kserve pyproject.toml |
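"Sufficient VRAM" is dominated by model weights plus KV cache. A rough, illustrative estimate of the per-token KV-cache cost; the Llama-3-8B-like shape below is an assumption for the example, not taken from this repo:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    """Approximate KV-cache size for one sequence: a K and a V tensor per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes


# Assumed Llama-3-8B-like shape: 32 layers, 8 KV heads (GQA), head_dim 128
per_token = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=1)
# → 131072 bytes (128 KiB) of KV cache per token at fp16
```

Multiplying by the expected context length and concurrency gives a first-order VRAM budget on top of the weights.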
Dependencies
Python Packages
- `vllm` == 0.11.2
- `transformers` >= 4.53.2
- `accelerate` >= 1.6.0, < 2.0.0
- `bitsandbytes` >= 0.45.3
- `torch` (bundled with vLLM)
Credentials
- `HF_TOKEN`: HuggingFace API token for gated model downloads (e.g., Llama, Qwen)
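For gated models the token only needs to be exported in the serving environment (or injected as a Kubernetes secret). The value below is a placeholder:

```shell
# Placeholder value; real HuggingFace tokens start with "hf_"
export HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxx"
```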
Quick Install
# vLLM is included in the KServe HuggingFace server image
# For local development:
pip install vllm==0.11.2
pip install "transformers>=4.53.2" "accelerate>=1.6.0,<2.0.0" "bitsandbytes>=0.45.3"
Code Evidence
vLLM dependency from `python/kserve/pyproject.toml`:
[project.optional-dependencies]
llm = [
"vllm==0.11.2",
]
HuggingFace server dependencies from `python/huggingfaceserver/pyproject.toml`:
dependencies = [
"kserve[llm]",
"transformers>=4.53.2",
"accelerate<2.0.0,>=1.6.0",
"bitsandbytes>=0.45.3",
]
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `CUDA out of memory` | Insufficient VRAM for model | Reduce `--gpu-memory-utilization` or use smaller model |
| `Model is too large` | max_model_len exceeds memory | Set `--max-model-len` to a smaller value |
| NixlConnector timeout | RDMA not configured | Verify SR-IOV/RDMA network and `KSERVE_INFER_ROCE` env var |
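The first two errors usually come down to the same budget: vLLM reserves `--gpu-memory-utilization` of total VRAM, and whatever remains after the weights must hold the KV cache. An illustrative budget check; the GPU size and weight footprint below are assumed example numbers:

```python
def kv_cache_budget_gib(total_vram_gib: float, gpu_memory_utilization: float,
                        weights_gib: float) -> float:
    """VRAM left for KV cache after vLLM's reservation and the model weights."""
    return total_vram_gib * gpu_memory_utilization - weights_gib


# Assumed: 80 GiB GPU, default 0.9 utilization, ~15 GiB fp16 weights for an 8B model
budget = kv_cache_budget_gib(80, 0.9, 15)
# → 57.0 GiB left for KV cache; a negative result predicts "CUDA out of memory"
```

Lowering `--max-model-len` shrinks the KV-cache requirement; lowering `--gpu-memory-utilization` shrinks the whole budget, which helps when other processes share the GPU.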
Compatibility Notes
- CPU inference: Supported but significantly slower; set `--device cpu`
- Prefix caching: Requires matching `PYTHONHASHSEED` and `--block-size` across pods and scheduler
- Disaggregated PD: Requires NixlConnector and RDMA network for KV cache transfer
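The `PYTHONHASHSEED` requirement on prefix caching exists because cache keys are derived from Python hashes, and string hashing is salted per process: two pods started with different seeds compute different keys for identical token blocks and never share cache entries. A small illustration of the underlying Python behavior (not vLLM's actual hashing code):

```python
import os

# A token-block key of the kind a prefix cache might hash (illustrative only)
block = ("system prompt", 17, 42)

# Deterministic within one interpreter:
assert hash(block) == hash(("system prompt", 17, 42))

# But str hashes are salted per process unless PYTHONHASHSEED is pinned,
# so a second pod with a different seed would compute a different value.
print("PYTHONHASHSEED =", os.environ.get("PYTHONHASHSEED", "unset (random salt)"))
```

Pinning the same `PYTHONHASHSEED` on every pod and the scheduler keeps the keys stable across processes.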