
Environment: KServe vLLM Runtime

From Leeroopedia
Knowledge Sources
Domains: LLM_Serving, GPU_Computing
Last Updated: 2026-02-13 14:00 GMT

Overview

vLLM 0.11.2 is the inference engine used for high-throughput LLM serving, providing PagedAttention, automatic prefix caching, and disaggregated prefill/decode.

Description

vLLM is the default inference engine for LLMInferenceService deployments. It provides efficient GPU memory management through PagedAttention, automatic prefix caching for repeated prompt patterns, and support for disaggregated inference via NixlConnector for KV cache transfer. The HuggingFace server in KServe wraps vLLM for model serving.
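To make the PagedAttention idea concrete, here is a toy sketch of the block-table bookkeeping it implies: the KV cache is carved into fixed-size blocks, and each sequence holds a table of physical block IDs, so memory is allocated in block granularity rather than reserved for the full context up front. This is an illustration of the concept only, not vLLM's implementation (which lives in CUDA/C++ and also handles eviction, copy-on-write, and caching):

```python
# Toy sketch of PagedAttention-style block-table bookkeeping (not vLLM's code).

BLOCK_SIZE = 16  # tokens per KV-cache block (vLLM's default block size)

class BlockAllocator:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # seq id -> physical block IDs

    def append_token(self, seq_id: str, position: int) -> None:
        """Allocate a new physical block only when the sequence crosses a
        BLOCK_SIZE boundary; otherwise the last block still has room."""
        table = self.block_tables.setdefault(seq_id, [])
        if position % BLOCK_SIZE == 0:  # first token of a new block
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted: no free blocks")
            table.append(self.free_blocks.pop())

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

alloc = BlockAllocator(num_blocks=4)
for pos in range(40):  # 40 tokens -> ceil(40/16) = 3 blocks
    alloc.append_token("request-1", pos)
print(len(alloc.block_tables["request-1"]))  # 3
```

Because blocks are returned to a shared pool as sequences finish, fragmentation stays at block granularity, which is what lets vLLM pack many concurrent requests into one GPU's KV cache.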

Usage

Use this environment for LLM inference serving with GPU acceleration. Required for LLMInferenceService deployments and GPU-based HuggingFace model serving.

System Requirements

| Category | Requirement | Notes |
|----------|-------------|-------|
| Python | >= 3.10, < 3.14 | For HuggingFace server |
| CUDA | Compatible with vLLM build | vLLM includes CUDA kernels |
| GPU | NVIDIA with sufficient VRAM | Model-dependent |
| vLLM | 0.11.2 | Pinned in kserve pyproject.toml |

Dependencies

Python Packages

  • `vllm` == 0.11.2
  • `transformers` >= 4.53.2
  • `accelerate` >= 1.6.0, < 2.0.0
  • `bitsandbytes` >= 0.45.3
  • `torch` (bundled with vLLM)

Credentials

  • `HF_TOKEN`: HuggingFace API token for gated model downloads (e.g., Llama, Qwen)
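A small fail-fast check can save a long wait before a gated download errors out. The sketch below only assumes the `HF_TOKEN` environment variable named above; `huggingface_hub` picks the token up from that variable automatically:

```python
import os

def require_hf_token() -> str:
    """Fail fast with a clear message if HF_TOKEN is missing.
    huggingface_hub reads the token from this variable automatically."""
    token = os.environ.get("HF_TOKEN", "")
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set; gated models (e.g. Llama, Qwen) "
            "will fail to download. Export a HuggingFace access token first."
        )
    return token
```

Calling `require_hf_token()` at startup surfaces a missing token immediately instead of mid-deployment.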

Quick Install

# vLLM is included in the KServe HuggingFace server image.
# For local development:
pip install vllm==0.11.2
pip install "transformers>=4.53.2" "accelerate>=1.6.0,<2.0.0" "bitsandbytes>=0.45.3"
# Note: the version specifiers must be quoted, or the shell treats ">" as a redirect.

Code Evidence

vLLM dependency from `python/kserve/pyproject.toml`:

[project.optional-dependencies]
llm = [
    "vllm==0.11.2",
]

HuggingFace server dependencies from `python/huggingfaceserver/pyproject.toml`:

dependencies = [
    "kserve[llm]",
    "transformers>=4.53.2",
    "accelerate<2.0.0,>=1.6.0",
    "bitsandbytes>=0.45.3",
]

Common Errors

| Error Message | Cause | Solution |
|---------------|-------|----------|
| `CUDA out of memory` | Insufficient VRAM for the model | Reduce `--gpu-memory-utilization` or use a smaller model |
| `Model is too large` | `max_model_len` exceeds available memory | Set `--max-model-len` to a smaller value |
| NixlConnector timeout | RDMA not configured | Verify the SR-IOV/RDMA network and the `KSERVE_INFER_ROCE` env var |
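When tuning `--gpu-memory-utilization` and `--max-model-len`, a back-of-envelope memory estimate helps explain both OOM errors above: weights take roughly `params x bytes_per_param`, and the KV cache grows linearly with context length. The model shapes below (32 layers, 8 KV heads, head_dim 128, bf16) are illustrative assumptions, not values taken from this page:

```python
def weight_gib(params_b: float, bytes_per_param: int = 2) -> float:
    """Approximate weight footprint in GiB (fp16/bf16 -> 2 bytes per param)."""
    return params_b * 1e9 * bytes_per_param / 2**30

def kv_cache_gib(num_layers: int, num_kv_heads: int, head_dim: int,
                 seq_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache for one sequence: 2 (K and V) x layers x kv_heads x head_dim x tokens."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem / 2**30

# Illustrative 8B-parameter model with GQA (assumed shapes):
weights = weight_gib(8.0)                    # ~14.9 GiB in bf16
kv = kv_cache_gib(32, 8, 128, seq_len=8192)  # ~1.0 GiB per 8k-token sequence
print(f"weights ~{weights:.1f} GiB, kv/seq ~{kv:.1f} GiB")
```

If weights plus one sequence's KV cache already approach the VRAM budget implied by `--gpu-memory-utilization`, lowering `--max-model-len` is usually the first lever to pull.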

Compatibility Notes

  • CPU inference: Supported but significantly slower; set `--device cpu`
  • Prefix caching: Requires matching `PYTHONHASHSEED` and `--block-size` across pods and scheduler
  • Disaggregated PD: Requires NixlConnector and RDMA network for KV cache transfer
