
Environment: KServe vLLM Runtime

From Leeroopedia
Knowledge Sources
Domains: LLM_Serving, GPU_Computing
Last Updated: 2026-02-13 14:00 GMT

Overview

vLLM 0.11.2 is the inference engine used for high-throughput LLM serving, providing PagedAttention, automatic prefix caching, and disaggregated prefill/decode.

Description

vLLM is the default inference engine for LLMInferenceService deployments. It provides efficient GPU memory management through PagedAttention, automatic prefix caching for repeated prompt patterns, and support for disaggregated inference via NixlConnector for KV cache transfer. The HuggingFace server in KServe wraps vLLM for model serving.
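To make the PagedAttention idea concrete, here is a toy sketch of the block-table bookkeeping it implies: the KV cache is carved into fixed-size blocks, and each sequence holds a table of physical block IDs, so memory is allocated in block granularity rather than reserved for the full context up front. This is an illustration of the concept only, not vLLM's implementation (which lives in CUDA/C++ and also handles eviction, copy-on-write, and caching):

```python
# Toy sketch of PagedAttention-style block-table bookkeeping (not vLLM's code).

BLOCK_SIZE = 16  # tokens per KV-cache block (vLLM's default block size)

class BlockAllocator:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # seq id -> physical block IDs

    def append_token(self, seq_id: str, position: int) -> None:
        """Allocate a new physical block only when the sequence crosses a
        BLOCK_SIZE boundary; otherwise the last block still has room."""
        table = self.block_tables.setdefault(seq_id, [])
        if position % BLOCK_SIZE == 0:  # first token of a new block
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted: no free blocks")
            table.append(self.free_blocks.pop())

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

alloc = BlockAllocator(num_blocks=4)
for pos in range(40):  # 40 tokens -> ceil(40/16) = 3 blocks
    alloc.append_token("request-1", pos)
print(len(alloc.block_tables["request-1"]))  # 3
```

Because blocks are returned to a shared pool as sequences finish, fragmentation stays at block granularity, which is what lets vLLM pack many concurrent requests into one GPU's KV cache.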

Usage

Use this environment for LLM inference serving with GPU acceleration. Required for LLMInferenceService deployments and GPU-based HuggingFace model serving.

System Requirements

| Category | Requirement | Notes |
|----------|-------------|-------|
| Python | >= 3.10, < 3.14 | For HuggingFace server |
| CUDA | Compatible with vLLM build | vLLM includes CUDA kernels |
| GPU | NVIDIA with sufficient VRAM | Model-dependent |
| vLLM | 0.11.2 | Pinned in kserve pyproject.toml |

Dependencies

Python Packages

  • `vllm` == 0.11.2
  • `transformers` >= 4.53.2
  • `accelerate` >= 1.6.0, < 2.0.0
  • `bitsandbytes` >= 0.45.3
  • `torch` (bundled with vLLM)

Credentials

  • `HF_TOKEN`: HuggingFace API token for gated model downloads (e.g., Llama, Qwen)
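A small fail-fast check can save a long wait before a gated download errors out. The sketch below only assumes the `HF_TOKEN` environment variable named above; `huggingface_hub` picks the token up from that variable automatically:

```python
import os

def require_hf_token() -> str:
    """Fail fast with a clear message if HF_TOKEN is missing.
    huggingface_hub reads the token from this variable automatically."""
    token = os.environ.get("HF_TOKEN", "")
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set; gated models (e.g. Llama, Qwen) "
            "will fail to download. Export a HuggingFace access token first."
        )
    return token
```

Calling `require_hf_token()` at startup surfaces a missing token immediately instead of mid-deployment.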

Quick Install

# vLLM is included in the KServe HuggingFace server image.
# For local development:
pip install vllm==0.11.2
pip install "transformers>=4.53.2" "accelerate>=1.6.0,<2.0.0" "bitsandbytes>=0.45.3"
# Note: the version specifiers must be quoted, or the shell treats ">" as a redirect.

Code Evidence

vLLM dependency from `python/kserve/pyproject.toml`:

[project.optional-dependencies]
llm = [
    "vllm==0.11.2",
]

HuggingFace server dependencies from `python/huggingfaceserver/pyproject.toml`:

dependencies = [
    "kserve[llm]",
    "transformers>=4.53.2",
    "accelerate<2.0.0,>=1.6.0",
    "bitsandbytes>=0.45.3",
]

Common Errors

| Error Message | Cause | Solution |
|---------------|-------|----------|
| `CUDA out of memory` | Insufficient VRAM for the model | Reduce `--gpu-memory-utilization` or use a smaller model |
| `Model is too large` | `max_model_len` exceeds available memory | Set `--max-model-len` to a smaller value |
| NixlConnector timeout | RDMA not configured | Verify the SR-IOV/RDMA network and the `KSERVE_INFER_ROCE` env var |
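When tuning `--gpu-memory-utilization` and `--max-model-len`, a back-of-envelope memory estimate helps explain both OOM errors above: weights take roughly `params x bytes_per_param`, and the KV cache grows linearly with context length. The model shapes below (32 layers, 8 KV heads, head_dim 128, bf16) are illustrative assumptions, not values taken from this page:

```python
def weight_gib(params_b: float, bytes_per_param: int = 2) -> float:
    """Approximate weight footprint in GiB (fp16/bf16 -> 2 bytes per param)."""
    return params_b * 1e9 * bytes_per_param / 2**30

def kv_cache_gib(num_layers: int, num_kv_heads: int, head_dim: int,
                 seq_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache for one sequence: 2 (K and V) x layers x kv_heads x head_dim x tokens."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem / 2**30

# Illustrative 8B-parameter model with GQA (assumed shapes):
weights = weight_gib(8.0)                    # ~14.9 GiB in bf16
kv = kv_cache_gib(32, 8, 128, seq_len=8192)  # ~1.0 GiB per 8k-token sequence
print(f"weights ~{weights:.1f} GiB, kv/seq ~{kv:.1f} GiB")
```

If weights plus one sequence's KV cache already approach the VRAM budget implied by `--gpu-memory-utilization`, lowering `--max-model-len` is usually the first lever to pull.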

Compatibility Notes

  • CPU inference: Supported but significantly slower; set `--device cpu`
  • Prefix caching: Requires matching `PYTHONHASHSEED` and `--block-size` across pods and scheduler
  • Disaggregated PD: Requires NixlConnector and RDMA network for KV cache transfer
