Principle: KServe vLLM Health Monitoring
| Knowledge Sources | |
|---|---|
| Domains | Health_Monitoring, LLM_Serving, Kubernetes |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
A health monitoring pattern using Kubernetes liveness and readiness probes against the vLLM engine's health endpoint to detect model loading completion and runtime failures.
Description
vLLM Health Monitoring uses Kubernetes probe mechanisms to track the vLLM model server lifecycle:
- Liveness probe: `GET /health` on port 8000 (HTTP). Returns 200 when the vLLM engine is running; `initialDelaySeconds: 120` allows time for model loading.
- Readiness probe: same endpoint as the liveness probe; ensures the model is loaded and ready to serve before the pod receives traffic.
The long initial delay is critical for LLMs because model loading (downloading weights, building KV cache) can take significant time depending on model size and GPU count.
Usage
Configure probes in the LLMInferenceService pod template. Adjust initialDelaySeconds based on model size: 120s for 7B models, up to 4800s (80 min) for 600B+ models like DeepSeek-R1.
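As a sketch, the container-level probes described above might look like the following (standard Kubernetes probe fields; the container name and the exact placement inside the LLMInferenceService pod template are assumptions and depend on the KServe version):

```yaml
# Sketch: probes for a vLLM serving container (assumed name and port).
containers:
  - name: vllm-server            # hypothetical container name
    livenessProbe:
      httpGet:
        path: /health
        port: 8000
      initialDelaySeconds: 120   # ~7B model; raise toward 4800 for 600B+ models
    readinessProbe:
      httpGet:
        path: /health
        port: 8000
      initialDelaySeconds: 10
```

The key tuning knob is `initialDelaySeconds` on the liveness probe: set it too low and Kubernetes kills the container while weights are still loading, triggering a restart loop.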
Theoretical Basis
# Health monitoring model (NOT implementation code)
vLLM startup sequence:
1. Container starts
2. Download model weights (if not on PVC)
3. Load weights into GPU memory
4. Initialize KV cache
5. /health endpoint returns 200
Probe configuration:
liveness: /health, initialDelay=120s, period=30s, timeout=30s, failures=5
readiness: /health, initialDelay=10s, period=10s, failures=60
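The compact parameter listing above expands into the following standard Kubernetes probe stanzas (a sketch of the stated values, not a KServe default):

```yaml
# Liveness: tolerate 120s startup, then 5 failures x 30s period before restart.
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 120
  periodSeconds: 30
  timeoutSeconds: 30
  failureThreshold: 5
# Readiness: probe early and often; 60 failures x 10s period gives the model
# up to ~600s after the initial delay before the pod is marked unready.
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 60
```

Note the division of labor: the liveness probe's long initial delay protects the loading phase from restarts, while the readiness probe's high failure threshold keeps traffic away until `/health` actually returns 200.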