Principle: KServe vLLM Health Monitoring
| Knowledge Sources | |
|---|---|
| Domains | Health_Monitoring, LLM_Serving, Kubernetes |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
A health monitoring pattern using Kubernetes liveness and readiness probes against the vLLM engine's health endpoint to detect model loading completion and runtime failures.
Description
vLLM Health Monitoring uses Kubernetes probe mechanisms to track the vLLM model server lifecycle:
- Liveness probe: `GET /health` on port 8000 (HTTP). Returns 200 when the vLLM engine is running; `initialDelaySeconds: 120` allows time for model loading.
- Readiness probe: same endpoint as the liveness probe; ensures the model is loaded and ready to serve before the pod receives traffic.
The long initial delay is critical for LLMs because model loading (downloading weights, building KV cache) can take significant time depending on model size and GPU count.
Usage
Configure probes in the LLMInferenceService pod template. Adjust initialDelaySeconds based on model size: 120s for 7B models, up to 4800s (80 min) for 600B+ models like DeepSeek-R1.
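As a sketch, the container-level probes described above might look like the following (standard Kubernetes probe fields; the container name and the exact placement inside the LLMInferenceService pod template are assumptions and depend on the KServe version):

```yaml
# Sketch: probes for a vLLM serving container (assumed name and port).
containers:
  - name: vllm-server            # hypothetical container name
    livenessProbe:
      httpGet:
        path: /health
        port: 8000
      initialDelaySeconds: 120   # ~7B model; raise toward 4800 for 600B+ models
    readinessProbe:
      httpGet:
        path: /health
        port: 8000
      initialDelaySeconds: 10
```

The key tuning knob is `initialDelaySeconds` on the liveness probe: set it too low and Kubernetes kills the container while weights are still loading, triggering a restart loop.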
Theoretical Basis
# Health monitoring model (NOT implementation code)
vLLM startup sequence:
1. Container starts
2. Download model weights (if not on PVC)
3. Load weights into GPU memory
4. Initialize KV cache
5. /health endpoint returns 200
Probe configuration:
liveness: /health, initialDelay=120s, period=30s, timeout=30s, failures=5
readiness: /health, initialDelay=10s, period=10s, failures=60
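The compact parameter listing above expands into the following standard Kubernetes probe stanzas (a sketch of the stated values, not a KServe default):

```yaml
# Liveness: tolerate 120s startup, then 5 failures x 30s period before restart.
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 120
  periodSeconds: 30
  timeoutSeconds: 30
  failureThreshold: 5
# Readiness: probe early and often; 60 failures x 10s period gives the model
# up to ~600s after the initial delay before the pod is marked unready.
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 60
```

Note the division of labor: the liveness probe's long initial delay protects the loading phase from restarts, while the readiness probe's high failure threshold keeps traffic away until `/health` actually returns 200.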