
Principle:KServe vLLM Health Monitoring

From Leeroopedia
Knowledge Sources
Domains Health_Monitoring, LLM_Serving, Kubernetes
Last Updated 2026-02-13 00:00 GMT

Overview

A health monitoring pattern using Kubernetes liveness and readiness probes against the vLLM engine's health endpoint to detect model loading completion and runtime failures.

Description

vLLM Health Monitoring uses Kubernetes probe mechanisms to track the vLLM model server's lifecycle:

  • Liveness probe: GET /health on port 8000 (plain HTTP). Returns 200 while the vLLM engine is running. initialDelaySeconds: 120 allows time for model loading.
  • Readiness probe: Same /health endpoint; it gates traffic so the pod receives requests only after the model is loaded and ready to serve.
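The two probes can be expressed with standard Kubernetes probe fields, using the timing values given later on this page. This is a sketch; the exact placement of these fields inside the LLMInferenceService spec may differ by KServe version.

```yaml
# Sketch: standard Kubernetes httpGet probes for a vLLM container.
# Field names are the stock Kubernetes probe API; values match this page.
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 120   # allow time for model loading
  periodSeconds: 30
  timeoutSeconds: 30
  failureThreshold: 5
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 60       # tolerate a long load before first success
```

Note the asymmetry: the readiness probe starts polling early but tolerates many failures, while the liveness probe waits long before its first check so a slow load is not mistaken for a hung engine.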

The long initial delay is critical for LLMs because model startup (downloading weights, loading them into GPU memory, allocating the KV cache) can take anywhere from a couple of minutes to over an hour, depending on model size and GPU count.

Usage

Configure the probes in the LLMInferenceService pod template, and scale initialDelaySeconds with model size: 120s suffices for 7B models, while 600B+ models such as DeepSeek-R1 may need up to 4800s (80 min).
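One way to pick a starting value is to interpolate between the two anchor points above (120s at 7B, 4800s at 600B). The helper below is purely illustrative and not part of KServe; linear scaling is an assumption, since real load time also depends on disk/network bandwidth and GPU count.

```python
# Hypothetical sizing helper (not a KServe API): linearly interpolate
# initialDelaySeconds between the anchors 7B -> 120 s and 600B -> 4800 s.
def initial_delay_seconds(params_billion: float) -> int:
    lo_params, lo_secs = 7, 120
    hi_params, hi_secs = 600, 4800
    if params_billion <= lo_params:
        return lo_secs
    if params_billion >= hi_params:
        return hi_secs
    frac = (params_billion - lo_params) / (hi_params - lo_params)
    return round(lo_secs + frac * (hi_secs - lo_secs))

print(initial_delay_seconds(7))    # 120
print(initial_delay_seconds(600))  # 4800
```

Treat the result as a floor, not a guarantee: measure an actual cold start for your model and add headroom before setting the probe.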

Theoretical Basis

# Health monitoring model (NOT implementation code)
vLLM startup sequence:
  1. Container starts
  2. Download model weights (if not on PVC)
  3. Load weights into GPU memory
  4. Initialize KV cache
  5. /health endpoint returns 200

Probe configuration:
  liveness:  /health, initialDelay=120s, period=30s, timeout=30s, failures=5
  readiness: /health, initialDelay=10s,  period=10s, failures=60
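The probe semantics above can be sketched as a minimal polling loop. This is a simplified stand-in for the kubelet's behavior, with a fake in-process vLLM server (the FakeVLLM class is an illustration, not real vLLM code) that starts answering /health with 200 only after a short "model loading" delay.

```python
# Sketch of readiness-probe semantics: poll /health after an initial delay,
# succeed on the first 200, give up after failureThreshold consecutive misses.
import http.server
import threading
import time
import urllib.error
import urllib.request

class FakeVLLM(http.server.BaseHTTPRequestHandler):
    """Stand-in engine: /health returns 503 until 'loading' finishes."""
    ready_at = time.time() + 0.5  # pretend model loading takes 0.5 s
    def do_GET(self):
        ok = self.path == "/health" and time.time() >= FakeVLLM.ready_at
        self.send_response(200 if ok else 503)
        self.end_headers()
    def log_message(self, *args):  # silence request logging
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), FakeVLLM)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

def probe(url, initial_delay, period, failure_threshold):
    """Return True once /health answers 200, False after too many failures."""
    time.sleep(initial_delay)
    failures = 0
    while failures < failure_threshold:
        try:
            with urllib.request.urlopen(url, timeout=1) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # connection refused or non-200; count as one failure
        failures += 1
        time.sleep(period)
    return False

ready = probe(f"http://127.0.0.1:{port}/health",
              initial_delay=0.1, period=0.1, failure_threshold=60)
print(ready)
server.shutdown()
```

The delays are shrunk (0.1s instead of 10s) so the sketch runs quickly; the structure, initial delay, fixed period, and consecutive-failure budget mirrors the readiness configuration above.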

Related Pages

Implemented By
