
Implementation:KServe vLLM Health Endpoint

From Leeroopedia
Knowledge Sources
Domains Health_Monitoring, LLM_Serving, Kubernetes
Last Updated 2026-02-13 00:00 GMT

Overview

Concrete Kubernetes probe configuration for monitoring vLLM engine health in LLMInferenceService deployments.

Description

The vLLM engine exposes a /health endpoint on port 8000, served over HTTPS. KServe configures a liveness probe against this endpoint in the LLMInferenceService pod template: the endpoint fails while the model is still loading and succeeds once the engine is serving, so Kubernetes restarts the container if the engine hangs or crashes after startup. The probe configuration is defined in the sample YAML and in the default pod template ConfigMap.
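The status codes in the I/O contract below map onto probe outcomes as follows; a minimal sketch (the function name is illustrative, not a KServe or vLLM API):

```python
def engine_state(status_code: int) -> str:
    """Interpret vLLM's /health response the way the liveness probe does.

    Kubernetes counts any 2xx/3xx status as probe success; in practice
    vLLM answers 200 once the model is loaded and 503 while starting.
    """
    if status_code == 200:
        return "running"    # engine up, model loaded -> probe succeeds
    if status_code == 503:
        return "loading"    # model still loading -> probe fails
    return "unhealthy"      # any other status also counts as a failure

print(engine_state(200))  # running
print(engine_state(503))  # loading
```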

Usage

Include liveness probe configuration in the LLMInferenceService pod template. The default templates in config/llmisvcconfig/ already include appropriate probe settings.

Code Reference

Source Location

  • Repository: kserve
  • File: docs/samples/llmisvc/single-node-gpu/llm-inference-service-qwen2-7b-gpu.yaml, Lines 27-35
  • File: config/llmisvcconfig/config-llm-template.yaml (default pod template)

Signature

livenessProbe:
  httpGet:
    path: /health
    port: 8000
    scheme: HTTPS
  initialDelaySeconds: 120
  periodSeconds: 30
  timeoutSeconds: 30
  failureThreshold: 5
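Taken together, these settings determine how long Kubernetes waits before acting. A quick sketch of the implied timing (standard kubelet probe semantics; the variable names simply mirror the YAML fields):

```python
initial_delay_s = 120   # initialDelaySeconds
period_s = 30           # periodSeconds
failure_threshold = 5   # failureThreshold

# The first probe fires at initialDelaySeconds, then every periodSeconds.
# A container that never becomes healthy fails probes at 120, 150, 180,
# 210, and 240 s, so the earliest possible restart is:
earliest_restart_s = initial_delay_s + (failure_threshold - 1) * period_s
print(earliest_restart_s)  # 240

# Once the engine is running, an engine that stops responding is detected
# within at most failureThreshold full periods (plus up to timeoutSeconds
# per probe attempt):
detection_window_s = failure_threshold * period_s
print(detection_window_s)  # 150
```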

Import

# External dependency: vLLM engine provides /health endpoint
# No code import needed — configured via YAML

I/O Contract

Inputs

Name                 Type    Required  Description
httpGet.path         string  Yes       /health endpoint path
httpGet.port         int     Yes       8000 (vLLM default)
httpGet.scheme       string  Yes       HTTPS (vLLM serves over TLS)
initialDelaySeconds  int     Yes       Wait before first probe (120s for 7B models, 4800s for 600B+)
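The two initialDelaySeconds values in the table suggest that load time scales roughly with parameter count. A rule-of-thumb interpolation consistent with those two data points (an illustration, not a KServe default):

```python
def initial_delay_seconds(params_billions: float) -> int:
    """Estimate initialDelaySeconds from model size.

    Interpolated from the two values in the table above (120 s at 7B,
    4800 s at 600B): roughly 8 s per billion parameters, floored at
    120 s. This heuristic is illustrative only -- measure your own
    model's load time before tuning the probe.
    """
    return max(120, int(params_billions * 8))

print(initial_delay_seconds(7))    # 120
print(initial_delay_seconds(600))  # 4800
```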

Outputs

Name               Type      Description
HTTP 200           response  Engine running, model loaded
HTTP 503           response  Engine starting, model not yet loaded
Container restart  action    Kubernetes restarts the container after failureThreshold consecutive liveness probe failures

Usage Examples

Monitor Model Loading

# Watch pod status during model loading
kubectl get pods -l app.kubernetes.io/component=llminferenceservice-workload -w

# Check liveness probe events
kubectl describe pod <pod-name> | grep -A5 "Liveness"

# View vLLM logs during loading
kubectl logs <pod-name> -c main --follow

# Manual health check
kubectl exec <pod-name> -- curl -k https://localhost:8000/health
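Instead of exec-ing into the pod, you can forward the port and poll from your workstation. A sketch, assuming `kubectl port-forward <pod-name> 8000:8000` is running in another terminal (the helper name and defaults are illustrative):

```python
import ssl
import time
import urllib.error
import urllib.request

def wait_until_loaded(url="https://localhost:8000/health",
                      interval_s=10.0, timeout_s=600.0):
    """Poll vLLM's /health until it answers 200 or timeout_s elapses.

    Certificate verification is disabled because vLLM typically serves a
    self-signed certificate (kubelet likewise skips verification for
    scheme: HTTPS probes).
    """
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, context=ctx, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # 503 or connection refused while the model loads
        time.sleep(interval_s)
    return False
```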

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
