Principle:Triton inference server Server Health Check API
| Knowledge Sources | |
|---|---|
| Domains | MLOps, Observability, Model_Serving |
| Last Updated | 2026-02-13 17:00 GMT |
Overview
A readiness and liveness probing mechanism that allows clients and orchestrators to verify an inference server's operational state before sending requests.
Description
Health check APIs expose HTTP endpoints that report whether an inference server is live (the process is running) and ready (models are loaded and the server is accepting requests). The endpoints follow the KServe v2 inference protocol and are essential for container orchestration systems such as Kubernetes, which use liveness and readiness probes to manage the service lifecycle.
The distinction between liveness and readiness is critical: a server can be live (process running, accepting connections) but not ready (models still loading). This allows orchestrators to avoid restarting a server that is merely initializing while still detecting genuinely failed processes.
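The distinction can be sketched as a small decision table. This is a minimal illustration, not how any particular orchestrator is implemented: the `ServerState` enum, the `classify` function, and the `ACTIONS` policy are hypothetical names introduced here.

```python
from enum import Enum


class ServerState(Enum):
    DOWN = "down"                  # liveness check fails
    INITIALIZING = "initializing"  # live, but not yet ready
    READY = "ready"                # live and ready


def classify(live: bool, ready: bool) -> ServerState:
    """Combine the liveness and readiness results into one state."""
    if not live:
        # A server that is not live cannot serve requests,
        # whatever the readiness check reports.
        return ServerState.DOWN
    return ServerState.READY if ready else ServerState.INITIALIZING


# Hypothetical orchestrator policy, for illustration only.
ACTIONS = {
    ServerState.DOWN: "restart the server",
    ServerState.INITIALIZING: "keep running, withhold traffic",
    ServerState.READY: "route traffic",
}
```

Under this sketch, a server that is live but still loading models maps to `INITIALIZING`, so it is left running and simply receives no traffic until it reports ready.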
Usage
After launching an inference server, poll the readiness endpoint until it reports ready rather than assuming the server has finished initializing. For production deployments, wire the liveness and readiness endpoints into Kubernetes liveness/readiness probes. Use the model-specific readiness check to confirm that an individual model is loaded before sending inference requests to it.
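A startup check of this kind might look like the following sketch, which uses only the Python standard library. The function names, the timeout and polling defaults, and the base URL in the usage comment are assumptions made for illustration; only the `/v2/health/ready` path comes from the protocol described here.

```python
import time
import urllib.error
import urllib.request


def is_ready(base_url: str, timeout: float = 2.0) -> bool:
    """Return True only if GET /v2/health/ready answers with HTTP 200."""
    url = f"{base_url.rstrip('/')}/v2/health/ready"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers non-2xx replies (HTTPError), refused connections, and timeouts.
        return False


def wait_until_ready(base_url: str, deadline_s: float = 60.0,
                     interval_s: float = 1.0, check=is_ready) -> bool:
    """Poll `check` until it succeeds or `deadline_s` seconds have elapsed."""
    end = time.monotonic() + deadline_s
    while True:
        if check(base_url):
            return True
        if time.monotonic() >= end:
            return False
        time.sleep(interval_s)


# Example usage (the address is an assumption; adjust to your deployment):
# if not wait_until_ready("http://localhost:8000"):
#     raise RuntimeError("inference server did not become ready in time")
```

The `check` parameter is injectable so the polling loop can be exercised without a running server, and the deadline keeps a failed launch from blocking the caller indefinitely.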
Theoretical Basis
The KServe v2 health protocol defines three endpoints:
GET /v2/health/live → Server process is running
GET /v2/health/ready → Server is ready to accept inference requests
GET /v2/models/<name>/ready → Specific model is loaded and ready
Response semantics:
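- HTTP 200: the check passed (the server is live, the server is ready, or the named model is ready, depending on the endpoint)
- HTTP 400: the check failed (for example, models are still loading or the server is in an error state)
- No HTTP response (connection refused or timed out): the server process is not reachable and should be treated as not live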
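The endpoints and response semantics above can be captured in two small helpers. This is a sketch under stated assumptions: `health_endpoints` and `interpret_status` are hypothetical names, treating every 4xx status as "not ready" is slightly broader than the HTTP 400 listed above, and the "unreachable" and "unexpected" outcomes are additions for cases where no response or a different status arrives.

```python
from typing import Dict, Optional
from urllib.parse import quote


def health_endpoints(base_url: str, model_name: str) -> Dict[str, str]:
    """Build the three health URLs for a server and one model."""
    base = base_url.rstrip("/")
    model = quote(model_name, safe="")  # escape characters unsafe in a URL path
    return {
        "live": f"{base}/v2/health/live",
        "ready": f"{base}/v2/health/ready",
        "model_ready": f"{base}/v2/models/{model}/ready",
    }


def interpret_status(status: Optional[int]) -> str:
    """Map an HTTP status code (or None for no response) to a health result."""
    if status is None:
        return "unreachable"   # assumption: no HTTP response was received
    if status == 200:
        return "healthy"
    if 400 <= status < 500:
        return "not ready"     # assumption: any 4xx is treated like 400
    return "unexpected"        # anything else is outside the semantics above
```

Keeping the status mapping separate from the network call makes it easy to reuse with any HTTP client and to test without a live server.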