Principle:Triton inference server Server Health Check API

Knowledge Sources	KServe V2 Protocol, Triton Server
Domains	MLOps, Observability, Model_Serving
Last Updated	2026-02-13 17:00 GMT

Overview

A readiness and liveness probing mechanism that allows clients and orchestrators to verify an inference server's operational state before sending requests.

Description

Health check APIs expose HTTP endpoints that report whether an inference server is live (the process is running and answering requests) and ready (models are loaded and the server is accepting inference requests). The endpoints follow the KServe v2 inference protocol and are essential for container orchestration systems such as Kubernetes, which use liveness and readiness probes to manage the service lifecycle.

The distinction between liveness and readiness is critical: a server can be live (process running, accepting connections) but not yet ready (models still loading). Keeping the two signals separate lets an orchestrator withhold traffic from a server that is merely initializing, without restarting it, while still detecting and restarting a process that has genuinely failed.
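The decision an orchestrator derives from the two signals can be sketched as a small lookup. This is an illustrative simplification rather than Kubernetes' actual logic: real probes act only after a configurable number of consecutive failures, and the names `Action` and `orchestrator_action` are hypothetical.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical outcomes an orchestrator might choose from."""
    ROUTE_TRAFFIC = "route traffic to the server"
    HOLD_TRAFFIC = "withhold traffic and keep waiting"
    RESTART = "restart the server"


def orchestrator_action(live: bool, ready: bool) -> Action:
    """Map the liveness and readiness signals to an action.

    Simplified: a single failed check decides the outcome here, whereas
    a real orchestrator waits for several consecutive failures.
    """
    if not live:
        # The process is not answering at all, so waiting will not help.
        return Action.RESTART
    if not ready:
        # Live but still loading models: do not restart, just hold traffic.
        return Action.HOLD_TRAFFIC
    return Action.ROUTE_TRAFFIC
```

The middle branch is the case the liveness/readiness split exists for: without a separate readiness signal, a slow model load would be indistinguishable from a failed process.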

Usage

After launching an inference server, poll the health check endpoints until the server reports ready before sending any requests. For production deployments, wire the same endpoints into Kubernetes liveness and readiness probes. Use the model-specific readiness check to confirm that an individual model is loaded before sending inference requests to it.
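A minimal sketch of the post-launch wait, using only the Python standard library. It assumes the server listens on http://localhost:8000 (adjust for the actual deployment); the helper names `server_ready` and `wait_until`, and the timeout values, are illustrative choices rather than part of any Triton client API.

```python
import time
import urllib.error
import urllib.request


def server_ready(base_url="http://localhost:8000", timeout=2.0):
    """Return True only if GET /v2/health/ready answers with HTTP 200.

    The base URL is an assumption; point it at the real server address.
    """
    url = base_url.rstrip("/") + "/v2/health/ready"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers non-2xx replies (HTTPError is a URLError subclass),
        # refused connections, and timeouts: all mean "not ready yet".
        return False


def wait_until(check, timeout_s=60.0, interval_s=1.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call check() repeatedly until it returns True or timeout_s elapses.

    sleep and clock are injectable so the loop can be exercised
    without real delays.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A launch script might call `wait_until(server_ready, timeout_s=120)` and abort if it returns False. A model-specific wait can reuse the same loop by passing a check that targets the model readiness endpoint instead.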

Theoretical Basis

The KServe v2 health protocol defines three endpoints:

GET /v2/health/live          → Server process is running
GET /v2/health/ready         → Server is ready to accept inference requests
GET /v2/models/<name>/ready  → Specific model is loaded and ready (a specific version can be checked with /v2/models/<name>/versions/<version>/ready)

Response semantics:

HTTP 200: Live/ready
HTTP 4xx (Triton returns 400): Not live/not ready, for example while models are still loading or after an error
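The endpoints and response semantics above can be sketched as a small client built on the Python standard library. The helper names (`health_url`, `http_status`, `is_healthy`) are hypothetical, and the sketch deliberately treats anything other than HTTP 200, including an unreachable server, as unhealthy.

```python
import urllib.error
import urllib.request
from urllib.parse import quote


def health_url(base_url, kind, model_name=None):
    """Build a KServe v2 health URL; kind is 'live', 'ready', or 'model'."""
    base = base_url.rstrip("/")
    if kind in ("live", "ready"):
        return f"{base}/v2/health/{kind}"
    if kind == "model":
        if not model_name:
            raise ValueError("model_name is required for a model check")
        return f"{base}/v2/models/{quote(model_name, safe='')}/ready"
    raise ValueError(f"unknown health check kind: {kind!r}")


def http_status(url, timeout=2.0):
    """Return the status code of a GET, or None if there was no response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises for non-2xx replies; the code is still available.
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return None


def is_healthy(url, get_status=http_status):
    """True only on HTTP 200; any other status or no response is unhealthy."""
    return get_status(url) == 200
```

For example, `is_healthy(health_url("http://localhost:8000", "model", "resnet50"))` would check a single model; the host, port, and model name here are placeholders. The `get_status` parameter exists so the status logic can be tested without a running server.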

Related Pages

Implemented By

Implementation:Triton_inference_server_Server_HTTP_Health_Endpoint
