Principle:Predibase Lorax Health Check Verification
| Knowledge Sources | |
|---|---|
| Domains | Observability, Model_Serving |
| Last Updated | 2026-02-08 02:00 GMT |
Overview
A liveness verification pattern that confirms inference shards are operational by executing a minimal generation request through the full inference pipeline.
Description
Health Check Verification addresses the problem of detecting whether a GPU inference server is truly ready to serve requests, beyond simple TCP connectivity. A model server can be running but not functional (e.g., model not loaded, GPU memory exhausted, CUDA error). This principle uses an end-to-end generation test: construct a minimal request, send it through all shards, and verify that token generation succeeds.
The pattern integrates with Kubernetes liveness and readiness probes via the /health HTTP endpoint.
Usage
Use this principle to verify server readiness after startup and to continuously monitor health during operation. It is critical for Kubernetes deployments, where the pod lifecycle depends on health probe responses.
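As a rough illustration of the startup case, the sketch below polls the /health endpoint until it answers with a 2xx status. It uses only the Python standard library. The helper names (`is_healthy`, `wait_until_ready`), the base URL, and the retry settings are assumptions made for this example and are not part of LoRAX.

```python
import time
import urllib.error
import urllib.request


def is_healthy(base_url: str, timeout_s: float = 5.0) -> bool:
    """Return True if GET {base_url}/health answers with a 2xx status."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Covers connection refused, timeouts, and non-2xx responses
        # (urlopen raises HTTPError, a URLError subclass, for 4xx/5xx).
        return False


def wait_until_ready(base_url: str, attempts: int = 30, delay_s: float = 2.0) -> bool:
    """Poll the health endpoint; give up after `attempts` failed checks."""
    for _ in range(attempts):
        if is_healthy(base_url):
            return True
        time.sleep(delay_s)
    return False
```

A caller might run `wait_until_ready("http://localhost:8080")` before sending traffic; the host and port here are placeholders. In a Kubernetes deployment the kubelet performs the equivalent polling itself through the configured probes.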
Theoretical Basis
Pseudo-code:
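# Health check algorithm (pseudo-code; helper names are illustrative)
def check_health(sharded_client, shard_info):
    # Build the smallest batch that still exercises the full pipeline.
    batch = create_minimal_batch(
        request_id=0,
        input_text="liveness",
        max_tokens=2,
    )
    # Prefill runs the prompt through every shard; decode generates a token.
    prefill_result = sharded_client.prefill(batch)
    decode_result = sharded_client.decode(batch)
    # Healthy only if both stages succeeded.
    return prefill_result.is_success() and decode_result.is_success()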
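The pseudo-code above can be exercised without a GPU by substituting stub shards. The sketch below is a self-contained simulation, not the LoRAX implementation: `StageResult`, `StubShard`, `StubShardedClient`, and the dictionary-based batch are all hypothetical stand-ins chosen to show that a failure in any shard, at either stage, makes the check report unhealthy.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StageResult:
    """Outcome of one pipeline stage (hypothetical type for this sketch)."""
    ok: bool
    detail: str = ""

    def is_success(self) -> bool:
        return self.ok


class StubShard:
    """Hypothetical stand-in for one inference shard."""

    def __init__(self, name: str, fail_on: str = "") -> None:
        self.name = name
        self.fail_on = fail_on  # "prefill", "decode", or "" for a healthy shard

    def run(self, stage: str, batch: dict) -> None:
        if stage == self.fail_on:
            raise RuntimeError(f"{self.name}: simulated failure during {stage}")


class StubShardedClient:
    """Fans each stage out to every shard; one failing shard fails the stage."""

    def __init__(self, shards: List[StubShard]) -> None:
        self.shards = shards

    def _run_all(self, stage: str, batch: dict) -> StageResult:
        try:
            for shard in self.shards:
                shard.run(stage, batch)
        except RuntimeError as exc:
            return StageResult(ok=False, detail=str(exc))
        return StageResult(ok=True)

    def prefill(self, batch: dict) -> StageResult:
        return self._run_all("prefill", batch)

    def decode(self, batch: dict) -> StageResult:
        return self._run_all("decode", batch)


def create_minimal_batch(request_id: int, input_text: str, max_tokens: int) -> dict:
    # A plain dict stands in for the real batch structure (assumption).
    return {"id": request_id, "inputs": input_text, "max_tokens": max_tokens}


def check_health(sharded_client: StubShardedClient) -> bool:
    batch = create_minimal_batch(request_id=0, input_text="liveness", max_tokens=2)
    # Stop early if prefill fails; otherwise the decode result decides.
    if not sharded_client.prefill(batch).is_success():
        return False
    return sharded_client.decode(batch).is_success()
```

Calling `check_health(StubShardedClient([StubShard("s0"), StubShard("s1", fail_on="decode")]))` reports unhealthy even though the first shard and the prefill stage are fine, which is the behavior a plain TCP connectivity check would miss.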