Principle:Predibase Lorax Health Check Verification

Knowledge Sources	Kubernetes Health Probes
Domains	Observability, Model_Serving
Last Updated	2026-02-08 02:00 GMT

Overview

A liveness verification pattern that confirms inference shards are operational by executing a minimal generation request through the full inference pipeline.

Description

Health Check Verification addresses the problem of detecting whether a GPU inference server is truly ready to serve requests, beyond simple TCP connectivity. A model server can be running but not functional (e.g., model not loaded, GPU memory exhausted, CUDA error). This principle uses an end-to-end generation test: construct a minimal request, send it through all shards, and verify that token generation succeeds.

The pattern integrates with Kubernetes liveness and readiness probes via the /health HTTP endpoint.
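As a rough illustration of how such an endpoint could translate a generation test into a probe response, the sketch below uses only the Python standard library. It is not the LoRAX implementation: `health_status`, `make_health_handler`, `serve`, and the `check` callable are hypothetical names, and the 200/503 mapping is an assumption based on how Kubernetes HTTP probes interpret status codes.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable


def health_status(check: Callable[[], bool]) -> int:
    """Map the outcome of an end-to-end check to an HTTP status code."""
    try:
        healthy = check()
    except Exception:
        # Any error raised by the inference path (for example a CUDA
        # failure) is treated as "unhealthy" rather than crashing the probe.
        healthy = False
    return 200 if healthy else 503


def make_health_handler(check: Callable[[], bool]) -> type:
    """Build a handler class that answers GET /health using `check`."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self) -> None:
            if self.path != "/health":
                self.send_error(404)
                return
            self.send_response(health_status(check))
            self.send_header("Content-Length", "0")
            self.end_headers()

        def log_message(self, format: str, *args) -> None:
            # Probes fire frequently, so per-request logging is silenced.
            pass

    return HealthHandler


def serve(check: Callable[[], bool], port: int = 8080) -> None:
    """Serve /health until interrupted (hypothetical entry point)."""
    HTTPServer(("0.0.0.0", port), make_health_handler(check)).serve_forever()
```

Here `check` would wrap the minimal generation request described above, so a probe succeeds only when tokens can actually be produced.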

Usage

Use this principle to verify server readiness after startup and to continuously monitor health during operation. It is critical for Kubernetes deployments, where pod lifecycle depends on health probe responses.
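For the startup case, a client-side readiness wait might look like the following sketch. It assumes the server exposes `/health` over plain HTTP and treats any 2xx response as ready; `probe_once`, `wait_until_ready`, and the URL in the usage comment are illustrative, not part of LoRAX.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def probe_once(url: str, timeout_s: float = 2.0) -> bool:
    """Return True if a GET on `url` answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, and non-2xx responses
        # (urlopen raises HTTPError, a URLError subclass, for 4xx/5xx).
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 30,
    delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` up to `attempts` times, pausing `delay_s` between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay_s)
    return False


# Example usage (hypothetical address):
# ready = wait_until_ready(lambda: probe_once("http://localhost:8080/health"))
```

Passing the probe as a callable keeps the retry loop independent of the transport, so the same loop can wrap an HTTP probe or a direct call into the inference client.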

Theoretical Basis

Pseudo-code:

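# Health check algorithm (pseudo-code; helper names are illustrative)
def check_health(sharded_client, shard_info):
    # Build the smallest useful request: one short prompt, two new tokens.
    batch = create_minimal_batch(
        request_id=0,
        input_text="liveness",
        max_tokens=2
    )
    # Prefill sends the prompt through every shard.
    prefill_result = sharded_client.prefill(batch)
    if not prefill_result.is_success():
        # A failed prefill already proves the pipeline is unhealthy.
        return False
    # Decode exercises the token-generation step that follows prefill.
    decode_result = sharded_client.decode(batch)
    return decode_result.is_success()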

Related Pages

Implemented By

Implementation:Predibase_Lorax_Health_Check_Generation
