Principle:Predibase Lorax Health Check Verification

From Leeroopedia


Knowledge Sources
Domains: Observability, Model_Serving
Last Updated: 2026-02-08 02:00 GMT

Overview

A liveness verification pattern that confirms inference shards are operational by executing a minimal generation request through the full inference pipeline.

Description

Health Check Verification addresses the problem of detecting whether a GPU inference server is truly ready to serve requests, beyond simple TCP connectivity. A model server can be running but not functional (e.g., model not loaded, GPU memory exhausted, CUDA error). This principle uses an end-to-end generation test: construct a minimal request, send it through all shards, and verify that token generation succeeds.

The pattern integrates with Kubernetes liveness and readiness probes via the /health HTTP endpoint.
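
Seen from the probe side, the integration can be sketched as a small HTTP check. This is a minimal sketch, assuming the server answers GET /health with a 2xx status when the generation test passes and with a non-2xx status (or no answer) otherwise. The function name is_healthy, the base URL, and the timeout are illustrative and not part of LoRAX.

```python
import http.client
import urllib.request


def is_healthy(base_url: str, timeout_s: float = 2.0) -> bool:
    """Return True if GET {base_url}/health answers with a 2xx status.

    Assumption: an unhealthy server replies with a non-2xx status code
    or does not reply at all within `timeout_s` seconds.
    """
    url = base_url.rstrip("/") + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return 200 <= response.status < 300
    except (OSError, http.client.HTTPException):
        # urllib raises HTTPError for 4xx/5xx replies and URLError for
        # refused connections; both derive from OSError, as do timeouts.
        return False
```

A Kubernetes httpGet probe applies the same rule on its own: a status code from 200 up to (but not including) 400 counts as success, anything else as failure.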

Usage

Use this principle to verify server readiness after startup and to monitor health continuously during operation. It is critical for Kubernetes deployments, where the pod lifecycle depends on health probe responses.
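
The startup case can be sketched as a polling loop that waits for the first successful health check. This is a minimal sketch under stated assumptions: wait_until_ready, its parameters, and the default timings are hypothetical, and probe stands for any callable that returns True when the server reports healthy.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `probe` until it returns True or `timeout_s` has elapsed.

    `sleep` and `clock` are injectable so the loop can be tested
    without waiting in real time.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            if probe():
                return True
        except Exception:
            # A probe that raises is treated the same as a failed probe.
            pass
        if clock() + interval_s > deadline:
            # Another attempt would start after the deadline; give up.
            return False
        sleep(interval_s)
```

Model loading on a GPU server can take minutes, so the timeout here is deliberately generous. The right value depends on model size and storage speed.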

Theoretical Basis

Pseudo-code:

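# Health check algorithm
def check_health(sharded_client, shard_info):
    # Smallest batch that still exercises the full pipeline on every shard.
    batch = create_minimal_batch(
        request_id=0,
        input_text="liveness",
        max_tokens=2,
    )
    try:
        # Prefill processes the prompt; a failure here already means unhealthy.
        prefill_result = sharded_client.prefill(batch)
        if not prefill_result.is_success():
            return False
        # Decode generates the next token from the prefilled state.
        decode_result = sharded_client.decode(batch)
    except Exception:
        # Shard errors (model not loaded, out of memory, CUDA error) count as unhealthy.
        return False
    return decode_result.is_success()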
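
The pseudo-code above can be exercised without a GPU by substituting stub shards. This is a minimal sketch in which every name (ShardError, FakeShard, FakeShardedClient) is a hypothetical stand-in and not the LoRAX client API. It illustrates the failure modes named in the Description: the shard object exists and accepts calls, as a running process would, yet generation fails.

```python
from dataclasses import dataclass
from typing import List, Optional


class ShardError(RuntimeError):
    """Hypothetical error raised by a shard that cannot run a forward pass."""


@dataclass
class FakeShard:
    """Stand-in for one inference shard; `failure` simulates a broken state."""

    failure: Optional[str] = None

    def generate(self, prompt: str, max_tokens: int) -> List[int]:
        if self.failure is not None:
            raise ShardError(self.failure)
        # Pretend to generate `max_tokens` token ids.
        return list(range(max_tokens))


class FakeShardedClient:
    """Sends the same request to every shard, as a sharded client would."""

    def __init__(self, shards: List[FakeShard]) -> None:
        self.shards = list(shards)

    def generate(self, prompt: str, max_tokens: int) -> List[List[int]]:
        return [shard.generate(prompt, max_tokens) for shard in self.shards]


def check_health(client: FakeShardedClient) -> bool:
    """Healthy only if every shard produces at least one token."""
    try:
        outputs = client.generate("liveness", max_tokens=2)
    except ShardError:
        return False
    return all(len(tokens) > 0 for tokens in outputs)


healthy = FakeShardedClient([FakeShard(), FakeShard()])
broken = FakeShardedClient([FakeShard(), FakeShard(failure="CUDA error")])
print(check_health(healthy), check_health(broken))
```

A single failing shard makes the whole check fail, which matches the intent of sending the request through all shards.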

Related Pages

Implemented By
