Principle:Triton inference server Server ORCA Load Reporting
Overview
ORCA Load Reporting is the principle governing how Triton Inference Server communicates real-time server load information to client-side load balancers via HTTP response headers, following the ORCA (Open Request Cost Aggregation) protocol. The OrcaHTTP module extracts key-value (KV) cache utilization metrics from Triton's Prometheus metric endpoint, computes derived metrics (cache utilization ratio, maximum token capacity), and formats them as ORCA-compliant response headers in either JSON or text format. This enables intelligent load balancing decisions for large language model (LLM) inference deployments.
Theoretical Basis
Why Load Reporting Matters for LLM Inference
Large language model inference has fundamentally different load characteristics than traditional ML inference. A standard image classification model has deterministic per-request resource consumption, but LLM inference varies dramatically based on sequence length, KV cache occupancy, and concurrent batch size. A server may have available GPU compute but be bottlenecked on KV cache memory, or vice versa. Without per-response load signals, client-side load balancers must rely on crude metrics like connection count or round-robin distribution, leading to imbalanced load and degraded tail latencies.
The ORCA protocol addresses this by embedding load metrics directly in inference response headers, allowing load balancers to make request-level routing decisions based on the actual server state at the time of response.
ORCA Protocol Compliance
ORCA defines a standard mechanism for reporting named application-level metrics alongside inference responses. Triton implements this through the endpoint-load-metrics response header, which carries metrics in one of two formats based on the client's endpoint-load-metrics-format request header:
| Format | Header Prefix | Example |
|---|---|---|
| JSON | JSON | JSON {"named_metrics":{"kv_cache_utilization":0.75,"max_token_capacity":8192}} |
| Text (Native HTTP) | TEXT | TEXT named_metrics.kv_cache_utilization=0.750000, named_metrics.max_token_capacity=8192 |
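As an illustration of the two header formats, the following minimal sketch builds both header values from already-computed metrics. The function names FormatOrcaJson and FormatOrcaText are hypothetical and not taken from the OrcaHTTP module. The number formatting (%g for JSON, %f for text) is an assumption chosen to reproduce the example values in the table.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: builds the JSON-style header value.
// "%g" is an assumption chosen so that 0.75 prints as "0.75".
std::string FormatOrcaJson(double kv_cache_utilization, long long max_token_capacity)
{
  char buf[256];
  std::snprintf(
      buf, sizeof(buf),
      "JSON {\"named_metrics\":{\"kv_cache_utilization\":%g,"
      "\"max_token_capacity\":%lld}}",
      kv_cache_utilization, max_token_capacity);
  return std::string(buf);
}

// Hypothetical helper: builds the text-style header value.
// "%f" prints six decimal places, matching "0.750000" in the table.
std::string FormatOrcaText(double kv_cache_utilization, long long max_token_capacity)
{
  char buf[256];
  std::snprintf(
      buf, sizeof(buf),
      "TEXT named_metrics.kv_cache_utilization=%f, "
      "named_metrics.max_token_capacity=%lld",
      kv_cache_utilization, max_token_capacity);
  return std::string(buf);
}
```

Calling either helper with a utilization of 0.75 and a capacity of 8192 yields the corresponding example string from the table.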
Metric Extraction Pipeline
The load reporting pipeline operates in several stages:
- Prometheus Metric Retrieval: The module calls TRITONSERVER_ServerMetrics() to obtain the server's Prometheus-formatted metrics string.
- Metric Family Extraction: The MetricFamilyExtractor function uses RE2 regular expressions to parse the Prometheus text format, extracting all metrics from the nv_trt_llm_kv_cache_block_metrics family along with their labels and values.
- KV Cache Metric Interpretation: Three specific metric labels are extracted: tokens_per (tokens per block), used (currently used blocks), and max (maximum available blocks).
- Derived Metric Computation: KV cache utilization is computed as used_blocks / max_blocks, and maximum token capacity as max_blocks * tokens_per_block.
- Header Formatting: The derived metrics are formatted according to the requested ORCA type (JSON or text) and attached to the response headers.
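The derived metric computation can be sketched as follows. The struct and function names (KvCacheBlocks, DerivedLoad, ComputeDerivedLoad) are hypothetical and serve only to illustrate the two formulas. Returning zeros when max_blocks is not positive is an assumption of this sketch, not documented behavior of the module.

```cpp
#include <cstdint>

// Hypothetical container for the three extracted gauge values.
struct KvCacheBlocks {
  double tokens_per_block;  // "tokens_per" label
  double used_blocks;       // "used" label
  double max_blocks;        // "max" label
};

// Hypothetical container for the two derived metrics.
struct DerivedLoad {
  double kv_cache_utilization;
  std::uint64_t max_token_capacity;
};

DerivedLoad ComputeDerivedLoad(const KvCacheBlocks& blocks)
{
  DerivedLoad load = {0.0, 0};
  // Assumption: avoid dividing by zero when no blocks are available.
  if (blocks.max_blocks <= 0.0) {
    return load;
  }
  // Utilization is the fraction of KV cache blocks currently in use.
  load.kv_cache_utilization = blocks.used_blocks / blocks.max_blocks;
  // Capacity is the total number of tokens the KV cache can hold.
  load.max_token_capacity =
      static_cast<std::uint64_t>(blocks.max_blocks * blocks.tokens_per_block);
  return load;
}
```

For example, with 64 tokens per block, 96 used blocks, and 128 maximum blocks, the utilization is 96 / 128 = 0.75 and the capacity is 128 * 64 = 8192 tokens.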
Prometheus Text Format Parsing
The MetricFamilyExtractor function provides a general-purpose Prometheus text format parser that extracts structured PromMetric objects from raw Prometheus output. It handles:
- Metric families with labels: metric_name{label1="value1",label2="value2"} 42.0
- Metric families without labels: metric_name 42.0
- Multiple metrics within the same family with different label sets
Each parsed metric is stored as a PromMetric struct containing a label map (std::unordered_map<std::string, std::string>) and a double value. This generic design means the extractor can be reused for additional metric families beyond KV cache in the future.
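A parser of this kind might look like the sketch below. It is not the module's code: it swaps RE2 for std::regex so that the example depends only on the standard library, and the function name ExtractFamily is hypothetical. The PromMetric layout follows the description above. The sketch assumes the family name contains only characters that need no regex escaping, and it does not handle escaped quotes or closing braces inside label values.

```cpp
#include <regex>
#include <sstream>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <vector>

struct PromMetric {
  std::unordered_map<std::string, std::string> labels;
  double value;
};

// Hypothetical stand-in for MetricFamilyExtractor, using std::regex
// instead of RE2. Returns every sample belonging to `family`.
std::vector<PromMetric> ExtractFamily(
    const std::string& text, const std::string& family)
{
  // name, optional {labels}, value, optional integer timestamp.
  const std::regex line_re(
      family + R"((?:\{([^}]*)\})?\s+(\S+)(?:\s+-?\d+)?\s*)");
  // A single label="value" pair inside the braces.
  const std::regex label_re(R"re(([A-Za-z_][A-Za-z0-9_]*)="([^"]*)")re");

  std::vector<PromMetric> metrics;
  std::istringstream stream(text);
  std::string line;
  while (std::getline(stream, line)) {
    std::smatch m;
    // Comment lines (# HELP, # TYPE) and other families do not match.
    if (!std::regex_match(line, m, line_re)) {
      continue;
    }
    PromMetric metric;
    try {
      metric.value = std::stod(m[2].str());
    }
    catch (const std::exception&) {
      continue;  // skip samples whose value is not numeric
    }
    const std::string label_text = m[1].str();  // empty if no labels
    for (std::sregex_iterator it(label_text.begin(), label_text.end(), label_re), end;
         it != end; ++it) {
      metric.labels[(*it)[1].str()] = (*it)[2].str();
    }
    metrics.push_back(metric);
  }
  return metrics;
}
```

Because the function takes the family name as a parameter, the same routine can serve any metric family, which mirrors the reuse argument made above.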
Error Resilience
The module handles several failure modes gracefully:
- If the Prometheus metrics endpoint is unavailable, the error is logged but the inference response is still sent (without load metrics).
- If any of the three required KV cache metrics are missing or negative, an error is logged and no load header is attached (rather than sending partial or incorrect data).
- If an invalid orca_type is specified, an error is logged and no header is produced.
This defensive approach ensures that the load reporting feature never interferes with the primary inference serving function.
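The all-or-nothing validation described above can be sketched as a single check that runs before any header is built. The function name CheckLoadInputs, the map-based input, and the accepted orca_type strings "json" and "text" are assumptions made for illustration. In a real server the error string would be passed to the logger while the inference response proceeds without a load header.

```cpp
#include <initializer_list>
#include <string>
#include <unordered_map>

// Hypothetical check: returns true only if all three KV cache metrics
// are present and non-negative and the requested format is recognized.
// On failure, *error describes the problem and no header should be set.
bool CheckLoadInputs(
    const std::unordered_map<std::string, double>& kv_metrics,
    const std::string& orca_type, std::string* error)
{
  for (const char* key : {"tokens_per", "used", "max"}) {
    auto it = kv_metrics.find(key);
    if (it == kv_metrics.end() || it->second < 0) {
      *error = std::string("missing or negative KV cache metric: ") + key;
      return false;
    }
  }
  // Assumption: the format values are the lowercase strings below.
  if (orca_type != "json" && orca_type != "text") {
    *error = "invalid orca_type: " + orca_type;
    return false;
  }
  return true;
}
```

Rejecting the whole set when one value is bad keeps a load balancer from acting on a partially populated or inconsistent header.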
Integration with HTTP Server
The SetEndpointLoadMetricsHeader() function is called from the HTTP response path when the incoming request includes the endpoint-load-metrics-format header. The function is called with the evhtp request object, the requested format string, and a pointer to the Triton server instance, keeping the ORCA logic self-contained and minimally coupled to the HTTP server implementation.
Related Pages
- Implementation:Triton_inference_server_Server_OrcaHTTP
- Triton_inference_server_Server