Implementation:Triton inference server Server OrcaHTTP
| Knowledge Sources | |
|---|---|
| Domains | Load_Balancing, Metrics |
| Last Updated | 2026-02-13 17:00 GMT |
Overview
Concrete tool for extracting ORCA (Open Request Cost Aggregation) endpoint load metrics from Triton's Prometheus metrics and formatting them as HTTP headers for load-aware routing.
Description
The ORCA HTTP module queries Triton's Prometheus metrics endpoint, parses the KV-cache block metrics (tokens_per_block, used_blocks, max_blocks), and computes derived metrics such as kv_cache_utilization and max_token_capacity. These metrics are formatted into HTTP response headers following the ORCA protocol, so that external load balancers can route requests based on the server's current load.
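The derivation step can be sketched as follows. This is a minimal illustration and not the code in src/orca_http.cc: the helper name DeriveKvCacheMetrics is hypothetical, and the two formulas (utilization as used_blocks / max_blocks, capacity as max_blocks * tokens_per_block) are assumptions inferred from the metric names.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical helper (not part of orca_http.h): derive ORCA-style metrics
// from raw KV-cache block counts. Both formulas are assumptions made for
// illustration, inferred from the metric names.
std::unordered_map<std::string, double>
DeriveKvCacheMetrics(
    double tokens_per_block, double used_blocks, double max_blocks)
{
  std::unordered_map<std::string, double> out;
  // Guard against division by zero when no blocks are reported.
  out["kv_cache_utilization"] =
      (max_blocks > 0.0) ? used_blocks / max_blocks : 0.0;
  // Total number of tokens the cache could hold if every block were usable.
  out["max_token_capacity"] = max_blocks * tokens_per_block;
  return out;
}
```

Under these assumptions, calling DeriveKvCacheMetrics(64.0, 30.0, 40.0) gives a utilization of 0.75 and a capacity of 2560 tokens.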
Usage
Used by Triton's HTTP server when ORCA load reporting is enabled. Particularly relevant for LLM serving scenarios where KV-cache utilization is a key load indicator for routing decisions.
Code Reference
Source Location
- Repository: Triton Inference Server
- File: src/orca_http.h
- Lines: 1-67
- File: src/orca_http.cc
- Lines: 1-233
Signature
namespace triton { namespace server {

struct PromMetric {
  std::string name;
  std::unordered_map<std::string, std::string> labels;
  double value;
};

// Parse Prometheus metrics for KV-cache data
TRITONSERVER_Error* GetOrcaLoadMetrics(
    TRITONSERVER_Server* server,
    std::unordered_map<std::string, double>* metrics);

// Format metrics into ORCA HTTP headers
TRITONSERVER_Error* SetOrcaHeaders(
    evhtp_request_t* req,
    const std::unordered_map<std::string, double>& metrics);

}}  // namespace triton::server
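To show how a PromMetric could be filled from one line of Prometheus text output, here is a simplified sketch. ParsePromLine is a hypothetical helper, not a function declared in orca_http.h; it assumes C++17 (std::optional), label values without escaped quotes or closing braces, and it ignores any trailing timestamp. The struct is repeated only so the snippet compiles on its own.

```cpp
#include <cstddef>
#include <cstdlib>
#include <optional>
#include <string>
#include <unordered_map>

// Mirrors the struct from the signature above so the sketch is self-contained.
struct PromMetric {
  std::string name;
  std::unordered_map<std::string, std::string> labels;
  double value;
};

// Hypothetical helper: parse one sample line of the Prometheus text format,
// e.g.  metric_name{key="v",other="w"} 12.5
// Simplified: no escaped quotes or '}' inside label values.
std::optional<PromMetric>
ParsePromLine(const std::string& line)
{
  // Skip blank lines and "# HELP" / "# TYPE" comment lines.
  if (line.empty() || line[0] == '#') return std::nullopt;

  PromMetric m;
  std::size_t pos = line.find_first_of("{ ");
  if (pos == std::string::npos) return std::nullopt;
  m.name = line.substr(0, pos);

  if (line[pos] == '{') {
    const std::size_t close = line.find('}', pos);
    if (close == std::string::npos) return std::nullopt;
    std::size_t i = pos + 1;
    while (i < close) {
      const std::size_t eq = line.find("=\"", i);
      if (eq == std::string::npos || eq > close) break;
      const std::size_t end = line.find('"', eq + 2);
      if (end == std::string::npos || end > close) return std::nullopt;
      m.labels[line.substr(i, eq - i)] = line.substr(eq + 2, end - eq - 2);
      i = end + 1;
      if (i < close && line[i] == ',') ++i;  // step over the separator
    }
    pos = close + 1;
  }

  // std::strtod skips the whitespace that separates labels from the value.
  const char* start = line.c_str() + pos;
  char* parsed_end = nullptr;
  m.value = std::strtod(start, &parsed_end);
  if (parsed_end == start) return std::nullopt;  // no numeric value found
  return m;
}
```

For a made-up sample line such as example_kv_blocks{type="used",model="m"} 30, the sketch yields the name example_kv_blocks, two labels, and the value 30. A production parser would also need to handle escape sequences and the special values NaN and +Inf as the exposition format defines them.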
Import
#include "orca_http.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
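| server | TRITONSERVER_Server* | Yes | Triton server instance for metric queries (GetOrcaLoadMetrics) |
| req | evhtp_request_t* | Yes | HTTP request to attach headers to (SetOrcaHeaders) |
| metrics | const std::unordered_map<std::string, double>& | Yes | Metrics to format into headers (SetOrcaHeaders) |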
Outputs
| Name | Type | Description |
|---|---|---|
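| metrics | std::unordered_map<std::string, double> | KV-cache load metrics, written through the output pointer of GetOrcaLoadMetrics |
| HTTP headers | string | ORCA-format load metric headers added to the response by SetOrcaHeaders |
| return value | TRITONSERVER_Error* | nullptr on success; otherwise an error object returned by either function |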
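The formatting half of the contract can be illustrated with a small sketch that turns the metrics map into a single header value. FormatOrcaHeaderValue is a hypothetical name, and the comma-separated key=value layout is an assumption modelled on the example header shown on this page, not a statement of the exact ORCA wire format that SetOrcaHeaders emits.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <unordered_map>

// Hypothetical helper: render metrics as "key=value, key=value".
// The layout is an assumption for illustration only.
std::string
FormatOrcaHeaderValue(const std::unordered_map<std::string, double>& metrics)
{
  // Copy into an ordered map so the output order is deterministic.
  const std::map<std::string, double> sorted(metrics.begin(), metrics.end());

  std::ostringstream out;
  bool first = true;
  for (const auto& [name, value] : sorted) {
    if (!first) out << ", ";
    out << name << '=' << value;
    first = false;
  }
  return out.str();
}
```

Sorting the keys is a design choice for this sketch: std::unordered_map has no defined iteration order, so a stable order makes the header easier to test and to compare across responses.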
Usage Examples
ORCA Metrics in HTTP Response
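#include "orca_http.h"

// Inside an HTTP request handler
std::unordered_map<std::string, double> metrics;
TRITONSERVER_Error* err = GetOrcaLoadMetrics(server_, &metrics);
if (err == nullptr) {
  err = SetOrcaHeaders(req, metrics);
  // On success the response includes headers like:
  //   X-Endpoint-Load: kv_cache_utilization=0.75
}
if (err != nullptr) {
  // The caller owns a returned error object and must release it.
  TRITONSERVER_ErrorDelete(err);
}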