Implementation:Triton inference server Server OrcaHTTP

From Leeroopedia
Knowledge Sources
Domains Load_Balancing, Metrics
Last Updated 2026-02-13 17:00 GMT

Overview

Concrete tool for extracting ORCA (Open Request Cost Aggregation) endpoint load metrics from Triton's Prometheus metrics and formatting them as HTTP headers for load-aware routing.

Description

The ORCA HTTP module queries Triton's Prometheus metrics endpoint, parses the KV-cache block metrics (tokens_per_block, used_blocks, max_blocks), and computes derived metrics such as kv_cache_utilization and max_token_capacity. It then formats these metrics into HTTP response headers following the ORCA protocol, so that external load balancers can route requests based on real-time server load.
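The derivation step described above can be sketched as follows. This is a minimal illustration, not the module's actual code: the `KvCacheBlocks` struct and `DeriveKvCacheMetrics` function are hypothetical names, and both formulas (utilization as used blocks over max blocks, capacity as max blocks times tokens per block) are assumptions inferred from the metric names.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical container for the three raw KV-cache gauges named above.
struct KvCacheBlocks {
  double tokens_per_block;
  double used_blocks;
  double max_blocks;
};

// Derive the two summary metrics. The formulas are assumptions:
//   kv_cache_utilization = used_blocks / max_blocks
//   max_token_capacity   = max_blocks * tokens_per_block
// Returns false (leaving *derived untouched) when max_blocks is not
// positive, to avoid dividing by zero.
bool DeriveKvCacheMetrics(
    const KvCacheBlocks& blocks,
    std::unordered_map<std::string, double>* derived)
{
  if (blocks.max_blocks <= 0.0) {
    return false;
  }
  (*derived)["kv_cache_utilization"] = blocks.used_blocks / blocks.max_blocks;
  (*derived)["max_token_capacity"] =
      blocks.max_blocks * blocks.tokens_per_block;
  return true;
}
```

For example, 750 used blocks out of 1000, at 64 tokens per block, would give a utilization of 0.75 and a capacity of 64000 tokens under these assumptions.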

Usage

Used by Triton's HTTP server when ORCA load reporting is enabled. It is particularly relevant for LLM serving, where KV-cache utilization is a key load indicator for routing decisions.
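To illustrate how a consumer of these headers might act on them, here is a hedged sketch of a load balancer picking the backend with the lowest reported KV-cache utilization. The `Endpoint` struct and `PickLeastLoaded` function are hypothetical and not part of Triton. Real load balancers typically combine several signals rather than a single metric.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical view of one backend as seen by a load balancer.
struct Endpoint {
  std::string address;
  double kv_cache_utilization;  // last value reported by that backend
};

// Return the index of the endpoint with the lowest reported KV-cache
// utilization, or -1 if the list is empty. Ties go to the earliest entry.
int PickLeastLoaded(const std::vector<Endpoint>& endpoints)
{
  int best = -1;
  for (std::size_t i = 0; i < endpoints.size(); ++i) {
    if (best < 0 || endpoints[i].kv_cache_utilization <
                        endpoints[best].kv_cache_utilization) {
      best = static_cast<int>(i);
    }
  }
  return best;
}
```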

Code Reference

Source Location

Signature

namespace triton { namespace server {

struct PromMetric {
  std::string name;
  std::unordered_map<std::string, std::string> labels;
  double value;
};

// Parse Prometheus metrics for KV-cache data
TRITONSERVER_Error* GetOrcaLoadMetrics(
    TRITONSERVER_Server* server,
    std::unordered_map<std::string, double>* metrics);

// Format metrics into ORCA HTTP headers
TRITONSERVER_Error* SetOrcaHeaders(
    evhtp_request_t* req,
    const std::unordered_map<std::string, double>& metrics);

}} // namespace triton::server
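As a rough illustration of how a line of Prometheus text output could be turned into a `PromMetric`, the sketch below parses a single sample line. `ParsePromLine` is a hypothetical helper, not a function declared in `orca_http.h`, and it is deliberately simplified: it does not handle escaped characters, or `"`, `,` and `}` inside label values.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <unordered_map>

// Mirrors the PromMetric struct from the signature above.
struct PromMetric {
  std::string name;
  std::unordered_map<std::string, std::string> labels;
  double value;
};

// Hypothetical helper: parse one sample line of the Prometheus text format,
//   name{key="value",key2="value2"} 12.5
// Returns false for blank lines, '#' comment lines, and malformed input.
bool ParsePromLine(const std::string& line, PromMetric* out)
{
  if (line.empty() || line[0] == '#') {
    return false;
  }
  const std::size_t brace = line.find('{');
  const std::size_t name_end =
      (brace != std::string::npos) ? brace : line.find(' ');
  if (name_end == std::string::npos || name_end == 0) {
    return false;
  }
  out->name = line.substr(0, name_end);
  out->labels.clear();

  std::size_t value_start = name_end;
  if (brace != std::string::npos) {
    const std::size_t close = line.find('}', brace);
    if (close == std::string::npos) {
      return false;
    }
    // Walk the key="value" pairs between the braces.
    std::size_t pos = brace + 1;
    while (pos < close) {
      const std::size_t eq = line.find("=\"", pos);
      if (eq == std::string::npos || eq >= close) {
        return false;
      }
      const std::size_t end_quote = line.find('"', eq + 2);
      if (end_quote == std::string::npos || end_quote >= close) {
        return false;
      }
      out->labels[line.substr(pos, eq - pos)] =
          line.substr(eq + 2, end_quote - eq - 2);
      pos = end_quote + 1;
      if (pos < close && line[pos] == ',') {
        ++pos;  // skip the separator before the next pair
      }
    }
    value_start = close + 1;
  }

  // strtod skips leading whitespace; no progress means no numeric value.
  const char* begin = line.c_str() + value_start;
  char* end = nullptr;
  out->value = std::strtod(begin, &end);
  return end != begin;
}
```

Given an illustrative line such as `kv_cache_blocks{kv_cache_block_type="used"} 750` (the metric and label names here are placeholders), the helper would fill `name`, one entry in `labels`, and `value`.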

Import

#include "orca_http.h"

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| server | TRITONSERVER_Server* | Yes | Triton server instance for metric queries |
| req | evhtp_request_t* | Yes | HTTP request to attach headers to |

Outputs

| Name | Type | Description |
|------|------|-------------|
| metrics | unordered_map<string, double> | KV-cache utilization metrics |
| HTTP headers | string | ORCA-format load metric headers |
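The step from the `metrics` map to a header string can be sketched as below. `FormatLoadHeaderValue` is a hypothetical helper, and the comma-separated `key=value` layout is an assumption for illustration. The exact header name and encoding used by Triton are not specified on this page.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <unordered_map>

// Hypothetical helper: join metrics as "key=value" pairs separated by ", ".
// Keys are sorted so the output is deterministic regardless of hash order.
std::string FormatLoadHeaderValue(
    const std::unordered_map<std::string, double>& metrics)
{
  const std::map<std::string, double> sorted(metrics.begin(), metrics.end());
  std::ostringstream out;
  bool first = true;
  for (const auto& kv : sorted) {
    if (!first) {
      out << ", ";
    }
    out << kv.first << "=" << kv.second;
    first = false;
  }
  return out.str();
}
```

Sorting the keys is a design choice for reproducible output. A production implementation may instead emit metrics in a fixed, protocol-defined order.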

Usage Examples

ORCA Metrics in HTTP Response

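#include <string>
#include <unordered_map>

#include "orca_http.h"

// Inside HTTP request handler
std::unordered_map<std::string, double> metrics;
TRITONSERVER_Error* err = GetOrcaLoadMetrics(server_, &metrics);
if (err == nullptr) {
  // On success the response includes headers like:
  // X-Endpoint-Load: kv_cache_utilization=0.75
  err = SetOrcaHeaders(req, metrics);
}
if (err != nullptr) {
  // Load reporting is best-effort: release the error object so it does
  // not leak, and serve the response without ORCA headers.
  TRITONSERVER_ErrorDelete(err);
}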
