Implementation:Triton inference server Server OrcaHTTP

Knowledge Sources	Triton Inference Server
Domains	Load_Balancing, Metrics
Last Updated	2026-02-13 17:00 GMT

Overview

Concrete tool that extracts ORCA (Open Request Cost Aggregation) endpoint load metrics from Triton's Prometheus metrics and formats them as HTTP response headers for load-aware routing.

Description

The ORCA HTTP module queries Triton's Prometheus metrics endpoint, parses the KV-cache block metrics (tokens_per_block, used_blocks, max_blocks), and computes derived metrics such as kv_cache_utilization and max_token_capacity. It then formats these metrics into HTTP response headers following the ORCA protocol, so that external load balancers can route requests based on the server's current load.
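The parse-and-derive step described above can be sketched as follows. This is a minimal illustration, not the code in orca_http.cc: the names PromSample, ParsePromLine, and DeriveKvMetrics are hypothetical, the metric family name is passed in by the caller, and the label key kv_cache_block_type with values tokens_per, used, and max is an assumption. The formulas (utilization as used_blocks / max_blocks, capacity as max_blocks * tokens_per_block) are likewise assumed from the metric names.

```cpp
#include <exception>
#include <optional>
#include <sstream>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the PromMetric struct in the Signature section.
struct PromSample {
  std::string name;
  std::unordered_map<std::string, std::string> labels;
  double value = 0.0;
};

// Parse one line of Prometheus text exposition format, e.g.
//   family{key="val",key2="val2"} 12.5
// Returns nullopt for comments, blank lines, and malformed input.
// Simplification: label values must not contain '"', ',' or '}'.
std::optional<PromSample> ParsePromLine(const std::string& line)
{
  if (line.empty() || line[0] == '#') return std::nullopt;
  const std::size_t name_end = line.find_first_of("{ ");
  if (name_end == std::string::npos || name_end == 0) return std::nullopt;

  PromSample sample;
  sample.name = line.substr(0, name_end);
  std::size_t value_start = name_end;
  if (line[name_end] == '{') {
    const std::size_t close = line.find('}', name_end);
    if (close == std::string::npos) return std::nullopt;
    std::istringstream labels(line.substr(name_end + 1, close - name_end - 1));
    std::string pair;
    while (std::getline(labels, pair, ',')) {
      const std::size_t eq = pair.find('=');
      if (eq == std::string::npos || pair.size() < eq + 3) continue;
      // Drop the double quotes that surround the label value.
      sample.labels[pair.substr(0, eq)] =
          pair.substr(eq + 2, pair.size() - eq - 3);
    }
    value_start = close + 1;
  }
  try {
    sample.value = std::stod(line.substr(value_start));
  } catch (const std::exception&) {
    return std::nullopt;  // missing or non-numeric sample value
  }
  return sample;
}

// Scan the metrics text for one family and derive the two ORCA metrics.
// ASSUMPTION: the label key and label values below are illustrative.
std::unordered_map<std::string, double> DeriveKvMetrics(
    const std::string& prom_text, const std::string& family)
{
  double tokens_per_block = 0.0, used_blocks = 0.0, max_blocks = 0.0;
  std::istringstream in(prom_text);
  std::string line;
  while (std::getline(in, line)) {
    const auto sample = ParsePromLine(line);
    if (!sample || sample->name != family) continue;
    const auto it = sample->labels.find("kv_cache_block_type");
    if (it == sample->labels.end()) continue;
    if (it->second == "tokens_per") tokens_per_block = sample->value;
    if (it->second == "used") used_blocks = sample->value;
    if (it->second == "max") max_blocks = sample->value;
  }

  std::unordered_map<std::string, double> derived;
  if (max_blocks > 0.0) {  // avoid dividing by zero when no cache is reported
    derived["kv_cache_utilization"] = used_blocks / max_blocks;
    derived["max_token_capacity"] = max_blocks * tokens_per_block;
  }
  return derived;
}
```

With 30 used blocks out of 120 and 64 tokens per block, this sketch reports a utilization of 0.25 and a capacity of 7680 tokens. An empty result means the family was absent or reported no blocks, which a caller can treat as "no load data available".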

Usage

Used by Triton's HTTP server when ORCA load reporting is enabled. It is particularly relevant for LLM serving, where KV-cache utilization is a key load indicator for routing decisions.
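To illustrate how a router might consume the reported utilization, here is a minimal sketch of the load balancer side. It is not part of Triton: PickLeastLoaded, the endpoint names, and the 0.95 saturation threshold are all hypothetical, and a real balancer would also handle stale or missing reports.

```cpp
#include <optional>
#include <string>
#include <unordered_map>

// Hypothetical router-side helper (requires C++17).
// Input: the most recent kv_cache_utilization reported by each endpoint.
// Returns the endpoint with the lowest utilization, skipping endpoints at or
// above the saturation threshold; nullopt if every endpoint is saturated.
std::optional<std::string> PickLeastLoaded(
    const std::unordered_map<std::string, double>& utilization_by_endpoint,
    double saturation_threshold = 0.95)
{
  std::optional<std::string> best;
  double best_utilization = 0.0;
  for (const auto& [endpoint, utilization] : utilization_by_endpoint) {
    if (utilization >= saturation_threshold) continue;
    // Break ties on the endpoint name so the choice does not depend on the
    // iteration order of the hash map.
    const bool better = !best || utilization < best_utilization ||
                        (utilization == best_utilization && endpoint < *best);
    if (better) {
      best = endpoint;
      best_utilization = utilization;
    }
  }
  return best;
}
```

Given endpoints reporting 0.75, 0.40, and 0.97, the sketch selects the one at 0.40 and never considers the one at 0.97. Returning nullopt when every endpoint is saturated leaves the fallback policy (queue, reject, or pick any) to the caller.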

Code Reference

Source Location

Repository: Triton Inference Server
File: src/orca_http.h
Lines: 1-67
File: src/orca_http.cc
Lines: 1-233

Signature

namespace triton { namespace server {

struct PromMetric {
  std::string name;
  std::unordered_map<std::string, std::string> labels;
  double value;
};

// Parse Prometheus metrics for KV-cache data
TRITONSERVER_Error* GetOrcaLoadMetrics(
    TRITONSERVER_Server* server,
    std::unordered_map<std::string, double>* metrics);

// Format metrics into ORCA HTTP headers
TRITONSERVER_Error* SetOrcaHeaders(
    evhtp_request_t* req,
    const std::unordered_map<std::string, double>& metrics);

}} // namespace triton::server

Import

#include "orca_http.h"

I/O Contract

Inputs

Name	Type	Required	Description
server	TRITONSERVER_Server*	Yes	Triton server instance for metric queries
req	evhtp_request_t*	Yes	HTTP request to attach headers to

Outputs

Name	Type	Description
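metrics	unordered_map<string, double>	KV-cache load metrics such as kv_cache_utilization and max_token_capacity, written by GetOrcaLoadMetrics
HTTP headers	string	ORCA-format load metric headers attached to req by SetOrcaHeaders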
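The step from the metrics map to a header value can be sketched as below. FormatLoadHeaderValue is a hypothetical helper, and the comma-separated name=value layout is an assumption modeled on the example header shown under Usage Examples; the exact wire format is defined by the ORCA implementation, not by this sketch.

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: serialize metrics as "name=value" pairs joined by
// commas. Names are sorted so the output is stable across calls, since
// unordered_map iteration order is unspecified.
std::string FormatLoadHeaderValue(
    const std::unordered_map<std::string, double>& metrics)
{
  std::vector<std::string> names;
  names.reserve(metrics.size());
  for (const auto& entry : metrics) names.push_back(entry.first);
  std::sort(names.begin(), names.end());

  std::ostringstream out;
  for (std::size_t i = 0; i < names.size(); ++i) {
    if (i > 0) out << ',';
    out << names[i] << '=' << metrics.at(names[i]);
  }
  return out.str();
}
```

Sorting the names keeps the header byte-for-byte reproducible, which makes responses easier to compare in tests and logs. An empty map yields an empty string, so a caller can skip the header entirely in that case.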

Usage Examples

ORCA Metrics in HTTP Response

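#include "orca_http.h"

// Inside HTTP request handler
std::unordered_map<std::string, double> metrics;
TRITONSERVER_Error* err = GetOrcaLoadMetrics(server_, &metrics);
if (err == nullptr) {
  // Keep the result so a header-formatting failure is not silently dropped.
  err = SetOrcaHeaders(req, metrics);
  // On success the response includes headers like:
  // X-Endpoint-Load: kv_cache_utilization=0.75
}
if (err != nullptr) {
  // Errors returned by the Triton C API are owned by the caller; release them.
  TRITONSERVER_ErrorDelete(err);
}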

Related Pages

Environment:Triton_inference_server_Server_GPU_CUDA_Runtime
