Principle:Triton inference server Server HTTP Server Architecture
Overview
HTTP Server Architecture describes the event-driven design that underpins Triton Inference Server's HTTP/REST frontend. Built on the libevhtp library (itself layered atop libevent), the server provides a multi-threaded, non-blocking event loop that dispatches incoming inference requests to handler methods implementing the KFServing (now KServe) v2 community standard inference protocol, as well as Triton-specific extensions. The architecture is designed as an inheritance hierarchy: a base HTTPServer class manages connection lifecycle and threading, while derived classes (HTTPAPIServer, HTTPMetricsServer, and endpoint-specific servers) implement protocol-specific request routing and response formatting.
Theoretical Basis
Event-Driven Design for Inference Throughput
Inference serving is fundamentally an I/O-bound workload at the network layer: the server must accept many concurrent connections, parse requests, dispatch them to GPU-bound inference pipelines, and stream results back. An event-driven architecture suits this workload because it multiplexes many connections across a small thread pool, avoiding the per-connection overhead of thread-per-connection designs. Triton uses libevent's event_base for the event loop and libevhtp's evhtp_t for HTTP protocol handling, spawning a configurable number of worker threads (default 8) to process events in parallel.
Base HTTPServer Class
The base HTTPServer class encapsulates:
| Component | Purpose |
|---|---|
| port_, address_, reuse_port_ | Network binding configuration |
| thread_cnt_ | Number of evhtp worker threads |
| header_forward_regex_ | RE2 pattern for forwarding request headers to backends |
| conn_cnt_, conn_mu_ | Thread-safe active connection counting |
| fds_[2], break_ev_ | Socketpair and event for signaling graceful shutdown |
The Start() method initializes the event base, binds the socket, and launches worker threads. Stop() signals the event loop via the socketpair, joins the worker thread, and waits for active connections to drain within a configurable timeout.
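The socketpair-based shutdown signal can be illustrated with a minimal sketch. It is not Triton's code: it assumes a POSIX system, substitutes poll() for the libevent event loop, and the class name BreakableLoop and its members are hypothetical. The idea is that the loop thread blocks until one end of the socketpair becomes readable, and Stop() wakes it by writing a byte to the other end before joining.

```cpp
#include <cassert>
#include <cerrno>
#include <atomic>
#include <thread>

#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical stand-in for the event loop plus its shutdown socketpair.
// A real event loop would watch client sockets too; this one only watches
// the "break" descriptor.
class BreakableLoop {
 public:
  // Creates the socketpair and launches the loop thread.
  bool Start() {
    if (worker_.joinable()) return false;  // already running
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds_) != 0) return false;
    exited_.store(false);
    worker_ = std::thread([this] { Run(); });
    return true;
  }

  // Wakes the loop by writing one byte, then joins the loop thread.
  bool Stop() {
    if (!worker_.joinable()) return false;  // not running
    const char wake = 1;
    const ssize_t written = write(fds_[0], &wake, 1);
    assert(written == 1);  // a fresh socketpair has room for one byte
    worker_.join();
    close(fds_[0]);
    close(fds_[1]);
    return written == 1;
  }

  bool exited() const { return exited_.load(); }

 private:
  void Run() {
    pollfd pfd;
    pfd.fd = fds_[1];
    pfd.events = POLLIN;
    pfd.revents = 0;
    for (;;) {
      const int rc = poll(&pfd, 1, -1);  // block until the break fd is readable
      if (rc > 0 && (pfd.revents & POLLIN)) break;
      if (rc < 0 && errno != EINTR) break;  // unexpected error: give up
    }
    exited_.store(true);
  }

  int fds_[2] = {-1, -1};
  std::atomic<bool> exited_{false};
  std::thread worker_;
};
```

Calling Start() and then Stop() on a BreakableLoop returns once the loop thread has observed the wake-up byte and exited. Signaling through a file descriptor lets the shutdown request be handled like any other event, so the loop needs no separate polling flag.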
Connection Lifecycle Management
NewConnection and EndConnection hooks maintain a thread-safe count of active connections (conn_cnt_, guarded by conn_mu_). When Stop() is called, the server sets accepting_new_conn_ to false, causing NewConnection to reject incoming connections while existing ones are allowed to complete. This ensures in-flight inference requests finish before the server exits, provided they complete within the drain timeout.
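The counting and draining behavior can be sketched with standard C++ primitives. This is an illustration under assumptions, not the server's implementation: the class ConnectionTracker and the method StopAndDrain are hypothetical names, and a condition variable is used here to wait for the count to reach zero.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical sketch of connection counting with graceful draining.
class ConnectionTracker {
 public:
  // Returns false once shutdown has begun, meaning the caller should
  // reject the incoming connection.
  bool NewConnection() {
    std::lock_guard<std::mutex> lk(mu_);
    if (!accepting_) return false;
    ++count_;
    return true;
  }

  // Called when a connection closes; wakes the drainer when none remain.
  void EndConnection() {
    std::lock_guard<std::mutex> lk(mu_);
    assert(count_ > 0);
    if (--count_ == 0) drained_.notify_all();
  }

  // Stops accepting new connections, then waits up to `timeout` for the
  // active ones to finish. Returns true if every connection drained.
  bool StopAndDrain(std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lk(mu_);
    accepting_ = false;
    return drained_.wait_for(lk, timeout, [this] { return count_ == 0; });
  }

  int active() {
    std::lock_guard<std::mutex> lk(mu_);
    return count_;
  }

 private:
  std::mutex mu_;
  std::condition_variable drained_;
  int count_ = 0;
  bool accepting_ = true;
};
```

With one connection open, StopAndDrain returns false after the timeout and later NewConnection calls are rejected. Once EndConnection brings the count to zero, a further StopAndDrain returns true immediately.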
HTTPAPIServer: KFServing Protocol Implementation
The HTTPAPIServer derived class implements the full KFServing v2 REST API with RE2-based URL routing:
- /v2/health/live, /v2/health/ready: Server health endpoints
- /v2: Server metadata
- /v2/models/{model}/versions/{version}/infer: Inference
- /v2/models/{model}/config: Model configuration
- /v2/models/{model}/stats: Model statistics
- /v2/repository/index, /v2/repository/models/{model}/load|unload: Model management
- /v2/systemsharedmemory, /v2/cudasharedmemory: Shared memory regions
- /v2/trace: Tracing control
- /v2/logging: Log settings
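Regex-based routing for the inference endpoint can be sketched as follows. To stay self-contained the example uses std::regex in place of RE2, and the pattern, the InferRoute struct, and the MatchInferRoute function are illustrative assumptions rather than the expressions Triton actually compiles. The sketch also assumes the version segment is optional, as the KServe v2 protocol describes.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical result of matching a request path against the infer route.
struct InferRoute {
  bool matched = false;
  std::string model;
  std::string version;  // empty when the path carries no version segment
};

// Matches /v2/models/{model}[/versions/{version}]/infer and extracts the
// captured model name and version. std::regex stands in for RE2 here.
InferRoute MatchInferRoute(const std::string& path) {
  static const std::regex kInfer(
      R"(/v2/models/([^/]+)(?:/versions/([0-9]+))?/infer)");
  InferRoute route;
  std::smatch m;
  if (std::regex_match(path, m, kInfer)) {
    assert(m.size() == 3);  // whole match plus two capture groups
    route.matched = true;
    route.model = m[1].str();
    route.version = m[2].matched ? m[2].str() : std::string();
  }
  return route;
}
```

A path such as /v2/models/resnet/versions/3/infer yields model "resnet" and version "3", while /v2/health/live does not match and would fall through to the next route. Capturing the path parameters in the pattern keeps each handler free of manual string splitting.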
Additionally, Triton extends the protocol with /v2/models/{model}/generate and /v2/models/{model}/generate_stream for LLM text generation, implementing Server-Sent Events (SSE) for streaming responses.
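For the streaming endpoint, the Server-Sent Events wire format is simple enough to sketch. The function below is a hypothetical helper, not a Triton function. It frames one payload as an SSE message by prefixing each payload line with "data: " and ending the message with a blank line, which is the framing the SSE format defines.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Frames `payload` as one Server-Sent Events message. Each line of the
// payload becomes a "data: " field, and a blank line ends the message.
std::string FormatSseEvent(const std::string& payload) {
  std::string out;
  std::size_t start = 0;
  for (;;) {
    const std::size_t nl = payload.find('\n', start);
    out += "data: ";
    out += payload.substr(
        start, nl == std::string::npos ? std::string::npos : nl - start);
    out += "\n";
    if (nl == std::string::npos) break;
    start = nl + 1;
  }
  out += "\n";
  // Post-condition: every message ends with an empty line.
  assert(out.size() >= 2 && out.compare(out.size() - 2, 2, "\n\n") == 0);
  return out;
}
```

A streaming handler would write one such message per generated chunk on a response whose Content-Type is text/event-stream, so the client can parse events incrementally as they arrive.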
Request Processing Pipeline
Each inference request flows through a well-defined pipeline:
- Header extraction: compression type detection, content length, inference header length
- Decompression: if the request body is gzip/deflate encoded
- JSON parsing: the request body is parsed into the KFServing inference request schema
- Triton core dispatch: the parsed request is converted to a TRITONSERVER_InferenceRequest and submitted via TRITONSERVER_ServerInferAsync
- Asynchronous response: the InferRequestClass captures the evhtp thread context so the response callback can send the reply on the correct thread
- Response compression: if the client accepts compressed responses
- Reply: the serialized response (JSON + optional binary tensor data) is sent back
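One step of this pipeline, separating the JSON portion of a body from trailing binary tensor data, can be sketched as below. The sketch assumes the inference header length has already been read from the request headers (Triton's binary tensor data extension carries it in Inference-Header-Content-Length) and that a length of zero means the header was absent. The function name SplitInferBody and its signature are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Splits a (decompressed) request body into its JSON part and the raw
// binary tensor bytes that follow it. `header_length` is the JSON size
// taken from the request headers; 0 is treated as "header absent", in
// which case the whole body is JSON. Returns false if the declared
// length exceeds the body size.
bool SplitInferBody(const std::string& body, std::size_t header_length,
                    std::string* json, std::string* binary) {
  assert(json != nullptr && binary != nullptr);
  if (header_length == 0) {
    *json = body;
    binary->clear();
    return true;
  }
  if (header_length > body.size()) return false;  // malformed request
  *json = body.substr(0, header_length);
  *binary = body.substr(header_length);
  return true;
}
```

Only the JSON part is handed to the parser. The binary remainder can be attached to the inference request as raw input data without being copied through a JSON representation.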
MappingSchema for Generate Endpoints
The generate/LLM endpoints use a MappingSchema system that defines how JSON request fields map to Triton inputs and how Triton outputs map back to JSON response fields. The schema supports EXACT_MAPPING (direct name-to-name mapping) and MAPPING_SCHEMA (nested mapping with parameters). This abstraction allows flexible request/response formats for different LLM serving protocols.
Thread Affinity for Response Delivery
A critical design constraint is that evhtp requires responses to be sent from the same thread that accepted the request. The InferRequestClass captures the originating evthr_t* thread handle, and the asynchronous response callback uses evthr_defer to schedule response delivery on the correct thread. This design enables safe asynchronous inference without blocking the event loop.
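The hand-off back to the originating thread can be illustrated without libevhtp. In the sketch below, the Worker class owns one thread and a task queue, and its Defer method plays a role loosely analogous to evthr_defer. Worker, Defer, and CompleteOnOwner are all hypothetical names introduced for illustration. A separate thread stands in for the inference backend that completes the request.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical worker that runs deferred tasks on its own thread.
class Worker {
 public:
  Worker() : thread_([this] { Run(); }) {}
  ~Worker() {
    {
      std::lock_guard<std::mutex> lk(mu_);
      stop_ = true;
    }
    cv_.notify_one();
    thread_.join();
  }

  // Queues `task` to run on this worker's thread (like evthr_defer in spirit).
  void Defer(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lk(mu_);
      assert(!stop_);
      tasks_.push(std::move(task));
    }
    cv_.notify_one();
  }

  std::thread::id id() const { return thread_.get_id(); }

 private:
  void Run() {
    std::unique_lock<std::mutex> lk(mu_);
    for (;;) {
      cv_.wait(lk, [this] { return stop_ || !tasks_.empty(); });
      if (tasks_.empty()) return;  // stop requested and queue drained
      std::function<void()> task = std::move(tasks_.front());
      tasks_.pop();
      lk.unlock();
      task();  // run outside the lock so tasks may call Defer
      lk.lock();
    }
  }

  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> tasks_;
  bool stop_ = false;
  std::thread thread_;  // declared last: starts after the members above exist
};

// Simulates a completion callback firing on a backend thread and handing
// the reply to `owner`. Returns the id of the thread that "sent" the reply.
std::thread::id CompleteOnOwner(Worker& owner) {
  auto sent_on = std::make_shared<std::promise<std::thread::id>>();
  std::future<std::thread::id> done = sent_on->get_future();
  std::thread backend([&owner, sent_on] {
    owner.Defer([sent_on] { sent_on->set_value(std::this_thread::get_id()); });
  });
  backend.join();
  return done.get();
}
```

The id returned by CompleteOnOwner equals the worker's own thread id, not the backend's, which is the property the response path depends on. The backend thread only enqueues work and never touches the connection directly.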
Related Pages
Implementation:Triton_inference_server_Server_HTTPServer Triton_inference_server_Server