Principle:Triton inference server Server API Access Restriction
Overview
API Access Restriction is the principle governing how Triton Inference Server provides feature-level access control to its HTTP API endpoints through a header-based authentication mechanism. The RestrictedFeatures class implements a lightweight, category-based restriction system where individual API groups (health, metadata, inference, shared memory, model configuration, model repository, statistics, trace, logging) can each be independently gated behind a required HTTP header and value. This allows operators to selectively protect sensitive endpoints while leaving others open, without requiring a full external authentication service.
Theoretical Basis
Why API Access Restriction for Inference Servers
Production inference servers expose administrative APIs alongside inference endpoints. The model repository control API can load and unload models; the trace API can enable expensive tracing; the logging API can change log verbosity; the shared memory API can register memory regions that bypass normal input validation. Exposing these APIs without access control creates operational and security risks:
- Unauthorized model management: An attacker or misconfigured client could unload production models or load malicious ones.
- Resource exhaustion: Enabling tracing or increasing log verbosity can degrade inference throughput.
- Information disclosure: Model metadata and statistics may reveal proprietary architecture details.
A header-based restriction mechanism provides a simple but effective access control layer that can be applied selectively to each API category.
Restriction Categories
The system defines nine distinct API categories, each independently controllable:
| Category | Enum Value | Protected Endpoints |
|---|---|---|
| health | HEALTH | /v2/health/live, /v2/health/ready |
| metadata | METADATA | /v2, /v2/models/{model} metadata |
| inference | INFERENCE | /v2/models/{model}/infer, generate endpoints |
| shared-memory | SHARED_MEMORY | System and CUDA shared memory APIs |
| model-config | MODEL_CONFIG | /v2/models/{model}/config |
| model-repository | MODEL_REPOSITORY | Load, unload, repository index |
| statistics | STATISTICS | /v2/models/{model}/stats |
| trace | TRACE | /v2/trace settings |
| logging | LOGGING | /v2/logging settings |
Header-Value Authentication Model
Each restricted category is associated with a Restriction -- a pair of (header_name, expected_value). When a request arrives for a restricted endpoint, the server checks whether the request includes the specified header with the expected value. If the header is missing or the value does not match, the server returns an HTTP 400 error. This model supports several deployment patterns:
- Shared secret: Set a custom header like X-Triton-Auth: my-secret-token for all administrative endpoints.
- Per-category secrets: Use different header/value combinations for different categories, enabling different teams to have access to different API groups.
- Proxy integration: An upstream reverse proxy or API gateway can inject the required header after performing its own authentication, keeping the actual secret out of client code.
Configuration via CLI
Restrictions are configured through the --http-restricted-api CLI flag, which is parsed by the TritonParser via ParseRestrictedFeatureOption(). The format is:
--http-restricted-api <category>:<header>=<value>
For example:
--http-restricted-api model-repository:X-Admin-Key=secret123
--http-restricted-api trace:X-Debug-Key=trace-token
Multiple flags can be specified to restrict different categories independently.
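To make the option format concrete, here is a minimal sketch of splitting a `<category>:<header>=<value>` argument into its three parts. It is not the code of ParseRestrictedFeatureOption(); the `RestrictionSpec` struct and `ParseRestrictionSpec` function are hypothetical names introduced only for illustration, and the rule that all three parts must be non-empty is an assumption.

```cpp
#include <string>

// Hypothetical holder for the three parts of one option value.
struct RestrictionSpec {
  std::string category;
  std::string header;
  std::string value;
};

// Splits "<category>:<header>=<value>" at the first ':' and at the first
// '=' that follows it. Returns false (leaving *spec untouched) when a
// separator is missing or any part is empty.
bool ParseRestrictionSpec(const std::string& arg, RestrictionSpec* spec)
{
  const std::string::size_type colon = arg.find(':');
  if (colon == std::string::npos) {
    return false;
  }
  const std::string::size_type equals = arg.find('=', colon + 1);
  if (equals == std::string::npos) {
    return false;
  }

  RestrictionSpec parsed;
  parsed.category = arg.substr(0, colon);
  parsed.header = arg.substr(colon + 1, equals - colon - 1);
  parsed.value = arg.substr(equals + 1);  // may itself contain '='
  if (parsed.category.empty() || parsed.header.empty() ||
      parsed.value.empty()) {
    return false;
  }

  *spec = parsed;
  return true;
}
```

With this sketch, the argument model-repository:X-Admin-Key=secret123 yields the category model-repository, the header X-Admin-Key, and the value secret123. The category string would then still need to be mapped to an enum value and rejected if unknown.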
Implementation Design
The RestrictedFeatures class uses fixed-size arrays indexed by the RestrictedCategory enum, providing O(1) lookup for both restriction checking and header retrieval:
std::array<Restriction, CATEGORY_COUNT> restrictions_{};
std::array<bool, CATEGORY_COUNT> restricted_categories_{};
The ToCategory() static method converts a category name string to the enum by searching the RESTRICTED_CATEGORY_NAMES array, returning INVALID for unknown names. The Insert() method stores a restriction and marks the category as restricted. The IsRestricted() and Get() methods provide fast, constant-time access during request processing.
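The design described above can be sketched as follows. The class, method, and member names come from the description; the enum ordering, the placement of INVALID and CATEGORY_COUNT, the exact method signatures, and the use of std::pair for Restriction are assumptions made so the example is self-contained, and may differ from the actual source.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <utility>

// A restriction is a (header_name, expected_value) pair.
using Restriction = std::pair<std::string, std::string>;

// Assumed layout: valid categories first, INVALID last, and
// CATEGORY_COUNT aliasing INVALID so it equals the number of valid entries.
enum RestrictedCategory {
  HEALTH = 0,
  METADATA,
  INFERENCE,
  SHARED_MEMORY,
  MODEL_CONFIG,
  MODEL_REPOSITORY,
  STATISTICS,
  TRACE,
  LOGGING,
  INVALID,
  CATEGORY_COUNT = INVALID
};

// Category names in the same order as the enum values.
const std::array<const char*, CATEGORY_COUNT> RESTRICTED_CATEGORY_NAMES = {
    {"health", "metadata", "inference", "shared-memory", "model-config",
     "model-repository", "statistics", "trace", "logging"}};

class RestrictedFeatures {
 public:
  // Linear search over the nine names; returns INVALID for unknown names.
  static RestrictedCategory ToCategory(const std::string& name)
  {
    for (std::size_t i = 0; i < RESTRICTED_CATEGORY_NAMES.size(); ++i) {
      if (name == RESTRICTED_CATEGORY_NAMES[i]) {
        return static_cast<RestrictedCategory>(i);
      }
    }
    return INVALID;
  }

  // Stores the restriction and marks the category as restricted.
  // Callers are assumed to pass a valid category, never INVALID.
  void Insert(RestrictedCategory category, Restriction restriction)
  {
    restrictions_[category] = std::move(restriction);
    restricted_categories_[category] = true;
  }

  // Constant-time lookups used on the request path.
  bool IsRestricted(RestrictedCategory category) const
  {
    return restricted_categories_[category];
  }

  const Restriction& Get(RestrictedCategory category) const
  {
    return restrictions_[category];
  }

 private:
  std::array<Restriction, CATEGORY_COUNT> restrictions_{};
  std::array<bool, CATEGORY_COUNT> restricted_categories_{};
};
```

Keeping a separate boolean array means an unrestricted category is distinguishable from one whose restriction happens to hold empty strings, and both lookups stay a single array index.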
HTTP Server Integration
The HTTPAPIServer stores a RestrictedFeatures instance and calls RespondIfRestricted() at the beginning of each handler method. This method checks whether the request's endpoint category is restricted and, if so, verifies the header. The check is performed before any request processing, ensuring that restricted requests are rejected early without consuming inference resources.
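The early-rejection pattern can be illustrated with a simplified sketch. The `Request` struct, the `HandleGuarded` function, the error text, and the signature given to `RespondIfRestricted` here are all hypothetical; the real handlers operate on the HTTP library's request objects and send an HTTP error response rather than filling a string. Header names are also matched case-sensitively here for brevity, which a real HTTP server would not do.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>

using Restriction = std::pair<std::string, std::string>;

// Hypothetical, simplified request: only the headers matter here.
struct Request {
  std::map<std::string, std::string> headers;
};

// Returns true when the request was rejected, i.e. the required header is
// missing or its value differs from the expected one.
bool RespondIfRestricted(
    const Request& req, const Restriction& restriction, std::string* error)
{
  const auto it = req.headers.find(restriction.first);
  if (it != req.headers.end() && it->second == restriction.second) {
    return false;  // header present with the expected value
  }
  *error = "restricted API: missing or invalid header '" +
           restriction.first + "'";
  return true;
}

// Hypothetical handler shape: the guard runs first, so `do_work` (standing
// in for request parsing and execution) never runs for rejected requests.
// Returns true when the work was performed.
bool HandleGuarded(
    const Request& req, bool is_restricted, const Restriction& restriction,
    const std::function<void()>& do_work, std::string* error)
{
  if (is_restricted && RespondIfRestricted(req, restriction, error)) {
    return false;
  }
  do_work();
  return true;
}
```

Because the guard is the first statement of the handler, a rejected request costs one header lookup and one string comparison, and unrestricted categories skip the header check entirely.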
Lightweight by Design
The restriction mechanism is intentionally simple: no user databases, no session management, no token refresh. For production deployments requiring robust authentication, operators typically deploy an API gateway (Envoy, Istio, Kong) in front of Triton. The built-in restriction mechanism serves as a defense-in-depth measure and a convenient solution for development and testing environments where full authentication infrastructure is not available.
Related Pages
Implementation:Triton_inference_server_Server_RestrictedFeatures Triton_inference_server_Server