Principle:Triton inference server Server Vertex AI Integration

Overview

Vertex AI Integration is the principle governing how Triton Inference Server presents a Google Vertex AI Prediction-compatible HTTP endpoint, enabling Triton to be deployed as a custom prediction container within Google Cloud's Vertex AI platform. The VertexAiAPIServer class extends Triton's base HTTP API server to implement Vertex AI's endpoint contract -- a configurable prediction route for inference requests and a health route for container lifecycle management -- while mapping these to Triton's standard inference pipeline internally.

Theoretical Basis

Why Vertex AI Compatibility Matters

Google Vertex AI is a fully managed ML platform that provides infrastructure for training, deploying, and managing ML models at scale. Vertex AI custom prediction containers must implement a specific HTTP contract defined by Google: a prediction endpoint at a configurable route (typically /v1/endpoints/{endpoint}/deployedModels/{model}:predict) and a health endpoint (typically /v1/endpoints/{endpoint}/deployedModels/{model}). By implementing this contract natively, Triton can serve as a Vertex AI custom container without requiring a proxy or adapter layer, giving users access to Triton's multi-framework inference, dynamic batching, and GPU optimization within the Vertex AI managed environment.

Endpoint Route Configuration

Vertex AI communicates the expected routes to the container through environment variables:

Environment Variable	Purpose	Default
`AIP_PREDICT_ROUTE`	Path for prediction (inference) requests	(required by Vertex AI)
`AIP_HEALTH_ROUTE`	Path for health check requests	(required by Vertex AI)
`AIP_HTTP_PORT`	Port to listen on	8080
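
A container entrypoint could collect these variables roughly as follows. This is a minimal sketch, not Triton's startup code: the `AipSettings` type and the `read_aip_settings` helper are hypothetical names, and the 8080 fallback is taken from the table above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class AipSettings:
    """Hypothetical container for the Vertex AI routing variables."""
    predict_route: str
    health_route: str
    http_port: int


def read_aip_settings(env: Optional[Mapping[str, str]] = None) -> AipSettings:
    """Read the AIP_* variables from an environment mapping.

    The two routes are treated as required; the port falls back to 8080.
    """
    env = os.environ if env is None else env
    required = ("AIP_PREDICT_ROUTE", "AIP_HEALTH_ROUTE")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"missing required variables: {', '.join(missing)}")
    return AipSettings(
        predict_route=env["AIP_PREDICT_ROUTE"],
        health_route=env["AIP_HEALTH_ROUTE"],
        http_port=int(env.get("AIP_HTTP_PORT", "8080")),
    )
```

Passing an explicit mapping instead of relying on `os.environ` keeps the helper easy to exercise outside a Vertex AI container.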

The VertexAiAPIServer constructor accepts the prediction route, health route, and a default model name, constructing RE2 regular expressions for URL matching. The prediction regex captures model name and version from the URL path, while the health regex matches the health endpoint.
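
The route matching can be illustrated with a small sketch. It uses Python's `re` module in place of RE2 (the two differ in some syntax details), the `classify_path` helper is a hypothetical name, and the capture of model name and version described above is omitted for brevity.

```python
import re


def classify_path(path: str, predict_route: str, health_route: str) -> str:
    """Return "health", "predict", or "unknown" for a request path.

    The configured routes are used as regular expressions and must match
    the entire path, not just a prefix of it.
    """
    if re.fullmatch(health_route, path):
        return "health"
    if re.fullmatch(predict_route, path):
        return "predict"
    return "unknown"


if __name__ == "__main__":
    predict = "/v1/endpoints/123/deployedModels/456:predict"
    health = "/v1/endpoints/123/deployedModels/456"
    for candidate in (health, predict, "/v2/health/live"):
        print(candidate, "->", classify_path(candidate, predict, health))
```

Requiring a full match matters here because the typical health route is a strict prefix of the typical prediction route.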

Default Model Routing

Vertex AI deployments typically serve a single model per endpoint. The server supports a vertex_ai_default_model_ configuration (set via the --vertex-ai-default-model CLI flag) that specifies which model to route inference requests to when the URL does not explicitly name a model. This simplifies the Vertex AI integration by allowing the prediction route to work without requiring the client to know the Triton-internal model name.
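
The fallback rule can be sketched in a few lines. The `resolve_model` function is a hypothetical illustration of the behavior described above, not the server's own code, and raising an error when neither name is available is an assumption of this sketch.

```python
from typing import Optional


def resolve_model(requested: Optional[str], default_model: Optional[str]) -> str:
    """Pick the model that should handle a prediction request.

    A model named explicitly by the request wins; otherwise the configured
    default is used. With neither, the request cannot be routed.
    """
    if requested:
        return requested
    if default_model:
        return default_model
    raise ValueError("no model named in the request and no default configured")
```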

Request Format Translation

Vertex AI prediction requests arrive in a JSON format that may differ from the KServe (formerly KFServing) v2 inference protocol. The server handles this translation through its inherited HTTPAPIServer infrastructure, with the following specializations:

Inference header length: Vertex AI does not use the Inference-Header-Content-Length header. The override returns the full content length as the header length, indicating the entire body is the inference JSON.
Compression: Both request and response compression types are hardcoded to IDENTITY since Vertex AI's compression protocol is not yet defined.
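
The header-length override can be contrasted with the header-driven behavior in a short sketch. All function names here are hypothetical, and the treatment of a missing header in `standard_header_length` (whole body is JSON) is an assumption about the standard endpoint rather than something stated above.

```python
from typing import Optional, Tuple


def standard_header_length(content_length: int, header_value: Optional[str]) -> int:
    """JSON length when Inference-Header-Content-Length may be present."""
    if header_value is None:
        return content_length  # assumed: no header means the whole body is JSON
    length = int(header_value)
    if not 0 <= length <= content_length:
        raise ValueError("Inference-Header-Content-Length out of range")
    return length


def vertex_header_length(content_length: int, header_value: Optional[str] = None) -> int:
    """Vertex AI variant: ignore the header and treat the whole body as JSON."""
    return content_length


def split_body(body: bytes, header_length: int) -> Tuple[bytes, bytes]:
    """Split a request body into its JSON part and any trailing binary part."""
    return body[:header_length], body[header_length:]
```

With the Vertex AI variant, `split_body` always yields an empty binary part, which is the practical meaning of "the entire body is the inference JSON".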

Health Check Implementation

The health endpoint delegates to Triton's standard health check mechanism via HandleServerHealth(). The health_mode_ field (read from the AIP_HEALTH_ROUTE environment variable or defaulting to "live") determines whether the health check reports liveness or readiness, matching Vertex AI's container health check expectations.
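
The liveness/readiness choice can be pictured as a simple dispatch on the mode. This is a hedged sketch: `health_response_status` is a hypothetical helper, and the 200/400 status codes are assumptions chosen for illustration rather than values confirmed by the text above.

```python
def health_response_status(mode: str, is_live: bool, is_ready: bool) -> int:
    """Map a health mode and the server state to an HTTP status code.

    "live" reports whether the server process is responsive; "ready"
    reports whether it is able to serve inference requests.
    """
    if mode == "live":
        healthy = is_live
    elif mode == "ready":
        healthy = is_ready
    else:
        raise ValueError(f"unknown health mode: {mode!r}")
    return 200 if healthy else 400  # assumed status codes
```

A server that is live but still loading models would pass the check in "live" mode and fail it in "ready" mode.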

Metrics Endpoint

The Vertex AI server additionally handles a /metrics endpoint (via HandleMetrics()) that returns Prometheus-formatted metrics. This enables Vertex AI's monitoring infrastructure to scrape Triton's metrics through the same port used for prediction, avoiding the need for a separate metrics port configuration in the Vertex AI container spec.

Server Architecture

The VertexAiAPIServer inherits from HTTPAPIServer, which in turn inherits from HTTPServer, so it reuses the entire evhtp-based event-driven HTTP infrastructure. The Vertex AI endpoint therefore benefits from the same multi-threaded event loop, connection management, shared memory integration, and tracing support as the standard Triton HTTP endpoint. The only differences are the URL routing patterns, the response header format, and the compression behavior.

Port and Address Configuration

The Vertex AI endpoint listens on a separate port (default 8080, configurable via --vertex-ai-port) and address (default 0.0.0.0, configurable via --vertex-ai-address). Triton's port collision detection (CheckPortCollision() in the parameter struct) ensures the Vertex AI port does not conflict with the standard HTTP, gRPC, or metrics ports.
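
A collision check of this kind might look like the sketch below. It does not reproduce the logic of CheckPortCollision(): the `find_port_collisions` helper is hypothetical, and the rule that the wildcard address 0.0.0.0 overlaps with every other address is a simplifying assumption.

```python
from typing import Dict, List, Tuple


def find_port_collisions(endpoints: Dict[str, Tuple[str, int]]) -> List[Tuple[str, str]]:
    """Return pairs of endpoint labels that would bind the same port.

    `endpoints` maps a label to (address, port). Two endpoints collide when
    their ports are equal and their addresses are equal or either one is
    the wildcard 0.0.0.0.
    """
    names = sorted(endpoints)
    collisions = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            addr_a, port_a = endpoints[first]
            addr_b, port_b = endpoints[second]
            same_addr = addr_a == addr_b or "0.0.0.0" in (addr_a, addr_b)
            if port_a == port_b and same_addr:
                collisions.append((first, second))
    return collisions


if __name__ == "__main__":
    # Triton's usual defaults for HTTP, gRPC, and metrics, plus Vertex AI on 8080.
    config = {
        "http": ("0.0.0.0", 8000),
        "grpc": ("0.0.0.0", 8001),
        "metrics": ("0.0.0.0", 8002),
        "vertex-ai": ("0.0.0.0", 8080),
    }
    print(find_port_collisions(config))
```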

Conditional Compilation

The entire Vertex AI integration is guarded by the TRITON_ENABLE_VERTEX_AI preprocessor flag. When not enabled, the Vertex AI server code, CLI options, and parameter fields are excluded from the build, keeping the binary size minimal for deployments that do not target Google Cloud.

Binary MIME Type and Redirect Headers

The server defines static constants for the binary MIME type (application/octet-stream) and redirect header used in Vertex AI response formatting, ensuring consistent header values across all response paths.
