Implementation:Kserve Kserve Triton Runtime
| Knowledge Sources | |
|---|---|
| Domains | Kubernetes, Model Serving |
| Last Updated | 2026-02-13 00:00 GMT |
Overview
Concrete ClusterServingRuntime for NVIDIA Triton Inference Server provided by the KServe project.
Description
This file defines a ClusterServingRuntime named kserve-tritonserver that serves multiple model formats with NVIDIA Triton Inference Server. It supports tensorrt (version 8), tensorflow (versions 1 and 2), onnx (version 1), and triton (version 2) at priority 1 with auto-select enabled, plus pytorch (version 1) without auto-select. The runtime speaks the v2 and grpc-v2 protocols. Its container runs the tritonserver command with HTTP on port 8080 and gRPC on port 9000, exposes Prometheus metrics on port 8002 at /metrics, and runs as non-root user 1000 under a strict security context. Because a single runtime covers five model formats, it is a versatile choice for GPU-optimized inference.
Usage
This ClusterServingRuntime is applied cluster-wide. Because auto-select is enabled at priority 1 for tensorrt, tensorflow, onnx, and triton, users can create an InferenceService with one of those model formats and KServe selects this runtime automatically, without the runtime being named. It is intended for GPU-accelerated inference on NVIDIA hardware. PyTorch models require explicit runtime selection because auto-select is not enabled for that format.
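As a minimal sketch of explicit selection for PyTorch, an InferenceService can name the runtime in the `runtime` field of the predictor model spec. The resource name `torchscript-example` and the storage URI are hypothetical placeholders, not values taken from the KServe repository.

```yaml
# Sketch only: metadata.name and storageUri are hypothetical placeholders.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: torchscript-example
spec:
  predictor:
    model:
      modelFormat:
        name: pytorch
      # Named explicitly because the pytorch entry has no autoSelect in this runtime.
      runtime: kserve-tritonserver
      storageUri: "gs://my-bucket/torchscript/example"
```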
Code Reference
Source Location
- Repository: Kserve_Kserve
- File: config/runtimes/kserve-tritonserver.yaml
- Lines: 1-59
Signature
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: kserve-tritonserver
spec:
  annotations:
    prometheus.kserve.io/port: '8002'
    prometheus.kserve.io/path: "/metrics"
  supportedModelFormats:
    - name: tensorrt
      version: "8"
      autoSelect: true
      priority: 1
    - name: tensorflow
      version: "1"
      autoSelect: true
      priority: 1
    - name: tensorflow
      version: "2"
      autoSelect: true
      priority: 1
    - name: onnx
      version: "1"
      autoSelect: true
      priority: 1
    - name: pytorch
      version: "1"
    - name: triton
      version: "2"
      autoSelect: true
      priority: 1
  protocolVersions:
    - v2
    - grpc-v2
  containers:
    - name: kserve-container
      image: kserve-tritonserver:replace
      args:
        - tritonserver
        - --model-store=/mnt/models
        - --grpc-port=9000
        - --http-port=8080
        - --allow-grpc=true
        - --allow-http=true
      securityContext:
        runAsUser: 1000
Import
kubectl apply -f config/runtimes/kserve-tritonserver.yaml
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| Model artifacts | storage URI | Yes | Model files at /mnt/models (provided by KServe storage initializer) |
Outputs
| Name | Type | Description |
|---|---|---|
| ClusterServingRuntime | Custom Resource | Triton runtime available cluster-wide for tensorrt, tensorflow, onnx, pytorch, and triton models |
| HTTP inference endpoint | TCP port 8080 | V2 inference protocol HTTP endpoint |
| gRPC inference endpoint | TCP port 9000 | V2 inference protocol over gRPC (grpc-v2) |
| Prometheus metrics | HTTP port 8002 /metrics | Model serving and GPU metrics endpoint |
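To reach the gRPC endpoint through an InferenceService, one common approach is to declare the container port on the predictor so that traffic is routed to port 9000 instead of the HTTP port. The sketch below assumes a serverless (Knative-based) deployment, where the port name `h2c` marks HTTP/2 cleartext traffic. The resource name and storage URI are hypothetical placeholders.

```yaml
# Sketch only: metadata.name and storageUri are hypothetical placeholders.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: onnx-grpc-example
spec:
  predictor:
    model:
      modelFormat:
        name: onnx
      # One of the protocolVersions advertised by kserve-tritonserver.
      protocolVersion: grpc-v2
      storageUri: "gs://my-bucket/onnx/example"
      ports:
        # Matches --grpc-port=9000 in the runtime's container args.
        - containerPort: 9000
          name: h2c
          protocol: TCP
```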
Usage Examples
Apply the runtime
kubectl apply -f config/runtimes/kserve-tritonserver.yaml
Create a TensorRT InferenceService
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: tensorrt-resnet
spec:
  predictor:
    model:
      modelFormat:
        name: tensorrt
      storageUri: "gs://my-bucket/tensorrt/resnet"
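The TensorRT example above does not request a GPU. Assuming the cluster exposes NVIDIA GPUs under the device-plugin resource name `nvidia.com/gpu`, a variant could add a resource limit to the predictor. The resource name `tensorrt-resnet-gpu` and the limit of one GPU are illustrative choices, not values from the runtime file.

```yaml
# Sketch only: assumes the NVIDIA device plugin is installed on the cluster.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: tensorrt-resnet-gpu
spec:
  predictor:
    model:
      modelFormat:
        name: tensorrt
      storageUri: "gs://my-bucket/tensorrt/resnet"
      resources:
        limits:
          # Illustrative: schedule the predictor onto a node with one free GPU.
          nvidia.com/gpu: 1
```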