Implementation:Kserve Kserve Triton Runtime

From Leeroopedia
Knowledge Sources
Domains Kubernetes, Model Serving
Last Updated 2026-02-13 00:00 GMT

Overview

Concrete ClusterServingRuntime for NVIDIA Triton Inference Server provided by the KServe project.

Description

This file defines a ClusterServingRuntime named kserve-tritonserver that serves multiple model formats with NVIDIA Triton Inference Server. It declares tensorrt (version 8), tensorflow (versions 1 and 2), onnx (version 1), and triton (version 2) at priority 1 with auto-select enabled, and pytorch (version 1) without auto-select. The runtime supports the v2 (HTTP) and grpc-v2 protocols, starts tritonserver with HTTP on port 8080 and gRPC on port 9000, advertises Prometheus metrics on port 8002 at /metrics through annotations, and runs as non-root user 1000 with a strict security context. Because a single runtime covers five model formats, it is a versatile choice for GPU-accelerated inference.

Usage

This ClusterServingRuntime is applied cluster-wide and auto-selected for tensorrt, tensorflow, onnx, and triton models at priority 1. It is the preferred runtime for GPU-accelerated inference using NVIDIA hardware. Users create InferenceService resources with the appropriate model format and KServe automatically selects this runtime. PyTorch models require explicit runtime selection as auto-select is not enabled for that format.
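
Because the pytorch entry has no autoSelect flag, an InferenceService has to name this runtime explicitly. The following is a minimal sketch of what that could look like; the resource name and storageUri are hypothetical placeholders, and the model files are assumed to be laid out in a form Triton can load. Without the runtime field, KServe may pick a different installed runtime that auto-selects pytorch, or fail to find one.

```yaml
# Sketch only: metadata.name and storageUri are illustrative placeholders.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: pytorch-on-triton            # hypothetical name
spec:
  predictor:
    model:
      modelFormat:
        name: pytorch                # listed in supportedModelFormats, but without autoSelect
      runtime: kserve-tritonserver   # explicit selection of this ClusterServingRuntime
      storageUri: "gs://my-bucket/pytorch/model"   # hypothetical location
```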

Code Reference

Source Location

Signature

apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: kserve-tritonserver
spec:
  annotations:
    prometheus.kserve.io/port: '8002'
    prometheus.kserve.io/path: "/metrics"
  supportedModelFormats:
    - name: tensorrt
      version: "8"
      autoSelect: true
      priority: 1
    - name: tensorflow
      version: "1"
      autoSelect: true
      priority: 1
    - name: tensorflow
      version: "2"
      autoSelect: true
      priority: 1
    - name: onnx
      version: "1"
      autoSelect: true
      priority: 1
    - name: pytorch
      version: "1"
    - name: triton
      version: "2"
      autoSelect: true
      priority: 1
  protocolVersions:
    - v2
    - grpc-v2
  containers:
    - name: kserve-container
      image: kserve-tritonserver:replace
      args:
        - tritonserver
        - --model-store=/mnt/models
        - --grpc-port=9000
        - --http-port=8080
        - --allow-grpc=true
        - --allow-http=true
      securityContext:
        runAsUser: 1000

Import

kubectl apply -f config/runtimes/kserve-tritonserver.yaml

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| Model artifacts | storage URI | Yes | Model files at /mnt/models (provided by KServe storage initializer) |

Outputs

| Name | Type | Description |
|------|------|-------------|
| ClusterServingRuntime | Custom Resource | Triton runtime available cluster-wide for tensorrt, tensorflow, onnx, pytorch, and triton models |
| HTTP inference endpoint | TCP port 8080 | V2 inference protocol HTTP endpoint |
| gRPC inference endpoint | TCP port 9000 | V2/grpc-v2 inference protocol gRPC endpoint |
| Prometheus metrics | HTTP port 8002 /metrics | Model serving and GPU metrics endpoint |
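
To reach the gRPC endpoint rather than the HTTP one, the InferenceService usually has to declare the gRPC port itself. The sketch below assumes the h2c port-naming convention that appears in KServe's Triton examples; the resource name and storageUri are hypothetical, and the exact fields should be checked against the KServe version in use.

```yaml
# Sketch only: exposes Triton's gRPC port (9000) instead of the default HTTP port.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: onnx-grpc                    # hypothetical name
spec:
  predictor:
    model:
      modelFormat:
        name: onnx
      protocolVersion: grpc-v2       # one of the runtime's declared protocolVersions
      storageUri: "gs://my-bucket/onnx/model"   # hypothetical location
      ports:
        - containerPort: 9000        # matches --grpc-port=9000 in the runtime args
          name: h2c                  # assumed naming convention for HTTP/2 cleartext (gRPC)
          protocol: TCP
```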

Usage Examples

Apply the runtime

kubectl apply -f config/runtimes/kserve-tritonserver.yaml

Create a TensorRT InferenceService

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: tensorrt-resnet
spec:
  predictor:
    model:
      modelFormat:
        name: tensorrt
      storageUri: "gs://my-bucket/tensorrt/resnet"
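
The example above does not request a GPU. For GPU-accelerated serving, a resource limit can be added to the model spec. This is a sketch under the assumption that the cluster runs the NVIDIA device plugin, which exposes the nvidia.com/gpu resource; the resource name and storageUri are placeholders.

```yaml
# Sketch only: same TensorRT service, with one GPU requested for the predictor.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: tensorrt-resnet-gpu          # hypothetical name
spec:
  predictor:
    model:
      modelFormat:
        name: tensorrt
      storageUri: "gs://my-bucket/tensorrt/resnet"
      resources:
        limits:
          nvidia.com/gpu: "1"        # assumes the NVIDIA device plugin is installed
```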
