Implementation:Kserve Kserve Triton MMS Perf Test

Knowledge Sources	Kserve_Kserve
Domains	Kubernetes, Model Serving, Performance Testing, Multi-Model Serving
Last Updated	2026-02-13 00:00 GMT

Overview

This file defines a large-scale performance test configuration using the Vegeta load testing tool to benchmark Triton multi-model serving throughput and latency.

Description

The file contains two Kubernetes resources: a Job that runs the Vegeta load tester and a ConfigMap that holds the Vegeta target definition and the request payload. The ConfigMap is mounted into the Vegeta container at /var/vegeta, so each of its keys (cfg and payload) appears there as a file. The Job runs a 3-minute Vegeta attack at 50 requests per second against the Triton multi-model serving endpoint, targeting the cifar10 model via the v2 inference protocol. The payload is a CIFAR-10 image (3x32x32 FP32 pixel values) formatted according to the KFServing v2 inference protocol: an inputs array whose entry carries name, shape, datatype and data fields. The file totals 3,323 lines, most of which are the embedded pixel data.
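The payload structure described above can be reproduced programmatically. The sketch below wraps a flat, row-major list of 3,072 FP32 values in a v2 inference request using the tensor name and shape from this file. It is illustrative only: the pixel values are synthetic random numbers rather than CIFAR-10 data, and it emits a flat data array (which the v2 protocol accepts) whereas the embedded payload appears to use a nested array.

```python
import json
import random
from typing import List

# Tensor name and shape copied from the ConfigMap payload in this file.
SHAPE = [1, 3, 32, 32]
INPUT_NAME = "INPUT__0"


def build_v2_payload(pixels: List[float]) -> dict:
    """Wrap a flat, row-major list of FP32 values in a v2 inference request."""
    expected = 1
    for dim in SHAPE:
        expected *= dim  # 1 * 3 * 32 * 32 = 3072 values
    if len(pixels) != expected:
        raise ValueError(f"expected {expected} values, got {len(pixels)}")
    return {
        "inputs": [{
            "name": INPUT_NAME,
            "shape": SHAPE,
            "datatype": "FP32",
            "data": pixels,
        }]
    }


if __name__ == "__main__":
    # Synthetic pixels for illustration only; the real file embeds CIFAR-10 data.
    rng = random.Random(0)
    fake_pixels = [rng.uniform(-1.0, 1.0) for _ in range(3 * 32 * 32)]
    print(json.dumps(build_v2_payload(fake_pixels))[:80])
```

Writing the result of `json.dumps(...)` to a file gives a body that Vegeta can send in place of the embedded payload.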

Usage

Use this configuration to benchmark multi-model serving performance on a KServe cluster with Triton Inference Server. The resulting throughput and latency metrics are the ones presented in the MMS benchmark documentation. Create the resources only after deploying a Triton multi-model serving setup that exposes the triton-mms service in the default namespace with the cifar10 model loaded, because the target URL is fixed in the ConfigMap.

Code Reference

Source Location

Repository: Kserve_Kserve
File: docs/samples/multimodelserving/triton/perf.yaml

Signature

apiVersion: batch/v1
kind: Job
metadata:
  generateName: torchscript-load-test
spec:
  backoffLimit: 6
  parallelism: 1
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
    spec:
      restartPolicy: OnFailure
      containers:
      - args:
        - vegeta -cpus=1 attack -duration=3m -rate=50/1s -targets=/var/vegeta/cfg
          | vegeta report -type=text
        command:
        - sh
        - -c
        image: peterevans/vegeta:latest
        name: vegeta
        volumeMounts:
        - mountPath: /var/vegeta
          name: vegeta-cfg
      volumes:
      - configMap:
          name: torchscript-vegeta-cfg
        name: vegeta-cfg
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: torchscript-vegeta-cfg  # omitted in source; derived from volume ref
data:
  cfg: |
    POST http://triton-mms.default.svc.cluster.local/v2/models/cifar10/infer
    @/var/vegeta/payload
  payload: |
    {
      "inputs": [{
        "name": "INPUT__0",
        "shape": [1, 3, 32, 32],
        "datatype": "FP32",
        "data": [[ ... ]]   # 3x32x32 float pixel values
      }]
    }
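The cfg key follows Vegeta's HTTP target format: a "METHOD URL" line, optional "Name: value" header lines, and an optional "@path" line naming a file whose contents become the request body. Because the ConfigMap is mounted at /var/vegeta, the payload key resolves to /var/vegeta/payload. The sketch below renders such a target; the helper name vegeta_target is hypothetical, and the optional Content-Type header it supports is not part of the original file.

```python
from typing import Dict, Optional


def vegeta_target(method: str, url: str, body_path: Optional[str] = None,
                  headers: Optional[Dict[str, str]] = None) -> str:
    """Render one target in Vegeta's HTTP text format (hypothetical helper)."""
    lines = [f"{method} {url}"]
    # Optional header lines go directly below the request line.
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    # "@path" tells Vegeta to read the request body from that file.
    if body_path is not None:
        lines.append(f"@{body_path}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Reproduces the cfg key of the ConfigMap in this file.
    print(vegeta_target(
        "POST",
        "http://triton-mms.default.svc.cluster.local/v2/models/cifar10/infer",
        body_path="/var/vegeta/payload",
    ), end="")
```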

Import

kubectl create -f docs/samples/multimodelserving/triton/perf.yaml

I/O Contract

Job Configuration

Parameter	Value	Description
Duration	3 minutes	Total attack duration
Rate	50 req/s	Request rate
CPUs	1	CPU cores for Vegeta
Report Type	text	Output format for results
Istio Sidecar	disabled	Avoids proxy interference with load testing
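At a constant rate, the number of requests Vegeta schedules is rate times duration: 50 req/s × 180 s = 9,000 requests. The Requests total in the report is expected to match or come very close to this figure. The helper below is a hypothetical sketch of that arithmetic.

```python
from datetime import timedelta


def planned_requests(rate_per_second: int, duration: timedelta) -> int:
    """Requests scheduled by a constant-rate attack (fractional part truncated)."""
    return int(rate_per_second * duration.total_seconds())


if __name__ == "__main__":
    # -rate=50/1s and -duration=3m from the Job's args.
    print(planned_requests(50, timedelta(minutes=3)))  # 9000
```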

Target Endpoint

Property	Value
Method	POST
URL	`http://triton-mms.default.svc.cluster.local/v2/models/cifar10/infer`
Protocol	KFServing v2 Inference Protocol
Model	cifar10

Request Payload

Field	Value	Description
`inputs[0].name`	`INPUT__0`	Input tensor name
`inputs[0].shape`	`[1, 3, 32, 32]`	Batch of 1 CIFAR-10 image
`inputs[0].datatype`	`FP32`	32-bit floating point
`inputs[0].data`	3x32x32 float array	Normalized pixel values
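The shape field fixes how many values data must hold: 1 × 3 × 32 × 32 = 3,072 FP32 values, or 12,288 bytes as raw tensor data. Written as JSON text the same values take far more space, which is why the pixel data dominates the file. The sketch below checks a v2 input tensor against its shape and reports the raw size; the byte-width table is a small illustrative subset of v2 datatypes, and the function names are hypothetical.

```python
from typing import Any, List

# Byte width per element for a few v2 protocol datatypes (illustrative subset).
DTYPE_BYTES = {"FP32": 4, "FP16": 2, "INT64": 8}


def flatten(data: Any) -> List[Any]:
    """Flatten nested lists; the v2 protocol allows nested or flat data."""
    if isinstance(data, list):
        out: List[Any] = []
        for item in data:
            out.extend(flatten(item))
        return out
    return [data]


def check_input(tensor: dict) -> int:
    """Validate one v2 input tensor and return its raw size in bytes."""
    count = 1
    for dim in tensor["shape"]:
        count *= dim
    values = flatten(tensor["data"])
    if len(values) != count:
        raise ValueError(
            f"{tensor['name']}: shape implies {count} values, got {len(values)}")
    return count * DTYPE_BYTES[tensor["datatype"]]
```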

Expected Output

Vegeta produces a text report with:

Metric	Description
Requests	Total number of requests sent
Rate	Actual request rate achieved
Throughput	Successful requests per second
Latencies	Mean, 50th, 95th and 99th percentile, and max latencies (recent releases also report min and 90th percentile)
Success	Percentage of successful requests
Status Codes	Distribution of HTTP status codes
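To compare runs, the text report can be turned into a dictionary. The sketch below assumes the layout printed by recent Vegeta releases, where each metric line has a bracketed list of column names followed by comma-separated values (for example Requests [total, rate, throughput]). Older releases print fewer columns, so the parser keys off the bracketed labels rather than fixed positions. The function names are hypothetical.

```python
import re
from typing import Dict

# Matches lines such as "Requests  [total, rate, throughput]  <v1>, <v2>, <v3>".
LINE_RE = re.compile(
    r"^(?P<metric>[A-Za-z][A-Za-z ]*?)\s+\[(?P<keys>[^\]]+)\]\s+(?P<values>.+)$")


def parse_vegeta_text_report(text: str) -> Dict[str, Dict[str, str]]:
    """Map each metric row to {column label: raw value string}."""
    report: Dict[str, Dict[str, str]] = {}
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # e.g. "Error Set:" and the error lines below it
        keys = [k.strip() for k in m.group("keys").split(",")]
        values = [v.strip() for v in m.group("values").split(",")]
        report[m.group("metric")] = dict(zip(keys, values))
    return report


def success_ratio(report: Dict[str, Dict[str, str]]) -> float:
    """Convert the Success row (e.g. "100.00%") to a fraction in [0, 1]."""
    return float(report["Success"]["ratio"].rstrip("%")) / 100.0
```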

Usage Examples

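# Deploy the Triton MMS setup first, then run the perf test.
# The Job uses generateName, which kubectl apply rejects, so use kubectl create.
kubectl create -f docs/samples/multimodelserving/triton/perf.yaml

# Monitor the job; its name is torchscript-load-test followed by a random suffix
kubectl get jobs | grep torchscript-load-test

# View the results (substitute the generated Job name)
kubectl logs job/<generated-job-name>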
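Because the Job name is generated, scripting the run means capturing that name first. The sketch below drives kubectl through subprocess: it creates the resources with -o name (which prints entries such as job.batch/<name>), waits for the Job to complete, and returns the Vegeta report from the logs. The function names and the 10-minute timeout are assumptions chosen for illustration.

```python
import subprocess


def find_job_name(create_output: str) -> str:
    """Pick the generated Job name out of `kubectl create -o name` output."""
    for line in create_output.splitlines():
        kind, _, name = line.strip().partition("/")
        if kind == "job.batch":
            return name
    raise ValueError("no Job found in kubectl output")


def run_perf_test(manifest: str, timeout: str = "10m") -> str:
    """Create the Job and ConfigMap, wait for the Job, and return its logs."""
    created = subprocess.run(
        ["kubectl", "create", "-f", manifest, "-o", "name"],
        check=True, capture_output=True, text=True,
    ).stdout
    job = find_job_name(created)
    # Block until the Job reports the Complete condition (or the timeout hits).
    subprocess.run(
        ["kubectl", "wait", "--for=condition=complete", f"job/{job}",
         f"--timeout={timeout}"],
        check=True,
    )
    return subprocess.run(
        ["kubectl", "logs", f"job/{job}"],
        check=True, capture_output=True, text=True,
    ).stdout
```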

Related Pages

Kserve_Kserve_Batcher_Sample_Input - Similar CIFAR-10 input data used for batcher testing
