Implementation:Triton inference server Server L0 Perf Simple Client
L0 Perf Simple Client
Source File: qa/L0_perf_pyclients/simple_perf_client.py
Language: Python (379 lines)
Domains: Testing, Performance
Purpose
This Python performance client measures synchronous inference throughput and latency against a Triton Inference Server model using either HTTP or gRPC protocols. It performs a configurable warmup phase followed by a measurement phase, reports throughput (inferences/second) and latency percentiles (p50, p90, p95, p99), and optionally writes results to a CSV file for downstream analysis.
Signature
# Key functions:
def parse_model_grpc(model_metadata, model_config) -> tuple:
    """Validate gRPC model metadata and extract (max_batch_size, input_name, output_name, dtype)."""

def parse_model_http(model_metadata, model_config) -> tuple:
    """Validate HTTP model metadata and extract (max_batch_size, input_name, output_name, dtype)."""

def requestGenerator(input_name, input_data, output_name, dtype, protocol) -> tuple:
    """Create InferInput and InferRequestedOutput objects for the given protocol."""
# Command-line interface:
# -v, --verbose Enable verbose output
# -m, --model-name Name of model (required)
# -x, --model-version Model version (default: latest)
# -b, --batch-size Batch size (default: 1)
# -s, --shape Tensor shape dimension (default: 1)
# -u, --url Server URL (default: localhost:8000)
# -i, --protocol HTTP or gRPC (default: HTTP)
# -c, --iteration_count Measurement iterations (default: 1000)
# -w, --warmup_count Warmup iterations (default: 500)
# --csv Output CSV file path
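The flags listed above could be declared with argparse roughly as follows. This is a minimal sketch: the flag names and defaults come from the listing, while the `build_parser` helper, the `type` choices, the empty-string default for the model version, and the example model name `my_model` are assumptions for illustration, not the script's actual code.

```python
import argparse


def build_parser():
    """Hypothetical helper that declares the flags from the listing above."""
    parser = argparse.ArgumentParser(description="Synchronous perf client (sketch)")
    parser.add_argument("-v", "--verbose", action="store_true", default=False,
                        help="Enable verbose output")
    parser.add_argument("-m", "--model-name", required=True, help="Name of model")
    # Assumption: an empty string stands for "latest version".
    parser.add_argument("-x", "--model-version", default="", help="Model version")
    parser.add_argument("-b", "--batch-size", type=int, default=1, help="Batch size")
    parser.add_argument("-s", "--shape", type=int, default=1,
                        help="Tensor shape dimension")
    parser.add_argument("-u", "--url", default="localhost:8000", help="Server URL")
    parser.add_argument("-i", "--protocol", default="HTTP", help="HTTP or gRPC")
    parser.add_argument("-c", "--iteration_count", type=int, default=1000,
                        help="Measurement iterations")
    parser.add_argument("-w", "--warmup_count", type=int, default=500,
                        help="Warmup iterations")
    parser.add_argument("--csv", default=None, help="Output CSV file path")
    return parser


# Example invocation against a gRPC endpoint (model name and URL are made up).
FLAGS = build_parser().parse_args(
    ["-m", "my_model", "-i", "grpc", "-u", "localhost:8001"])
print(FLAGS.model_name, FLAGS.protocol.lower(), FLAGS.iteration_count)
```

Because the default URL targets the HTTP port, a gRPC run needs `-u` to point at the server's gRPC endpoint as well as `-i grpc`.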
Key Components
Model Validation
Both parse_model_grpc and parse_model_http enforce strict model requirements:
- Exactly 1 input and 1 output
- Variable shape (-1) on the last dimension
- Correct number of dimensions based on whether the model supports batching
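def parse_model_grpc(model_metadata, model_config):
    # The client supports only single-input, single-output models.
    if len(model_metadata.inputs) != 1:
        raise Exception("expecting 1 input, got {}".format(len(model_metadata.inputs)))
    if len(model_metadata.outputs) != 1:
        raise Exception("expecting 1 output, got {}".format(len(model_metadata.outputs)))
    input_metadata = model_metadata.inputs[0]
    output_metadata = model_metadata.outputs[0]
    # Batching models carry one extra leading (batch) dimension.
    batch_dim = model_config.max_batch_size > 0
    expected_dims = 1 + (1 if batch_dim else 0)
    if len(input_metadata.shape) != expected_dims:
        raise Exception("expecting input to have {} dimensions, got {}".format(
            expected_dims, len(input_metadata.shape)))
    # The last dimension must be variable so any tensor length can be sent.
    if input_metadata.shape[-1] != -1:
        raise Exception("expecting input to have variable shape [-1]")
    return (model_config.max_batch_size, input_metadata.name,
            output_metadata.name, input_metadata.datatype)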
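The dimension rules above can be checked in isolation on plain shape lists. The sketch below is illustrative only: `shape_is_valid` is a hypothetical helper, not a function in the script, and the shapes are made-up examples.

```python
def shape_is_valid(shape, max_batch_size):
    """Hypothetical helper: applies the dimension-count and variable-shape rules."""
    # One data dimension, plus a leading batch dimension for batching models.
    expected_dims = 1 + (1 if max_batch_size > 0 else 0)
    return len(shape) == expected_dims and shape[-1] == -1


batching_ok = shape_is_valid([-1, -1], max_batch_size=8)      # batch dim + variable dim
non_batching_ok = shape_is_valid([-1], max_batch_size=0)      # variable dim only
fixed_last_dim = shape_is_valid([-1, 16], max_batch_size=8)   # last dim is not -1
missing_batch_dim = shape_is_valid([-1], max_batch_size=8)    # too few dimensions

print(batching_ok, non_batching_ok, fixed_last_dim, missing_batch_dim)
```

The first two shapes pass and the last two are rejected, which matches the cases where the parse functions would raise.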
Request Generation
Creates protocol-specific input and output objects. For HTTP, binary data mode is used for efficiency.
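def requestGenerator(input_name, input_data, output_name, dtype, protocol):
    inputs = []
    outputs = []
    if protocol.lower() == "grpc":
        inputs.append(grpcclient.InferInput(input_name, input_data.shape, dtype))
        inputs[0].set_data_from_numpy(input_data)
        outputs.append(grpcclient.InferRequestedOutput(output_name))
    else:
        inputs.append(httpclient.InferInput(input_name, input_data.shape, dtype))
        # Binary mode avoids encoding the tensor as JSON.
        inputs[0].set_data_from_numpy(input_data, binary_data=True)
        outputs.append(httpclient.InferRequestedOutput(output_name, binary_data=True))
    return inputs, outputs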
Warmup Phase
Executes warmup_count (default 500) inference iterations so that one-time costs such as JIT compilation, memory allocation, and cache population are paid before measurement begins. Warmup results are discarded.
for i in range(FLAGS.warmup_count):
    inputs, outputs = requestGenerator(input_name, input_data, output_name, dtype, FLAGS.protocol.lower())
    triton_client.infer(FLAGS.model_name, inputs, model_version=FLAGS.model_version, outputs=outputs)
Measurement Phase
Runs iteration_count (default 1000) synchronous inference requests, recording per-request latency. Input data is zero-filled with shape [batch_size, shape].
latencies = []
start_time = time.time()
for i in range(FLAGS.iteration_count):
    t0 = time.time()
    inputs, outputs = requestGenerator(...)
    triton_client.infer(FLAGS.model_name, inputs, ...)
    latencies.append(time.time() - t0)
end_time = time.time()
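The warmup-then-measure pattern can be tried without a running server by replacing the Triton call with a stand-in. In the sketch below, `fake_infer` and `run_benchmark` are hypothetical names, and `time.perf_counter()` is used instead of the `time.time()` calls shown above because it is monotonic; that is a substitution for illustration, not what the script does.

```python
import time


def fake_infer():
    """Hypothetical stand-in for triton_client.infer(); just sleeps briefly."""
    time.sleep(0.001)


def run_benchmark(infer_fn, warmup_count, iteration_count):
    # Warmup: timings are discarded so one-time costs do not skew the results.
    for _ in range(warmup_count):
        infer_fn()

    # Measurement: record wall-clock latency for every synchronous request.
    latencies = []
    start_time = time.perf_counter()
    for _ in range(iteration_count):
        t0 = time.perf_counter()
        infer_fn()
        latencies.append(time.perf_counter() - t0)
    end_time = time.perf_counter()

    # Throughput covers the whole loop, including per-iteration overhead.
    throughput = iteration_count / (end_time - start_time)
    return throughput, latencies


throughput, latencies = run_benchmark(fake_infer, warmup_count=5, iteration_count=20)
print("Throughput: {:.1f} infer/sec".format(throughput))
```

Since requests are issued one at a time, throughput here is bounded by the reciprocal of the mean per-request latency.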
Results Reporting
Computes and prints throughput and latency statistics:
throughput = FLAGS.iteration_count / (end_time - start_time)
average_latency = np.average(latencies) * 1000
p50_latency = np.percentile(latencies, 50) * 1000
p90_latency = np.percentile(latencies, 90) * 1000
p95_latency = np.percentile(latencies, 95) * 1000
p99_latency = np.percentile(latencies, 99) * 1000
Output format:
Throughput: {N} infer/sec
Latencies:
Avg: {N} ms
p50: {N} ms
p90: {N} ms
p95: {N} ms
p99: {N} ms
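To make the percentile figures concrete, the sketch below applies the linear-interpolation rule that `np.percentile` uses by default to a small hand-made latency list. The `percentile` helper is a hypothetical stdlib-only stand-in (the script itself calls numpy), and the sample latencies are invented.

```python
import math


def percentile(values, q):
    """Linear-interpolation percentile; hypothetical stand-in for np.percentile."""
    ordered = sorted(values)
    # Fractional rank into the sorted list, then interpolate between neighbours.
    rank = (len(ordered) - 1) * q / 100.0
    lo = math.floor(rank)
    hi = math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


# Made-up per-request latencies in seconds, with one slow outlier.
latencies = [0.010, 0.012, 0.011, 0.030, 0.013]

avg_ms = sum(latencies) / len(latencies) * 1000
p50_ms = percentile(latencies, 50) * 1000
p90_ms = percentile(latencies, 90) * 1000

print("Avg: {:.1f} ms".format(avg_ms))
print("p50: {:.1f} ms".format(p50_ms))
print("p90: {:.1f} ms".format(p90_ms))
```

With this sample the average is 15.2 ms, p50 is 12.0 ms, and p90 is 23.2 ms: the single 30 ms outlier pulls the average and the upper percentile up while leaving the median untouched, which is why the client reports percentiles alongside the mean.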
CSV Export
When --csv is specified, writes a single-row CSV with columns: Concurrency, Inferences/Second, p50 latency, p90 latency, p95 latency, p99 latency. Latency values in the CSV are in microseconds (multiplied by 1000 from millisecond values).
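A single-row CSV of that shape could be produced with the standard `csv` module as sketched below. The column names follow the description above; the `write_csv_row` helper, the default concurrency value of 1 (assumed because the client is synchronous), and the sample figures are assumptions for illustration rather than the script's own code.

```python
import csv
import io

COLUMNS = ["Concurrency", "Inferences/Second", "p50 latency",
           "p90 latency", "p95 latency", "p99 latency"]


def write_csv_row(stream, throughput, latencies_ms, concurrency=1):
    """Hypothetical helper: writes the header and one result row."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(COLUMNS)
    # Latencies arrive in milliseconds and are written in microseconds.
    latencies_us = [ms * 1000 for ms in latencies_ms]
    writer.writerow([concurrency, throughput] + latencies_us)


# Invented figures: throughput plus p50/p90/p95/p99 in milliseconds.
buffer = io.StringIO()
write_csv_row(buffer, 820.5, [1.0, 1.5, 2.0, 4.0])
csv_text = buffer.getvalue()
print(csv_text)
```

Writing to an `io.StringIO` keeps the example side-effect free; passing a file opened with `open(path, "w", newline="")` would write to disk instead.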
Test Flow
- Parse command-line arguments
- Create HTTP or gRPC Triton client
- Retrieve model metadata and config; validate model structure
- Generate zero-filled input data with specified batch size and shape
- Execute warmup iterations (default: 500)
- Execute measurement iterations (default: 1000) with per-request timing
- Compute throughput and latency percentiles
- Print results and optionally write CSV
Dependencies
- tritonclient.grpc / tritonclient.http - Triton client libraries
- numpy - Array operations and percentile calculations
- tritonclient.utils - InferenceServerException, triton_to_np_dtype