Implementation:Triton inference server Server L0 Perf Simple Client

L0 Perf Simple Client

Source File: qa/L0_perf_pyclients/simple_perf_client.py
Language: Python (379 lines)
Domains: Testing, Performance

Purpose

This Python performance client measures synchronous inference throughput and latency against a Triton Inference Server model over either HTTP or gRPC. It performs a configurable warmup phase followed by a measurement phase, reports throughput (inferences/second) along with average latency and latency percentiles (p50, p90, p95, p99), and optionally writes results to a CSV file for downstream analysis.

Signature

# Key functions:
def parse_model_grpc(model_metadata, model_config) -> tuple:
    """Validate gRPC model metadata and extract (max_batch_size, input_name, output_name, dtype)."""

def parse_model_http(model_metadata, model_config) -> tuple:
    """Validate HTTP model metadata and extract (max_batch_size, input_name, output_name, dtype)."""

def requestGenerator(input_name, input_data, output_name, dtype, protocol) -> tuple:
    """Create InferInput and InferRequestedOutput objects for the given protocol."""

# Command-line interface:
#   -v, --verbose           Enable verbose output
#   -m, --model-name        Name of model (required)
#   -x, --model-version     Model version (default: latest)
#   -b, --batch-size        Batch size (default: 1)
#   -s, --shape             Tensor shape dimension (default: 1)
#   -u, --url               Server URL (default: localhost:8000)
#   -i, --protocol          HTTP or gRPC (default: HTTP)
#   -c, --iteration_count   Measurement iterations (default: 1000)
#   -w, --warmup_count      Warmup iterations (default: 500)
#   --csv                   Output CSV file path
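
The option list above maps naturally onto argparse. The following is a minimal sketch of such a parser, not the original file's code: the build_parser helper, the help strings, and the use of an empty string to stand for "latest version" are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (illustrative sketch)."""
    parser = argparse.ArgumentParser(description="Simple synchronous perf client (sketch)")
    parser.add_argument("-v", "--verbose", action="store_true", help="Enable verbose output")
    parser.add_argument("-m", "--model-name", required=True, help="Name of model")
    # Assumption: an empty string is taken to mean "use the latest version".
    parser.add_argument("-x", "--model-version", default="", help="Model version")
    parser.add_argument("-b", "--batch-size", type=int, default=1, help="Batch size")
    parser.add_argument("-s", "--shape", type=int, default=1, help="Tensor shape dimension")
    parser.add_argument("-u", "--url", default="localhost:8000", help="Server URL")
    parser.add_argument("-i", "--protocol", default="HTTP", help="HTTP or gRPC")
    parser.add_argument("-c", "--iteration_count", type=int, default=1000,
                        help="Measurement iterations")
    parser.add_argument("-w", "--warmup_count", type=int, default=500,
                        help="Warmup iterations")
    parser.add_argument("--csv", default=None, help="Output CSV file path")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
FLAGS = build_parser().parse_args(["-m", "my_model", "-i", "grpc", "-b", "8"])
print(FLAGS.model_name, FLAGS.protocol.lower(), FLAGS.batch_size, FLAGS.warmup_count)
```

argparse turns the dashes in --model-name and --batch-size into underscores, which is why the snippets on this page read FLAGS.model_name and FLAGS.batch_size.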

Key Components

Model Validation

Both parse_model_grpc and parse_model_http enforce strict model requirements:

Exactly 1 input and 1 output
Variable shape (-1) on the last dimension
Correct number of dimensions based on whether the model supports batching

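def parse_model_grpc(model_metadata, model_config):
    if len(model_metadata.inputs) != 1:
        raise Exception("expecting 1 input, got {}".format(len(model_metadata.inputs)))
    if len(model_metadata.outputs) != 1:
        raise Exception("expecting 1 output, got {}".format(len(model_metadata.outputs)))

    input_metadata = model_metadata.inputs[0]
    output_metadata = model_metadata.outputs[0]

    # A batching model carries one extra leading (batch) dimension.
    batch_dim = model_config.max_batch_size > 0
    expected_dims = 1 + (1 if batch_dim else 0)
    if len(input_metadata.shape) != expected_dims:
        raise Exception("expecting input to have {} dimensions, got {}".format(
            expected_dims, len(input_metadata.shape)))
    if input_metadata.shape[-1] != -1:
        raise Exception("expecting input to have variable shape [-1]")

    return (model_config.max_batch_size, input_metadata.name,
            output_metadata.name, input_metadata.datatype)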
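
To see how these rules behave without a running server, the checks can be exercised against stand-in objects. The sketch below is an illustration only: check_single_io and fake_model are hypothetical names, the tensor names INPUT0 and OUTPUT0 are made up, and SimpleNamespace merely imitates the attributes (inputs, outputs, shape, max_batch_size) that the validation reads.

```python
from types import SimpleNamespace


def check_single_io(model_metadata, model_config):
    """Compact version of the validation rules described above (illustrative)."""
    if len(model_metadata.inputs) != 1:
        raise ValueError("expecting 1 input, got {}".format(len(model_metadata.inputs)))
    if len(model_metadata.outputs) != 1:
        raise ValueError("expecting 1 output, got {}".format(len(model_metadata.outputs)))
    input_metadata = model_metadata.inputs[0]
    output_metadata = model_metadata.outputs[0]
    # One variable-length data dimension, plus a batch dimension when batching is on.
    expected_dims = 1 + (1 if model_config.max_batch_size > 0 else 0)
    if len(input_metadata.shape) != expected_dims:
        raise ValueError("expecting {} dimensions, got {}".format(
            expected_dims, len(input_metadata.shape)))
    if input_metadata.shape[-1] != -1:
        raise ValueError("expecting input to have variable shape [-1]")
    return (model_config.max_batch_size, input_metadata.name,
            output_metadata.name, input_metadata.datatype)


def fake_model(shape, max_batch_size):
    """Hypothetical stand-in for the metadata/config pair a Triton client returns."""
    tensor_in = SimpleNamespace(name="INPUT0", shape=shape, datatype="FP32")
    tensor_out = SimpleNamespace(name="OUTPUT0", shape=shape, datatype="FP32")
    metadata = SimpleNamespace(inputs=[tensor_in], outputs=[tensor_out])
    config = SimpleNamespace(max_batch_size=max_batch_size)
    return metadata, config


# Batching model with a variable last dimension: accepted.
accepted = check_single_io(*fake_model([-1, -1], max_batch_size=8))
print(accepted)

# Fixed last dimension: rejected.
try:
    check_single_io(*fake_model([-1, 16], max_batch_size=8))
    rejected = None
except ValueError as err:
    rejected = str(err)
print(rejected)
```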

Request Generation

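requestGenerator creates protocol-specific input and output objects. For HTTP, binary data mode is used for efficiency.

import tritonclient.grpc as grpcclient
import tritonclient.http as httpclient

def requestGenerator(input_name, input_data, output_name, dtype, protocol):
    inputs = []
    outputs = []
    if protocol.lower() == "grpc":
        inputs.append(grpcclient.InferInput(input_name, input_data.shape, dtype))
        inputs[0].set_data_from_numpy(input_data)
        outputs.append(grpcclient.InferRequestedOutput(output_name))
    else:
        inputs.append(httpclient.InferInput(input_name, input_data.shape, dtype))
        inputs[0].set_data_from_numpy(input_data, binary_data=True)
        outputs.append(httpclient.InferRequestedOutput(output_name, binary_data=True))
    return inputs, outputs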

Warmup Phase

Executes warmup_count (default 500) inference iterations before measurement begins, so that one-time costs such as JIT compilation, memory allocation, and cache population do not distort the measured latencies.

for i in range(FLAGS.warmup_count):
    inputs, outputs = requestGenerator(input_name, input_data, output_name, dtype, FLAGS.protocol.lower())
    triton_client.infer(FLAGS.model_name, inputs, model_version=FLAGS.model_version, outputs=outputs)

Measurement Phase

Runs iteration_count (default 1000) synchronous inference requests, recording per-request latency. Input data is zero-filled with shape [batch_size, shape].

latencies = []
start_time = time.time()
for i in range(FLAGS.iteration_count):
    t0 = time.time()
    inputs, outputs = requestGenerator(...)
    triton_client.infer(FLAGS.model_name, inputs, ...)
    latencies.append(time.time() - t0)
end_time = time.time()
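
The warmup-then-measure pattern can be tried without a server by swapping the inference call for a stub. In this sketch fake_infer and run_phase are hypothetical helpers, the stub simply sleeps, and time.perf_counter is used in place of time.time because it is a monotonic, higher-resolution clock; the original loop above uses time.time.

```python
import time


def fake_infer(delay_s: float = 0.001) -> None:
    """Hypothetical stand-in for triton_client.infer(); it only sleeps."""
    time.sleep(delay_s)


def run_phase(call, count: int):
    """Invoke `call` `count` times.

    Returns (per-call latencies in seconds, total elapsed seconds).
    """
    latencies = []
    start = time.perf_counter()
    for _ in range(count):
        t0 = time.perf_counter()
        call()
        latencies.append(time.perf_counter() - t0)
    return latencies, time.perf_counter() - start


# Warmup: same work as the measurement phase, but the timings are discarded.
run_phase(fake_infer, 5)

# Measurement: keep per-request latencies and the overall wall-clock time.
latencies, elapsed = run_phase(fake_infer, 20)
print("requests:", len(latencies))
print("throughput: {:.1f} infer/sec".format(len(latencies) / elapsed))
```

Because each request is issued only after the previous one returns, the total elapsed time is at least the sum of the individual latencies, and throughput is roughly the inverse of the mean latency.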

Results Reporting

Computes and prints throughput and latency statistics:

throughput = FLAGS.iteration_count / (end_time - start_time)
average_latency = np.average(latencies) * 1000
p50_latency = np.percentile(latencies, 50) * 1000
p90_latency = np.percentile(latencies, 90) * 1000
p95_latency = np.percentile(latencies, 95) * 1000
p99_latency = np.percentile(latencies, 99) * 1000

Output format:

Throughput: {N} infer/sec
Latencies:
    Avg: {N} ms
    p50: {N} ms
    p90: {N} ms
    p95: {N} ms
    p99: {N} ms
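
As a worked example of these statistics, the sketch below recomputes them on a handful of made-up latencies. It deliberately avoids numpy so that it runs anywhere: the percentile helper is a hypothetical pure-Python substitute that applies the linear interpolation rule np.percentile uses by default, and both the latency values and the elapsed time are invented for illustration.

```python
import math


def percentile(values, q):
    """Linearly interpolated percentile (the default rule of np.percentile)."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# Made-up per-request latencies in seconds and a made-up total wall-clock time.
latencies = [0.010, 0.012, 0.011, 0.030, 0.013]
elapsed = 0.080

throughput = len(latencies) / elapsed
average_ms = sum(latencies) / len(latencies) * 1000
p50_ms = percentile(latencies, 50) * 1000
p90_ms = percentile(latencies, 90) * 1000

print("Throughput: {:.1f} infer/sec".format(throughput))
print("Latencies:")
print("    Avg: {:.2f} ms".format(average_ms))
print("    p50: {:.2f} ms".format(p50_ms))
print("    p90: {:.2f} ms".format(p90_ms))
```

With five samples, the p90 rank falls between the fourth and fifth sorted values, so a single slow request (30 ms here) pulls the upper percentiles up sharply while leaving the median untouched.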

CSV Export

When --csv is specified, the client writes a CSV file with a header row and a single data row. The columns are Concurrency, Inferences/Second, p50 latency, p90 latency, p95 latency, and p99 latency. Latency values in the CSV are in microseconds, obtained by multiplying the millisecond values by 1000.
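
A minimal sketch of this export step using the standard csv module is shown here. The write_csv helper is a hypothetical name, the numbers are made up, and writing 1 in the Concurrency column is an assumption based on the client sending one request at a time.

```python
import csv
import io


def write_csv(stream, throughput, percentiles_ms):
    """Write the header and one result row; latencies go from ms to microseconds."""
    writer = csv.writer(stream)
    writer.writerow(["Concurrency", "Inferences/Second",
                     "p50 latency", "p90 latency", "p95 latency", "p99 latency"])
    # Assumption: concurrency is 1 because requests are issued synchronously.
    writer.writerow([1, throughput] + [ms * 1000 for ms in percentiles_ms])


# Made-up results: throughput plus p50/p90/p95/p99 in milliseconds.
buffer = io.StringIO()
write_csv(buffer, 62.5, [12.5, 23.0, 26.5, 29.25])
print(buffer.getvalue())
```

Passing an open file object instead of the StringIO buffer (for example one opened with newline="") would write the same two rows to disk.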

Test Flow

Parse command-line arguments
Create HTTP or gRPC Triton client
Retrieve model metadata and config; validate model structure
Generate zero-filled input data with specified batch size and shape
Execute warmup iterations (default: 500)
Execute measurement iterations (default: 1000) with per-request timing
Compute throughput and latency percentiles
Print results and optionally write CSV

Dependencies

tritonclient.grpc / tritonclient.http - Triton client libraries
numpy - Array operations and percentile calculations
tritonclient.utils - InferenceServerException, triton_to_np_dtype
