Principle:Triton inference server Server QA Test Model Generation

Overview

QA Test Model Generation is the automated creation of test models across all supported backend formats (TensorRT, ONNX Runtime, TensorFlow SavedModel, PyTorch LibTorch, and OpenVINO) for comprehensive QA coverage of Triton Inference Server. The principle is implemented by eleven distinct model generation utilities, each targeting a specific model topology or behavioral pattern: standard QA models, sequence models, dynamic sequence models, dynamic sequence models with implicit state, identity passthrough models, implicit state models, ragged input models, reshape models, TensorRT format-specific models, TensorRT plugin models, and ensemble model assembly utilities. Together, these generators produce the thousands of model variants that form the foundation of Triton's regression test suite.

Theoretical Basis

Adequate test coverage of an inference server requires a combinatorially large set of test models. The key dimensions that must be covered include:

Backend format diversity: Each backend runtime (TensorRT, ONNX Runtime, TensorFlow, LibTorch, OpenVINO) has its own model serialization format, memory layout conventions, and execution semantics. A bug in Triton's TensorRT backend would not be caught by ONNX tests, and vice versa. The model generators produce equivalent models in every supported format, so the same test logic can validate all backends against identical expected results.

Data type coverage: Inference models operate on a range of numeric types (float32, float16, bfloat16, int8, int16, int32, int64, bool, string). Each data type exercises different code paths in tensor serialization, memory allocation, type conversion, and backend dispatch. The generators parameterize data types and produce model variants for each, using gen_common.py utilities like np_to_model_dtype, np_to_onnx_dtype, np_to_torch_dtype, and np_to_trt_dtype for cross-format type mapping.
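The cross-format mapping can be pictured as a set of lookup tables keyed by data type. The following is a minimal sketch, not the actual gen_common.py code: the table and function names are hypothetical, the keys are plain dtype names rather than NumPy dtypes, and only a subset of types is shown. The TYPE_* strings are Triton model-configuration data type names, and the integers are values of the ONNX TensorProto.DataType enum.

```python
# Hypothetical lookup tables; the real gen_common.py helpers accept NumPy
# dtypes and may cover a different set of types.
CONFIG_DTYPES = {
    "bool": "TYPE_BOOL",
    "int8": "TYPE_INT8",
    "int16": "TYPE_INT16",
    "int32": "TYPE_INT32",
    "int64": "TYPE_INT64",
    "float16": "TYPE_FP16",
    "float32": "TYPE_FP32",
    "string": "TYPE_STRING",
}

# Values taken from the ONNX TensorProto.DataType enum.
ONNX_DTYPES = {
    "float32": 1,
    "int8": 3,
    "int16": 5,
    "int32": 6,
    "int64": 7,
    "string": 8,
    "bool": 9,
    "float16": 10,
}


def to_config_dtype(name):
    """Return the config.pbtxt data type string, or None if unmapped."""
    return CONFIG_DTYPES.get(name)


def to_onnx_dtype(name):
    """Return the ONNX element type code, or None if unmapped."""
    return ONNX_DTYPES.get(name)


def supported_everywhere(name):
    """True only when every target format in this sketch maps the dtype."""
    return to_config_dtype(name) is not None and to_onnx_dtype(name) is not None
```

Returning None for an unmapped type lets a generator skip a variant for a format that cannot represent it instead of failing the whole run.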

Batching mode coverage: Models must be tested with both static batching (fixed max_batch_size) and no-batch configurations. The generators produce both batch and nobatch variants, and some generators additionally produce dynamic batching configurations with various preferred batch sizes.
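The difference between the two variants comes down to the max_batch_size field of config.pbtxt. The sketch below renders both variants for a single-input, single-output model; the function names, the "_nobatch" naming, and the default batch size of 8 are illustrative assumptions, and the rendered text is only a fragment of a full configuration (it omits the backend or platform field).

```python
def render_config(name, dims, max_batch_size):
    """Render a minimal config.pbtxt fragment for a one-input, one-output model.

    max_batch_size == 0 marks a no-batch model, so dims describe the full
    tensor shape. A positive value means the batch dimension is implicit.
    """
    dims_text = ", ".join(str(d) for d in dims)
    return (
        f'name: "{name}"\n'
        f"max_batch_size: {max_batch_size}\n"
        f'input [\n  {{ name: "INPUT0" data_type: TYPE_FP32 dims: [ {dims_text} ] }}\n]\n'
        f'output [\n  {{ name: "OUTPUT0" data_type: TYPE_FP32 dims: [ {dims_text} ] }}\n]\n'
    )


def render_variants(base_name, dims, max_batch_size=8):
    """Return {model_name: config_text} for the batch and nobatch variants."""
    nobatch_name = f"{base_name}_nobatch"
    return {
        base_name: render_config(base_name, dims, max_batch_size),
        nobatch_name: render_config(nobatch_name, dims, 0),
    }


# Hypothetical model name used only for illustration.
variants = render_variants("example_float32", [16])
```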

Input/output topology: Standard QA models implement simple arithmetic operations (add, subtract) with two inputs and two outputs, providing a deterministic relationship between inputs and outputs that enables automated correctness validation. Identity models pass inputs through unchanged, enabling tests to isolate server-level behavior (batching, scheduling, memory management) from model computation. Reshape models test tensor shape transformation across the server boundary.
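Because the arithmetic is deterministic, a test can compute the expected outputs independently of any backend. A minimal reference sketch follows; it assumes the first output carries the element-wise sum and the second the element-wise difference, which may not match the ordering the actual generators use, and the function names are hypothetical.

```python
def expected_outputs(input0, input1):
    """Reference result for a two-input, two-output add/sub model.

    Assumption: the first output is the element-wise sum and the second the
    element-wise difference; the actual generators may order them differently.
    """
    if len(input0) != len(input1):
        raise ValueError("inputs must have the same length")
    out_sum = [a + b for a, b in zip(input0, input1)]
    out_diff = [a - b for a, b in zip(input0, input1)]
    return out_sum, out_diff


def identity_expected(values):
    """Reference result for an identity model: the input, unchanged."""
    return list(values)
```

A test would compare these reference values against the tensors returned by the server for the same inputs, regardless of which backend served the request.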

Sequence model semantics: Sequence models maintain per-sequence state across requests, requiring control inputs (START, END, READY, CORR_ID) to manage the sequence lifecycle. The sequence model generators produce models that implement accumulator logic, enabling tests to verify that sequence state is correctly initialized, updated, and finalized. Dynamic sequence models extend this with runtime correlation ID assignment. Implicit state models use Triton's implicit state management rather than explicit control inputs, testing a different state propagation path.
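The accumulator behavior can be illustrated with a small reference model of a single batch slot. This is a sketch under simplifying assumptions, not the generated model itself: the class and function names are hypothetical, only the START and READY controls are modeled, and END and CORR_ID handling is left out.

```python
class AccumulatorSlot:
    """Per-sequence state for one batch slot (hypothetical reference model)."""

    def __init__(self):
        self.total = 0

    def step(self, value, start=False, ready=True):
        # A slot that is not READY holds no live request; leave state untouched.
        if not ready:
            return None
        # START resets the accumulator so state from an earlier sequence
        # cannot leak into the new one.
        if start:
            self.total = 0
        self.total += value
        return self.total


def run_sequence(values):
    """Feed one sequence through a slot, flagging the first request as START."""
    slot = AccumulatorSlot()
    return [slot.step(v, start=(i == 0)) for i, v in enumerate(values)]
```

Under these assumptions, each response is the running sum of the inputs seen so far in the sequence, which is the property a sequence test checks after every request.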

Ragged tensor support: Ragged models test variable-length input batching, where different elements in a batch have different sequence lengths. This exercises Triton's ragged batching infrastructure, which must correctly handle per-element offsets and lengths.
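The offset and length bookkeeping can be sketched in plain Python. The helper names below are hypothetical and the code only illustrates the idea of packing variable-length rows into one flat buffer; it does not reproduce how Triton lays out ragged batches internally.

```python
def pack_ragged(batch):
    """Concatenate variable-length rows and record where each row lives.

    Returns (flat_values, offsets, lengths); row i is
    flat_values[offsets[i] : offsets[i] + lengths[i]].
    """
    flat, offsets, lengths = [], [], []
    for row in batch:
        offsets.append(len(flat))
        lengths.append(len(row))
        flat.extend(row)
    return flat, offsets, lengths


def unpack_ragged(flat, offsets, lengths):
    """Recover the original rows from the flat buffer."""
    return [flat[o:o + n] for o, n in zip(offsets, lengths)]
```

A round trip through pack_ragged and unpack_ragged must return the original rows; an off-by-one error in either the offsets or the lengths shows up as a mismatch.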

TensorRT-specific features: The TRT format and plugin generators test TensorRT-specific capabilities including custom plugins, tensor format constraints (the default linear layout vs. vectorized channel layouts), and dynamic shape profiles. These generators must interact with the TensorRT builder API to construct optimized engine files.

Ensemble assembly: The ensemble model utilities (gen_ensemble_model_utils.py) do not generate individual backend models; instead, they create ensemble model configurations that wire together models produced by the other generators. These utilities support simple (linear pipeline), sequence (stateful pipeline), and fan (parallel branch) ensemble topologies, and validate that the composing models are compatible with the ensemble's declared input/output signatures.
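The simple (linear pipeline) topology can be sketched as a chain in which each step's output tensor becomes the next step's input. The code below is an illustration only: the function names, the "temp_N" intermediate tensor names, and the assumption that every composing model exposes a single INPUT0 and OUTPUT0 are all hypothetical. The rendered text follows the ensemble_scheduling layout of Triton's model configuration.

```python
def simple_pipeline_steps(model_names, ensemble_input, ensemble_output):
    """Wire models into a linear chain: each step's output feeds the next.

    Assumption: every composing model has one input "INPUT0" and one output
    "OUTPUT0"; intermediate tensors get generated names.
    """
    steps = []
    source = ensemble_input
    for index, model in enumerate(model_names):
        is_last = index == len(model_names) - 1
        target = ensemble_output if is_last else f"temp_{index}"
        steps.append({
            "model_name": model,
            "input_map": {"INPUT0": source},
            "output_map": {"OUTPUT0": target},
        })
        source = target
    return steps


def render_scheduling(steps):
    """Render the steps as an ensemble_scheduling block of config.pbtxt."""
    lines = ["ensemble_scheduling {", "  step ["]
    for i, step in enumerate(steps):
        in_key, in_val = next(iter(step["input_map"].items()))
        out_key, out_val = next(iter(step["output_map"].items()))
        separator = "," if i < len(steps) - 1 else ""
        lines += [
            "    {",
            f'      model_name: "{step["model_name"]}"',
            "      model_version: -1",
            f'      input_map {{ key: "{in_key}" value: "{in_val}" }}',
            f'      output_map {{ key: "{out_key}" value: "{out_val}" }}',
            "    }" + separator,
        ]
    lines += ["  ]", "}"]
    return "\n".join(lines)
```

Sequence and fan topologies would differ mainly in how the input_map and output_map entries are assigned; the chaining logic here covers only the linear case.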

Implementation Details

All generators follow a common pattern: they accept command-line arguments specifying the output directory, target backends, and model configuration parameters, then programmatically construct model files and config.pbtxt configuration files. For TensorRT models, this involves building TRT engines via the TensorRT Python API. For ONNX models, the computational graph is assembled node by node with the onnx Python API, with NumPy dtypes describing the tensor element types. For LibTorch models, torch.jit.script or torch.jit.trace produces serialized TorchScript files. For OpenVINO models, the openvino_save_model utility from gen_common.py handles IR serialization.
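The common pattern can be sketched as a small command-line tool that writes a model repository layout. This is a toy under stated assumptions, not one of the actual generators: the flag names, the "_example" model naming, and the helper functions are hypothetical, and the step that serializes a real model file is replaced by a comment.

```python
import argparse
import os
import tempfile


def parse_args(argv=None):
    # Flag names are illustrative, not the generators' actual options.
    parser = argparse.ArgumentParser(description="Toy QA model generator")
    parser.add_argument("--models-dir", required=True, help="output repository root")
    parser.add_argument("--backend", action="append", default=[],
                        help="backend to generate for (repeatable)")
    parser.add_argument("--max-batch", type=int, default=8)
    return parser.parse_args(argv)


def write_model(models_dir, name, backend, max_batch, version=1):
    """Create <models_dir>/<name>/config.pbtxt and an empty version directory."""
    model_dir = os.path.join(models_dir, name)
    os.makedirs(os.path.join(model_dir, str(version)), exist_ok=True)
    config_path = os.path.join(model_dir, "config.pbtxt")
    with open(config_path, "w") as f:
        f.write(f'name: "{name}"\nbackend: "{backend}"\nmax_batch_size: {max_batch}\n')
    # A real generator would also serialize the model file into the version
    # directory (for example an ONNX graph or a TensorRT engine).
    return config_path


def main(argv=None):
    args = parse_args(argv)
    return [write_model(args.models_dir, f"{b}_example", b, args.max_batch)
            for b in args.backend]


# Example run against a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    written = main(["--models-dir", root, "--backend", "onnxruntime"])
    with open(written[0]) as f:
        config_text = f.read()
```

Keeping the output directory a parameter is what allows the same script to populate a versioned data directory in CI and a scratch directory during local debugging.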

The generators are invoked by CI pipeline scripts before test execution, populating versioned data directories (e.g., /data/inferenceserver/${REPO_VERSION}/qa_model_repository/) that the test scripts then reference. This separation between model generation and test execution allows models to be generated once and reused across multiple test suites.

Generator	Model Type	Primary Test Coverage
GenQaModels	Standard arithmetic (add/sub)	Basic inference correctness across all backends and dtypes
GenQaSequenceModels	Stateful sequence accumulator	Sequence batching lifecycle and state management
GenQaDynaSequenceModels	Dynamic correlation ID sequences	Runtime sequence slot assignment and correlation
GenQaDynaSequenceImplicitModels	Dynamic sequences with implicit state	Implicit state management with dynamic correlation
GenQaIdentityModels	Identity passthrough	Server-level batching, scheduling, and memory tests
GenQaImplicitModels	Implicit state models	Triton-managed state persistence across requests
GenQaRaggedModels	Variable-length batched inputs	Ragged batching offset and length handling
GenQaReshapeModels	Tensor shape transformation	Shape propagation and reshape correctness
GenQaTrtFormatModels	TensorRT format constraints	TRT memory layout and format-specific execution
GenQaTrtPluginModels	TensorRT custom plugins	Plugin registration, execution, and serialization
GenEnsembleModelUtils	Ensemble DAG configurations	Pipeline wiring, type validation, and topology assembly
