Principle:Triton inference server Server QA Test Model Generation

From Leeroopedia


Overview

QA Test Model Generation is the automated creation of test models across all supported backend formats (TensorRT, ONNX Runtime, TensorFlow SavedModel, PyTorch LibTorch, and OpenVINO) for comprehensive QA coverage of Triton Inference Server. The principle is implemented by eleven distinct model generation utilities, each targeting a specific model topology or behavioral pattern: standard QA models, sequence models, dynamic sequence models, dynamic sequence models with implicit state, identity passthrough models, implicit state models, ragged input models, reshape models, TensorRT format-specific models, TensorRT plugin models, and ensemble model assembly utilities. Together, these generators produce the thousands of model variants that form the foundation of Triton's regression test suite.

Theoretical Basis

Inference server testing requires a combinatorially large set of test models to achieve adequate coverage, because every dimension below multiplies the number of variants. The key dimensions that must be covered include:

Backend format diversity: Each backend runtime (TensorRT, ONNX Runtime, LibTorch, OpenVINO) has its own model serialization format, memory layout conventions, and execution semantics. A bug in Triton's TensorRT backend would not be caught by ONNX tests, and vice versa. The model generators produce equivalent models in every supported format, enabling the same test logic to validate all backends with identical expected results.

Data type coverage: Inference models operate on a range of numeric types (float32, float16, bfloat16, int8, int16, int32, int64, bool, string). Each data type exercises different code paths in tensor serialization, memory allocation, type conversion, and backend dispatch. The generators parameterize data types and produce model variants for each, using gen_common.py utilities like np_to_model_dtype, np_to_onnx_dtype, np_to_torch_dtype, and np_to_trt_dtype for cross-format type mapping.
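The cross-format mapping idea can be sketched as a simple lookup table. The snippet below is a simplified stand-in keyed by dtype name strings, not the actual gen_common.py code; the function name dtype_name_to_config_type is hypothetical. The TYPE_* values are the data type names used in Triton model configuration files.

```python
# Hypothetical sketch: map dtype names to Triton config.pbtxt type strings.
# This is NOT the gen_common.py implementation, only an illustration.
_CONFIG_TYPES = {
    "bool": "TYPE_BOOL",
    "int8": "TYPE_INT8",
    "int16": "TYPE_INT16",
    "int32": "TYPE_INT32",
    "int64": "TYPE_INT64",
    "float16": "TYPE_FP16",
    "bfloat16": "TYPE_BF16",
    "float32": "TYPE_FP32",
    "string": "TYPE_STRING",
}


def dtype_name_to_config_type(name):
    """Return the config type string for a dtype name, or None if unsupported."""
    return _CONFIG_TYPES.get(name)


# A generator could skip any dtype that has no mapping for a given format.
supported = [n for n in ("float32", "int8", "complex64")
             if dtype_name_to_config_type(n) is not None]
```

Returning None for an unmapped type lets a generator skip combinations that a given backend cannot represent instead of failing outright.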

Batching mode coverage: Models must be tested with both static batching (fixed max_batch_size) and no-batch configurations. The generators produce both batch and nobatch variants, and some generators additionally produce dynamic batching configurations with various preferred batch sizes.
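The difference between batch and nobatch variants comes down to the max_batch_size field in config.pbtxt, where a value of 0 disables batching. A minimal sketch of rendering both variants follows; the helper names, the tensor names, and the model names are illustrative assumptions rather than the generators' actual code.

```python
def render_tensor(name, dtype, dims):
    """Render one input/output entry in config.pbtxt text format."""
    dims_text = ", ".join(str(d) for d in dims)
    return (
        "  {\n"
        f'    name: "{name}"\n'
        f"    data_type: {dtype}\n"
        f"    dims: [ {dims_text} ]\n"
        "  }"
    )


def render_config(name, backend, max_batch_size, dtype="TYPE_FP32", dims=(16,)):
    """Render a two-input, two-output model configuration (hypothetical helper)."""
    inputs = ",\n".join(render_tensor(n, dtype, dims) for n in ("INPUT0", "INPUT1"))
    outputs = ",\n".join(render_tensor(n, dtype, dims) for n in ("OUTPUT0", "OUTPUT1"))
    return (
        f'name: "{name}"\n'
        f'backend: "{backend}"\n'
        f"max_batch_size: {max_batch_size}\n"
        f"input [\n{inputs}\n]\n"
        f"output [\n{outputs}\n]\n"
    )


# Batch variant: the server may batch up to 8 requests together.
batch_cfg = render_config("onnx_float32_float32_float32", "onnxruntime", 8)
# No-batch variant: max_batch_size 0 means the model has no batch dimension.
nobatch_cfg = render_config("onnx_nobatch_float32_float32_float32", "onnxruntime", 0)
```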

Input/output topology: Standard QA models implement simple arithmetic operations (add, subtract) with two inputs and two outputs, providing a deterministic relationship between inputs and outputs that enables automated correctness validation. Identity models pass inputs through unchanged, enabling tests to isolate server-level behavior (batching, scheduling, memory management) from model computation. Reshape models test tensor shape transformation across the server boundary.
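Because the arithmetic is deterministic, a test can compute the expected outputs without running any backend. The sketch below assumes that the first output carries the sum and the second the difference; that ordering and the function names are assumptions made for illustration.

```python
def expected_add_sub(input0, input1):
    """Reference outputs for an add/sub model (output order is an assumption)."""
    if len(input0) != len(input1):
        raise ValueError("inputs must have the same length")
    output0 = [a + b for a, b in zip(input0, input1)]  # element-wise sum
    output1 = [a - b for a, b in zip(input0, input1)]  # element-wise difference
    return output0, output1


def expected_identity(input0):
    """Reference output for an identity model: an unchanged copy of the input."""
    return list(input0)


sums, diffs = expected_add_sub([1, 2, 3], [4, 5, 6])
```

A test would compare the server's response against these reference values for every backend and data type, so a mismatch points at the server or backend rather than at the test logic.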

Sequence model semantics: Sequence models maintain per-sequence state across requests, requiring control inputs (START, END, READY, CORR_ID) to manage the sequence lifecycle. The sequence model generators produce models that implement accumulator logic, enabling tests to verify that sequence state is correctly initialized, updated, and finalized. Dynamic sequence models extend this with runtime correlation ID assignment. Implicit state models use Triton's implicit state management rather than explicit control inputs, testing a different state propagation path.
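The accumulator behavior that tests check can be simulated without a server. The class below is a simplified sketch keyed by correlation ID; real sequence models keep state per batch slot inside the backend, and the class and parameter names here are hypothetical.

```python
class SequenceAccumulator:
    """Server-free simulation of accumulator logic driven by control flags."""

    def __init__(self):
        self._state = {}  # correlation ID -> running sum

    def step(self, corr_id, value, start=False, end=False, ready=True):
        if not ready:
            return None  # slot carries no request this step; state is untouched
        if start:
            self._state[corr_id] = 0  # START resets the accumulator
        if corr_id not in self._state:
            raise KeyError(f"sequence {corr_id} was never started")
        self._state[corr_id] += value
        result = self._state[corr_id]
        if end:
            del self._state[corr_id]  # END releases the sequence state
        return result

    def active(self):
        """Return the correlation IDs that still hold state."""
        return sorted(self._state)


acc = SequenceAccumulator()
# Two interleaved sequences must not leak state into each other.
outputs = [
    acc.step(1, 5, start=True),
    acc.step(2, 10, start=True),
    acc.step(1, 3),
    acc.step(1, 2, end=True),
    acc.step(2, 1, end=True),
]
```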

Ragged tensor support: Ragged models test variable-length input batching, where different elements in a batch have different sequence lengths. This exercises Triton's ragged batching infrastructure, which must correctly handle per-element offsets and lengths.
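The offset bookkeeping involved in ragged batching can be illustrated with plain lists. This is a conceptual sketch only: the function names are hypothetical and the code does not reproduce Triton's internal batching implementation.

```python
def pack_ragged(requests):
    """Concatenate variable-length requests and record cumulative offsets."""
    flat = []
    offsets = [0]
    for request in requests:
        flat.extend(request)
        offsets.append(len(flat))  # end offset of this request in the flat buffer
    return flat, offsets


def unpack_ragged(flat, offsets):
    """Recover the per-request slices from the flat buffer and its offsets."""
    return [flat[start:end] for start, end in zip(offsets, offsets[1:])]


requests = [[1, 2, 3], [4], [5, 6]]
flat, offsets = pack_ragged(requests)
restored = unpack_ragged(flat, offsets)
```

A ragged model test would check the equivalent round trip through the server: each element's data must come back intact even though no padding was applied.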

TensorRT-specific features: The TRT format and plugin generators test TensorRT-specific capabilities including custom plugins, format constraints (linear vs. vectorized channel memory layouts such as CHW4 or CHW32), and dynamic shape profiles. These generators must interact with the TensorRT builder API to construct optimized engine files.

Ensemble assembly: The ensemble model utilities (gen_ensemble_model_utils.py) do not generate individual backend models but instead create ensemble model configurations that wire together models produced by the other generators. These utilities support simple (linear pipeline), sequence (stateful pipeline), and fan (parallel branch) ensemble topologies, and validate that the composing models are compatible with the ensemble's declared input/output signatures.
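The compatibility check for a simple linear pipeline can be sketched as a walk over the steps, comparing each step's input signature with what the previous stage provides. The function and the dictionary layout below are hypothetical and are not taken from gen_ensemble_model_utils.py.

```python
def check_linear_ensemble(ensemble_input, ensemble_output, steps):
    """Return mismatch messages for a linear pipeline (empty list if compatible).

    Each signature is a (data_type, dims) tuple. Each step is a dict with
    'model', 'input', and 'output' keys. All names are illustrative.
    """
    problems = []
    upstream = ("ensemble input", ensemble_input)
    for step in steps:
        source, signature = upstream
        if step["input"] != signature:
            problems.append(
                f"{step['model']}: expects {step['input']} "
                f"but {source} provides {signature}"
            )
        upstream = (f"{step['model']} output", step["output"])
    source, signature = upstream
    if signature != ensemble_output:
        problems.append(
            f"ensemble output declares {ensemble_output} "
            f"but {source} provides {signature}"
        )
    return problems


fp32 = ("TYPE_FP32", (16,))
int32 = ("TYPE_INT32", (16,))
pipeline = [
    {"model": "stage_a", "input": fp32, "output": fp32},
    {"model": "stage_b", "input": fp32, "output": int32},
]
ok_problems = check_linear_ensemble(fp32, int32, pipeline)
bad_problems = check_linear_ensemble(fp32, fp32, pipeline)
```

Collecting messages instead of raising on the first mismatch reports every wiring problem in one pass, which is convenient when many ensemble variants are generated together.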

Implementation Details

All generators follow a common pattern: they accept command-line arguments specifying the output directory, target backends, and model configuration parameters, then programmatically construct model files and config.pbtxt configuration files. For TensorRT models, this involves building TRT engines via the TensorRT Python API. For ONNX models, the onnx.helper API is used to define computational graphs. For LibTorch models, torch.jit.script or torch.jit.trace produces serialized TorchScript files. For OpenVINO models, the openvino_save_model utility from gen_common.py handles IR serialization.
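The common pattern can be sketched as a small command-line skeleton. The flag names, model names, and functions below are hypothetical and only illustrate the flow; the real generators also write the serialized model file into the version directory, which this sketch leaves empty.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse hypothetical generator options (not the real generators' flags)."""
    parser = argparse.ArgumentParser(description="Sketch of a QA model generator")
    parser.add_argument("--models_dir", required=True, help="output model repository")
    parser.add_argument("--backend", action="append", default=[],
                        choices=["onnx", "libtorch", "openvino", "tensorrt"],
                        help="backend to generate for; may be repeated")
    parser.add_argument("--max_batch", type=int, default=8)
    return parser.parse_args(argv)


def generate(args):
    """Create <models_dir>/<model>/config.pbtxt plus an empty version directory."""
    created = []
    for backend in args.backend:
        model_name = f"{backend}_sketch_model"
        model_dir = os.path.join(args.models_dir, model_name)
        # Version subdirectory "1" is where the model file would be written.
        os.makedirs(os.path.join(model_dir, "1"), exist_ok=True)
        config_path = os.path.join(model_dir, "config.pbtxt")
        with open(config_path, "w") as config_file:
            config_file.write(f'name: "{model_name}"\n')
            config_file.write(f"max_batch_size: {args.max_batch}\n")
        created.append(config_path)
    return created


if __name__ == "__main__":
    for path in generate(parse_args()):
        print(path)
```

Running the script with, for example, --models_dir /tmp/models --backend onnx --backend libtorch would create one model directory per requested backend.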

The generators are invoked by CI pipeline scripts before test execution, populating versioned data directories (e.g., /data/inferenceserver/${REPO_VERSION}/qa_model_repository/) that the test scripts then reference. This separation between model generation and test execution allows models to be generated once and reused across multiple test suites.

Generator | Model Type | Primary Test Coverage
GenQaModels | Standard arithmetic (add/sub) | Basic inference correctness across all backends and dtypes
GenQaSequenceModels | Stateful sequence accumulator | Sequence batching lifecycle and state management
GenQaDynaSequenceModels | Dynamic correlation ID sequences | Runtime sequence slot assignment and correlation
GenQaDynaSequenceImplicitModels | Dynamic sequences with implicit state | Implicit state management with dynamic correlation
GenQaIdentityModels | Identity passthrough | Server-level batching, scheduling, and memory tests
GenQaImplicitModels | Implicit state models | Triton-managed state persistence across requests
GenQaRaggedModels | Variable-length batched inputs | Ragged batching offset and length handling
GenQaReshapeModels | Tensor shape transformation | Shape propagation and reshape correctness
GenQaTrtFormatModels | TensorRT format constraints | TRT memory layout and format-specific execution
GenQaTrtPluginModels | TensorRT custom plugins | Plugin registration, execution, and serialization
GenEnsembleModelUtils | Ensemble DAG configurations | Pipeline wiring, type validation, and topology assembly

Related Pages

Implementation:Triton_inference_server_Server_GenQaModels
Implementation:Triton_inference_server_Server_GenQaSequenceModels
Implementation:Triton_inference_server_Server_GenQaDynaSequenceModels
Implementation:Triton_inference_server_Server_GenQaDynaSequenceImplicitModels
Implementation:Triton_inference_server_Server_GenQaIdentityModels
Implementation:Triton_inference_server_Server_GenQaImplicitModels
Implementation:Triton_inference_server_Server_GenQaRaggedModels
Implementation:Triton_inference_server_Server_GenQaReshapeModels
Implementation:Triton_inference_server_Server_GenQaTrtFormatModels
Implementation:Triton_inference_server_Server_GenQaTrtPluginModels
Implementation:Triton_inference_server_Server_GenEnsembleModelUtils
Triton_inference_server_Server
