Principle:Triton inference server Server Inference Correctness Testing
Overview
Inference Correctness Testing is the QA principle that validates the numerical accuracy and behavioral correctness of inference results produced by Triton Inference Server across all supported execution modes. This encompasses standard inference, CUDA graph-accelerated execution, and dynamic shape handling for models with variable-dimension inputs. Because Triton sits between the client and the model execution runtime, it must ensure that the transformations it applies to request data -- batching, memory transfers, padding, shape manipulation -- preserve the mathematical properties of the inference operation. A server that is fast but numerically wrong is worse than useless; it is dangerous.
Theoretical Basis
The Correctness Guarantee
The fundamental contract of an inference server is: given the same model and the same input, the server must produce the same output as running the model directly in its native framework (TensorRT, ONNX Runtime, PyTorch, etc.), within the bounds of floating-point non-determinism inherent to the execution platform. Testing this contract requires:
- Golden output comparison: Running inference through Triton and directly through the framework, then comparing outputs element-by-element with appropriate tolerance thresholds (absolute and relative error bounds that account for floating-point non-associativity in parallel reductions).
- Datatype fidelity: Verifying that all supported datatypes, including FP32, FP16, INT8, INT16, INT32, INT64, BOOL, and string (BYTES), are handled correctly through the entire pipeline without implicit type coercion or precision loss.
- Batch correctness: Ensuring that the response for the i-th request in a dynamically assembled batch matches what would have been returned for that request in isolation (exactly, or within floating-point tolerance where the backend selects different kernels for different batch sizes). This is the single most important correctness property for a batching inference server.
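The tolerance and batch-correctness checks above can be sketched in a few lines of plain Python. This is a minimal illustration, not Triton's test code: `within_tolerance`, `batch_mismatches`, and the `run_model` callable are hypothetical names, tensors are modeled as flat lists of floats, and the tolerance rule is the combined absolute/relative bound `|a - e| <= atol + rtol * |e|` (the same form `numpy.allclose` uses).

```python
from typing import Callable, List, Sequence

Row = List[float]


def within_tolerance(actual: Sequence[float], expected: Sequence[float],
                     rtol: float = 1e-5, atol: float = 1e-6) -> bool:
    """Element-wise check: |a - e| <= atol + rtol * |e|; NaN never passes."""
    if len(actual) != len(expected):
        return False
    return all(abs(a - e) <= atol + rtol * abs(e)
               for a, e in zip(actual, expected))


def batch_mismatches(run_model: Callable[[List[Row]], List[Row]],
                     requests: List[Row],
                     rtol: float = 1e-5, atol: float = 1e-6) -> List[int]:
    """Return indices whose batched result differs from the isolated result.

    run_model is a hypothetical stand-in for "send this batch for inference";
    it takes a list of request rows and returns one output row per request.
    """
    batched = run_model(requests)
    mismatches = []
    for i, request in enumerate(requests):
        isolated = run_model([request])[0]  # same request, batch of one
        if not within_tolerance(batched[i], isolated, rtol, atol):
            mismatches.append(i)
    return mismatches


# Toy models: one correct, one that returns responses in the wrong order.
def toy_model(batch: List[Row]) -> List[Row]:
    return [[2.0 * x for x in row] for row in batch]


def buggy_model(batch: List[Row]) -> List[Row]:
    return list(reversed(toy_model(batch)))
```

A batch of three requests passed to `batch_mismatches` with `buggy_model` flags the first and last positions, since only the middle response stays in place when the order is reversed.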
CUDA Graph Execution
CUDA graphs are a GPU acceleration mechanism that captures a sequence of GPU operations (kernel launches, memory copies) into a replayable graph, eliminating the CPU overhead of launching individual kernels on every inference. Triton exposes CUDA graph capture as a model-configuration optimization, primarily for the TensorRT backend. However, CUDA graphs introduce strict constraints:
- Fixed shapes: A captured CUDA graph is bound to specific input shapes. If a request arrives with a shape that does not match any captured graph, Triton must fall back to regular execution. Testing must verify that this fallback is seamless and correct.
- Memory address stability: CUDA graphs capture device memory pointers. If the memory layout changes between capture and replay, the graph produces garbage output. Tests must verify that Triton's memory manager maintains address stability for CUDA graph-eligible buffers.
- Graph cache management: Triton maintains a cache of captured graphs keyed by input shape combinations. Tests must verify that cache lookup is correct (no false hits from shape hash collisions), that cache eviction under memory pressure does not cause errors, and that graph re-capture after eviction produces correct results.
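The lookup, fallback, and eviction behavior described above can be modeled with a toy cache. This is a sketch under stated assumptions, not Triton's implementation: `ShapeKeyedGraphCache` is a hypothetical class, a "graph" is any Python callable, eviction is simple oldest-first, and the cache key is the full tuple of input shapes rather than a hash of them, so two different shape combinations can never produce a false hit.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Shape = Tuple[int, ...]
ShapeKey = Tuple[Shape, ...]
Graph = Callable[[List[float]], List[float]]


class ShapeKeyedGraphCache:
    """Toy stand-in for a captured-graph cache (hypothetical, for testing ideas)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._graphs: Dict[ShapeKey, Graph] = {}  # dicts keep insertion order
        self.hits = 0
        self.fallbacks = 0

    @staticmethod
    def _key(shapes: Sequence[Sequence[int]]) -> ShapeKey:
        # Use the exact shapes as the key, not a hash, to rule out collisions.
        return tuple(tuple(shape) for shape in shapes)

    def capture(self, shapes: Sequence[Sequence[int]], graph: Graph) -> None:
        key = self._key(shapes)
        if key not in self._graphs and len(self._graphs) >= self.capacity:
            oldest = next(iter(self._graphs))  # evict the oldest entry
            del self._graphs[oldest]
        self._graphs[key] = graph

    def run(self, shapes: Sequence[Sequence[int]], inputs: List[float],
            fallback: Graph) -> List[float]:
        graph = self._graphs.get(self._key(shapes))
        if graph is None:
            self.fallbacks += 1  # no captured graph: regular execution
            return fallback(inputs)
        self.hits += 1
        return graph(inputs)

    def cached_keys(self) -> List[ShapeKey]:
        return list(self._graphs)
```

A test built on this model would drive a sequence of shapes through `run`, compare every result against the fallback path, and check the `hits` and `fallbacks` counters to confirm which path actually executed.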
Dynamic Shape Handling
Models with dynamic shapes (e.g., variable sequence length, variable image resolution) present unique challenges:
- TensorRT dynamic shapes: TensorRT optimization profiles define minimum, optimal, and maximum dimensions for each dynamic axis. Triton must select the correct optimization profile for each request's input shapes. An incorrect profile selection can cause the request to fail or produce wrong results (if the shapes fall outside the profile's minimum-to-maximum range) or suboptimal performance (if a less efficient profile is selected).
- Shape tensor inputs: Some models accept "shape tensors" that describe the dimensions of other inputs. These must be handled distinctly from data tensors -- they reside in CPU memory even when data tensors are on GPU, and they must be correctly propagated to the backend.
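- Ragged batching: Requests with different dynamic dimensions cannot be stacked into a single rectangular batch as-is. Either the inputs are padded to a common shape, or Triton's ragged batching feature concatenates them without padding and passes per-request size information to the backend through additional batch inputs. In both cases, the padding values (where used) and the mechanism for communicating actual (non-padded) lengths to the backend must be correct.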
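Profile selection and boundary-shape testing can be illustrated with a small model of optimization profiles. The sketch below is hypothetical: `Profile`, `select_profile`, and `boundary_shapes` are illustrative names, and "pick the first profile whose range contains the shape" is an assumed policy chosen for simplicity, not a description of how Triton or TensorRT actually chooses.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Shape = Tuple[int, ...]


@dataclass(frozen=True)
class Profile:
    """Minimum, optimal, and maximum dimensions for one input."""
    min_shape: Shape
    opt_shape: Shape
    max_shape: Shape

    def contains(self, shape: Shape) -> bool:
        # A shape fits if it has the same rank and every dim is in [min, max].
        return (len(shape) == len(self.min_shape)
                and all(lo <= dim <= hi for lo, dim, hi
                        in zip(self.min_shape, shape, self.max_shape)))


def select_profile(profiles: List[Profile], shape: Shape) -> Optional[int]:
    """Return the index of the first profile containing shape, else None."""
    for index, profile in enumerate(profiles):
        if profile.contains(shape):
            return index
    return None


def boundary_shapes(profile: Profile) -> Dict[str, Shape]:
    """Shapes worth testing: min, opt, max, and one step past max on the last axis."""
    over_max = profile.max_shape[:-1] + (profile.max_shape[-1] + 1,)
    return {
        "min": profile.min_shape,
        "opt": profile.opt_shape,
        "max": profile.max_shape,
        "over_max": over_max,
    }
```

With two adjacent profiles covering sequence lengths 1 to 32 and 33 to 128, the `over_max` shape of the first profile should select the second one, while the `over_max` shape of the second should select none, which a test would expect to surface as a rejected request rather than a silent wrong answer.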
Cross-Mode Consistency
A critical testing property is that the same model produces bitwise-identical results (or results within floating-point tolerance) regardless of which execution mode is active:
standard_output = triton_infer(model, input, cuda_graphs=False)
graph_output = triton_infer(model, input, cuda_graphs=True)
assert allclose(standard_output, graph_output, rtol=1e-5, atol=1e-6)
Any divergence between execution modes indicates a bug in graph capture, memory management, or shape handling.
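A slightly fuller version of this comparison distinguishes exact equality from agreement within tolerance and reports the worst-case difference, which helps when triaging a failure. This is a minimal sketch: `compare_modes` and the two inference callables are hypothetical, and in practice each callable would wrap a client call to a deployment of the same model with CUDA graphs disabled or enabled. The two stand-in functions below only differ in summation order, to show how floating-point non-associativity alone can break exact equality.

```python
from typing import Callable, Dict, List, Sequence

Tensor = List[float]
InferFn = Callable[[Tensor], Tensor]


def compare_modes(infer_a: InferFn, infer_b: InferFn, inputs: Sequence[Tensor],
                  rtol: float = 1e-5, atol: float = 1e-6) -> Dict[str, object]:
    """Run every input through both modes and summarize how the outputs differ."""
    exactly_equal = True
    within_tolerance = True
    worst_abs_diff = 0.0
    for tensor in inputs:
        out_a = infer_a(tensor)
        out_b = infer_b(tensor)
        if len(out_a) != len(out_b):
            raise ValueError("output lengths differ between execution modes")
        for a, b in zip(out_a, out_b):
            diff = abs(a - b)
            worst_abs_diff = max(worst_abs_diff, diff)
            exactly_equal = exactly_equal and (a == b)
            within_tolerance = within_tolerance and diff <= atol + rtol * abs(b)
    return {
        "exactly_equal": exactly_equal,
        "within_tolerance": within_tolerance,
        "worst_abs_diff": worst_abs_diff,
    }


# Stand-ins for two execution modes: same math, different accumulation order.
def forward_sum(values: Tensor) -> Tensor:
    total = 0.0
    for value in values:
        total += value
    return [total]


def reverse_sum(values: Tensor) -> Tensor:
    total = 0.0
    for value in reversed(values):
        total += value
    return [total]


report = compare_modes(forward_sum, reverse_sum, [[0.1, 0.2, 0.3]])
```

Here the two sums differ in the last bit, so `report["exactly_equal"]` is false while `report["within_tolerance"]` is true; a difference far above the tolerance would instead point at one of the bug classes listed above.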
| Execution Mode | Key Risk | Verification Method |
|---|---|---|
| Standard | Batching corrupts request/response mapping | Per-request golden comparison |
| CUDA Graph | Stale memory addresses in captured graph | Cross-mode output comparison |
| Dynamic Shape | Wrong optimization profile selected | Boundary shape testing (min/opt/max) |
| Mixed Precision | Implicit FP32-to-FP16 conversion loss | Tolerance-aware numerical comparison |
Related Pages
- Implementation:Triton_inference_server_Server_L0_Infer_Test
- Implementation:Triton_inference_server_Server_L0_Cuda_Graph_Test
- Implementation:Triton_inference_server_Server_L0_Trt_Dynamic_Shape_Test
- Triton_inference_server_Server