Implementation:Triton inference server Server Ensemble Infer Request
| Field | Value |
|---|---|
| Implementation Name | Ensemble_Infer_Request |
| Implements | Principle:Triton_inference_server_Server_Ensemble_Inference |
| Domains | Model_Serving, Inference, Pipeline_Architecture |
| Status | Active |
| Last Updated | 2026-02-13 17:00 GMT |
Overview
This page describes the concrete inference procedure for ensemble models using the tritonclient HTTP/gRPC clients. Ensemble inference uses the same client API as single-model inference; the only difference is that model_name refers to the ensemble model rather than an individual composing model.
Description
Ensemble inference requests are constructed identically to single-model inference requests. The client creates input tensors matching the ensemble's declared inputs, specifies desired outputs, and calls infer() with the ensemble model name. Triton transparently executes the internal DAG and returns the final output tensors.
Key behaviors:
- Model name targeting: The model_name parameter must be the ensemble model name, NOT any composing model name
- Input tensor creation: Inputs must match the ensemble's declared input tensor names, shapes, and data types
- Output selection: Clients can request all or a subset of the ensemble's declared outputs
- Protocol choice: Both gRPC (tritonclient.grpc) and HTTP (tritonclient.http) clients are supported
- Result extraction: Results are extracted using result.as_numpy() for the requested output names
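The matching rules listed under Key behaviors can be checked on the client before a request is sent. The following is a minimal sketch and not part of tritonclient: the ENSEMBLE_INPUTS and ENSEMBLE_OUTPUTS tables and the validate_request helper are hypothetical, and the tensor names, shapes, and data types are assumed from the ensemble_add_sub examples on this page rather than read from a real config.pbtxt.

```python
# Hypothetical description of the ensemble's declared tensors (assumed values,
# taken from the ensemble_add_sub examples, not from an actual config.pbtxt).
ENSEMBLE_INPUTS = {
    "INPUT0": {"shape": [1, 16], "datatype": "FP32"},
    "INPUT1": {"shape": [1, 16], "datatype": "FP32"},
}
ENSEMBLE_OUTPUTS = {"OUTPUT0", "OUTPUT1"}


def validate_request(inputs, requested_outputs):
    """Return a list of problems; an empty list means the request looks consistent.

    inputs: dict mapping tensor name -> {"shape": [...], "datatype": "..."}
    requested_outputs: iterable of output names
    """
    problems = []
    # Every declared input must be supplied.
    for name in ENSEMBLE_INPUTS:
        if name not in inputs:
            problems.append(f"missing input {name}")
    # Every supplied input must match the declared name, shape, and data type.
    for name, spec in inputs.items():
        declared = ENSEMBLE_INPUTS.get(name)
        if declared is None:
            problems.append(f"unknown input {name}")
            continue
        if list(spec["shape"]) != declared["shape"]:
            problems.append(f"{name}: shape {spec['shape']} != {declared['shape']}")
        if spec["datatype"] != declared["datatype"]:
            problems.append(f"{name}: datatype {spec['datatype']} != {declared['datatype']}")
    # Requested outputs may be any subset of the declared outputs.
    for name in requested_outputs:
        if name not in ENSEMBLE_OUTPUTS:
            problems.append(f"unknown output {name}")
    return problems
```

A request that passes this check can still be rejected by the server; the sketch only catches mismatches against the locally assumed declarations.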
Usage
This implementation is used when:
- Sending inference requests to a deployed ensemble model
- Building client applications that consume multi-model pipeline outputs
- Running performance tests against ensemble endpoints
- Integrating ensemble inference into larger application workflows
Code Reference
Source Location
- src/http_server.cc:L3667-3795 (HandleInfer server-side implementation)
- qa/L0_simple_ensemble/ensemble_test.py:L99-176 (Client-side ensemble inference test)
Signature
import tritonclient.grpc as grpcclient
import numpy as np
client = grpcclient.InferenceServerClient(url="localhost:8001")
# Create inputs matching ensemble input specification
input0 = grpcclient.InferInput("INPUT0", [1, 16], "FP32")
input0.set_data_from_numpy(np.random.rand(1, 16).astype(np.float32))
input1 = grpcclient.InferInput("INPUT1", [1, 16], "FP32")
input1.set_data_from_numpy(np.random.rand(1, 16).astype(np.float32))
# Request outputs
output0 = grpcclient.InferRequestedOutput("OUTPUT0")
output1 = grpcclient.InferRequestedOutput("OUTPUT1")
# Infer against ensemble model name (not composing models)
result = client.infer(
    model_name="ensemble_add_sub",
    inputs=[input0, input1],
    outputs=[output0, output1],
)
# Get results
output0_data = result.as_numpy("OUTPUT0")
output1_data = result.as_numpy("OUTPUT1")
Import
import tritonclient.grpc as grpcclient
import numpy as np
Or for HTTP:
import tritonclient.http as httpclient
import numpy as np
Key Parameters
| Parameter | Type | Description |
|---|---|---|
| model_name | string | Ensemble model name (NOT composing model names) |
| inputs | list[InferInput] | List of input tensors matching ensemble input declarations |
| outputs | list[InferRequestedOutput] | List of requested output tensors (can be a subset of ensemble outputs) |
| request_id | string (optional) | Unique identifier for the request |
| sequence_id | int or string (optional) | Sequence identifier for stateful ensemble pipelines |
| model_version | string (optional) | Specific version of the ensemble model to invoke |
HTTP endpoint:
POST /v2/models/<ensemble_name>[/versions/<ver>]/infer on default port 8000 (the /versions/<ver> segment is optional)
gRPC endpoint:
ModelInfer RPC on default port 8001
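For readers who want to see what the HTTP endpoint carries, the sketch below assembles the URL path and a JSON request body in the shape used by the KServe v2 inference protocol that Triton's HTTP endpoint follows. It uses only the standard library; build_infer_request is a hypothetical helper, not a tritonclient function, and in practice tritonclient.http builds this request for you.

```python
import json


def build_infer_request(model_name, inputs, outputs=None, version=None, request_id=None):
    """Build the URL path and JSON body for an HTTP inference request.

    inputs: list of (name, shape, datatype, flat_data) tuples
    outputs: optional list of output names; when omitted, no "outputs"
             field is sent and the server returns all outputs
    """
    # The version segment is optional in the path.
    path = f"/v2/models/{model_name}"
    if version is not None:
        path += f"/versions/{version}"
    path += "/infer"

    body = {
        "inputs": [
            {"name": n, "shape": list(s), "datatype": d, "data": list(data)}
            for n, s, d, data in inputs
        ]
    }
    if outputs:
        body["outputs"] = [{"name": n} for n in outputs]
    if request_id is not None:
        body["id"] = request_id
    return path, json.dumps(body)


# Example: request only OUTPUT0 from the ensemble used on this page.
path, body = build_infer_request(
    "ensemble_add_sub",
    inputs=[
        ("INPUT0", [1, 16], "FP32", [1.0] * 16),
        ("INPUT1", [1, 16], "FP32", [2.0] * 16),
    ],
    outputs=["OUTPUT0"],
)
```

The path targets the ensemble name, never a composing model, which mirrors the model_name rule for the Python clients.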
I/O Contract
Inputs
| Input | Type | Description |
|---|---|---|
| Running server | Triton instance | A running Triton Inference Server with the ensemble model loaded |
| Input tensor data | numpy arrays | Tensor data matching the ensemble's declared input shapes and types |
| Ensemble model name | string | The name of the ensemble model as declared in its config.pbtxt |
Outputs
| Output | Type | Description |
|---|---|---|
| InferResult | tritonclient.grpc.InferResult or tritonclient.http.InferResult | Result object containing output tensors |
| Output tensor data | numpy arrays | Extracted via result.as_numpy("OUTPUT_NAME") |
Usage Examples
gRPC client inference:
import tritonclient.grpc as grpcclient
import numpy as np
client = grpcclient.InferenceServerClient(url="localhost:8001")
# Verify ensemble model is ready
assert client.is_model_ready("ensemble_add_sub")
# Prepare inputs
input0 = grpcclient.InferInput("INPUT0", [1, 16], "FP32")
input0.set_data_from_numpy(np.ones((1, 16), dtype=np.float32))
input1 = grpcclient.InferInput("INPUT1", [1, 16], "FP32")
input1.set_data_from_numpy(np.ones((1, 16), dtype=np.float32) * 2)
# Prepare outputs
output0 = grpcclient.InferRequestedOutput("OUTPUT0")
output1 = grpcclient.InferRequestedOutput("OUTPUT1")
# Execute inference
result = client.infer(
    model_name="ensemble_add_sub",
    inputs=[input0, input1],
    outputs=[output0, output1],
)
# Extract results
add_result = result.as_numpy("OUTPUT0") # INPUT0 + INPUT1
sub_result = result.as_numpy("OUTPUT1") # INPUT0 - INPUT1
print(f"Add result: {add_result}")
print(f"Sub result: {sub_result}")
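To sanity-check the tensors returned by the gRPC example, the expected values can be computed locally. This is a sketch under the assumption stated in that example's comments (OUTPUT0 is the element-wise sum and OUTPUT1 the element-wise difference). The helpers expected_add_sub and matches are hypothetical, and plain nested lists stand in for numpy arrays so the block runs without a server.

```python
import math


def expected_add_sub(a, b):
    """Element-wise sum and difference of two equally shaped 2-D nested lists."""
    total = [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]
    diff = [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]
    return total, diff


def matches(actual, expected, tol=1e-6):
    """True if two 2-D nested lists have the same shape and close values."""
    if len(actual) != len(expected):
        return False
    for row_a, row_e in zip(actual, expected):
        if len(row_a) != len(row_e):
            return False
        if not all(math.isclose(x, y, abs_tol=tol) for x, y in zip(row_a, row_e)):
            return False
    return True


# Same data as the example: INPUT0 is all ones, INPUT1 is all twos.
input0 = [[1.0] * 16]
input1 = [[2.0] * 16]
expected_sum, expected_diff = expected_add_sub(input0, input1)
```

With a live server, the comparison would be matches(add_result.tolist(), expected_sum) and matches(sub_result.tolist(), expected_diff), since tolist() on a numpy array yields the same nested-list structure.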
HTTP client inference:
import tritonclient.http as httpclient
import numpy as np
client = httpclient.InferenceServerClient(url="localhost:8000")
input0 = httpclient.InferInput("INPUT0", [1, 16], "FP32")
input0.set_data_from_numpy(np.ones((1, 16), dtype=np.float32))
input1 = httpclient.InferInput("INPUT1", [1, 16], "FP32")
input1.set_data_from_numpy(np.ones((1, 16), dtype=np.float32) * 2)
output0 = httpclient.InferRequestedOutput("OUTPUT0")
# Request only one output (partial output request)
result = client.infer(
    model_name="ensemble_add_sub",
    inputs=[input0, input1],
    outputs=[output0],
)
add_result = result.as_numpy("OUTPUT0")
print(f"Add result: {add_result}")
Async gRPC inference with callback:
import threading
import tritonclient.grpc as grpcclient
import numpy as np
# Signals the main thread once the callback has run
done = threading.Event()
def callback(result, error):
    # Called by the client once the response (or an error) arrives
    if error:
        print(f"Error: {error}")
    else:
        output = result.as_numpy("OUTPUT0")
        print(f"Async result: {output}")
    done.set()
client = grpcclient.InferenceServerClient(url="localhost:8001")
input0 = grpcclient.InferInput("INPUT0", [1, 16], "FP32")
input0.set_data_from_numpy(np.ones((1, 16), dtype=np.float32))
input1 = grpcclient.InferInput("INPUT1", [1, 16], "FP32")
input1.set_data_from_numpy(np.ones((1, 16), dtype=np.float32) * 2)
client.async_infer(
    model_name="ensemble_add_sub",
    inputs=[input0, input1],
    callback=callback,
    outputs=[grpcclient.InferRequestedOutput("OUTPUT0")],
)
# async_infer returns immediately; wait (up to 10 s) so the script does not exit before the callback runs
done.wait(timeout=10)