Implementation:Triton inference server Server Ensemble Infer Request

Field	Value
Implementation Name	Ensemble_Infer_Request
Implements	Principle:Triton_inference_server_Server_Ensemble_Inference
Domains	Model_Serving, Inference, Pipeline_Architecture
Status	Active
Last Updated	2026-02-13 17:00 GMT

Overview

Concrete inference procedure for ensemble models using the tritonclient HTTP and gRPC clients. Ensemble inference uses the same client API as single-model inference; the only difference is that model_name refers to the ensemble model rather than to one of its composing models.

Description

Ensemble inference requests are constructed identically to single-model inference requests. The client creates input tensors matching the ensemble's declared inputs, specifies desired outputs, and calls infer() with the ensemble model name. Triton transparently executes the internal DAG and returns the final output tensors.
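
As a rough mental model of "executes the internal DAG", the toy sketch below routes named tensors between steps in plain Python. It is not Triton's ensemble scheduler. The step functions, the STEPS table, and the run_ensemble helper are all hypothetical, and the assumption that the pipeline has one add step and one subtract step is made only to mirror the ensemble_add_sub example used on this page.

```python
# Toy illustration only: this is NOT Triton's ensemble scheduler.
# Step functions, STEPS, and run_ensemble are hypothetical names.

def add_step(a, b):
    return [x + y for x, y in zip(a, b)]

def sub_step(a, b):
    return [x - y for x, y in zip(a, b)]

# Each step: (function, names of the tensors it consumes, name of the tensor it produces)
STEPS = [
    (add_step, ("INPUT0", "INPUT1"), "OUTPUT0"),
    (sub_step, ("INPUT0", "INPUT1"), "OUTPUT1"),
]

def run_ensemble(request_inputs, requested_outputs):
    """Run every step whose inputs are available until all steps have run."""
    tensors = dict(request_inputs)
    pending = list(STEPS)
    while pending:
        runnable = [s for s in pending if all(n in tensors for n in s[1])]
        if not runnable:
            raise ValueError("some step inputs can never be satisfied")
        for step in runnable:
            func, in_names, out_name = step
            tensors[out_name] = func(*(tensors[n] for n in in_names))
            pending.remove(step)
    # The client only sees the outputs it asked for, not intermediate tensors
    return {name: tensors[name] for name in requested_outputs}

result = run_ensemble(
    {"INPUT0": [1.0] * 4, "INPUT1": [2.0] * 4},
    requested_outputs=["OUTPUT0"],
)
```

The point of the sketch is the client's view: it supplies the ensemble's inputs by name and receives only the requested outputs, regardless of how many steps run in between.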

Key behaviors:

Model name targeting — The model_name parameter must be the ensemble model name, NOT any composing model name
Input tensor creation — Inputs must match the ensemble's declared input tensor names, shapes, and data types
Output selection — Clients can request all or a subset of the ensemble's declared outputs
Protocol choice — Both gRPC (tritonclient.grpc) and HTTP (tritonclient.http) clients are supported
Result extraction — Results are extracted using result.as_numpy() for the requested output names
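
The input-matching rule above can be checked on the client before a request is sent. The following is a minimal sketch, assuming the declared inputs are available as a list of dictionaries with name, datatype, and shape keys (the shape used by v2 model metadata, where -1 marks a variable-size dimension). The check_inputs helper and the DECLARED values are hypothetical, and the sketch treats every declared input as required, which may not hold for models that declare optional inputs.

```python
# Hypothetical client-side pre-flight check; not part of tritonclient.

def check_inputs(declared, provided):
    """Return a list of human-readable mismatches (empty when everything matches)."""
    problems = []
    declared_by_name = {t["name"]: t for t in declared}
    provided_names = {t["name"] for t in provided}

    # Assumption: every declared input is required
    for name in declared_by_name:
        if name not in provided_names:
            problems.append(f"missing input {name}")

    for tensor in provided:
        spec = declared_by_name.get(tensor["name"])
        if spec is None:
            problems.append(f"unexpected input {tensor['name']}")
            continue
        if tensor["datatype"] != spec["datatype"]:
            problems.append(
                f"{tensor['name']}: datatype {tensor['datatype']} != {spec['datatype']}"
            )
        same_rank = len(tensor["shape"]) == len(spec["shape"])
        # A declared dimension of -1 accepts any size
        dims_ok = same_rank and all(
            want in (-1, got) for want, got in zip(spec["shape"], tensor["shape"])
        )
        if not dims_ok:
            problems.append(
                f"{tensor['name']}: shape {tensor['shape']} does not fit {spec['shape']}"
            )
    return problems

# Assumed declaration for the ensemble_add_sub example on this page
DECLARED = [
    {"name": "INPUT0", "datatype": "FP32", "shape": [-1, 16]},
    {"name": "INPUT1", "datatype": "FP32", "shape": [-1, 16]},
]

ok = check_inputs(DECLARED, [
    {"name": "INPUT0", "datatype": "FP32", "shape": [1, 16]},
    {"name": "INPUT1", "datatype": "FP32", "shape": [1, 16]},
])
bad = check_inputs(DECLARED, [{"name": "INPUT0", "datatype": "INT32", "shape": [1, 8]}])
```

Here ok is an empty list, while bad reports the missing INPUT1 plus the datatype and shape mismatches on INPUT0.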

Usage

This implementation is used when:

Sending inference requests to a deployed ensemble model
Building client applications that consume multi-model pipeline outputs
Running performance tests against ensemble endpoints
Integrating ensemble inference into larger application workflows

Code Reference

Source Location

src/http_server.cc:L3667-3795 — HandleInfer server-side implementation
qa/L0_simple_ensemble/ensemble_test.py:L99-176 — Client-side ensemble inference test

Signature

import tritonclient.grpc as grpcclient
import numpy as np

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Create inputs matching ensemble input specification
input0 = grpcclient.InferInput("INPUT0", [1, 16], "FP32")
input0.set_data_from_numpy(np.random.rand(1, 16).astype(np.float32))

input1 = grpcclient.InferInput("INPUT1", [1, 16], "FP32")
input1.set_data_from_numpy(np.random.rand(1, 16).astype(np.float32))

# Request outputs
output0 = grpcclient.InferRequestedOutput("OUTPUT0")
output1 = grpcclient.InferRequestedOutput("OUTPUT1")

# Infer against ensemble model name (not composing models)
result = client.infer(
    model_name="ensemble_add_sub",
    inputs=[input0, input1],
    outputs=[output0, output1]
)

# Get results
output0_data = result.as_numpy("OUTPUT0")
output1_data = result.as_numpy("OUTPUT1")

Import

import tritonclient.grpc as grpcclient
import numpy as np

Or for HTTP:

import tritonclient.http as httpclient
import numpy as np

Key Parameters

Parameter	Type	Description
model_name	string	Ensemble model name (NOT composing model names)
inputs	list[InferInput]	List of input tensors matching ensemble input declarations
outputs	list[InferRequestedOutput]	List of requested output tensors (can be a subset of ensemble outputs)
request_id	string (optional)	Unique identifier for the request
sequence_id	int or string (optional)	Sequence identifier for stateful ensemble pipelines; the default (0 or an empty string) means the request is not part of a sequence
model_version	string (optional)	Specific version of the ensemble model to invoke
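
The optional parameters are usually set together when a stateful pipeline receives several related requests. The sketch below builds the keyword arguments for each request in one sequence. The sequence_kwargs helper, the request_id format, and the model name my_stateful_ensemble are hypothetical; sequence_start and sequence_end are additional infer() keyword arguments not listed in the table above.

```python
# Hypothetical helper; only the keyword names are taken from the tritonclient infer() call.

def sequence_kwargs(model_name, sequence_id, num_requests, model_version=""):
    """Yield one dict of infer() keyword arguments per request in a sequence."""
    for i in range(num_requests):
        yield {
            "model_name": model_name,
            "model_version": model_version,       # empty string lets the server choose
            "request_id": f"{sequence_id}-{i}",   # assumed naming scheme
            "sequence_id": sequence_id,
            "sequence_start": i == 0,
            "sequence_end": i == num_requests - 1,
        }

calls = list(sequence_kwargs("my_stateful_ensemble", sequence_id=42, num_requests=3))
```

Each dictionary could then be passed along with the tensors, for example client.infer(inputs=[...], outputs=[...], **calls[0]).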

HTTP endpoint:

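POST /v2/models/<ensemble_name>/versions/<ver>/infer on default port 8000 (the /versions/<ver> segment may be omitted, in which case Triton selects the version according to the model's version policy)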
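
For clients that cannot use tritonclient, the same request can be assembled by hand. The sketch below builds the URL and a JSON body in the v2 inference protocol layout (inputs with name, shape, datatype, and row-major flattened data, plus an optional outputs list). The build_infer_request helper is hypothetical, and the tensor names and shapes are assumed from the ensemble_add_sub example on this page.

```python
import json

def build_infer_request(base_url, model_name, inputs, outputs=None, version=None):
    """Return (url, json_body) for a v2 protocol inference request.

    `inputs` is a list of (name, shape, datatype, flat_data) tuples.
    """
    path = f"/v2/models/{model_name}"
    if version:
        path += f"/versions/{version}"
    url = base_url.rstrip("/") + path + "/infer"

    body = {
        "inputs": [
            {"name": name, "shape": list(shape), "datatype": datatype, "data": list(data)}
            for name, shape, datatype, data in inputs
        ]
    }
    # Omitting "outputs" asks the server for all outputs the model produces
    if outputs:
        body["outputs"] = [{"name": name} for name in outputs]
    return url, json.dumps(body)

url, payload = build_infer_request(
    "http://localhost:8000",
    "ensemble_add_sub",
    inputs=[
        ("INPUT0", [1, 16], "FP32", [1.0] * 16),
        ("INPUT1", [1, 16], "FP32", [2.0] * 16),
    ],
    outputs=["OUTPUT0"],
)
```

The resulting payload can be sent with any HTTP client as a POST with a Content-Type of application/json.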

gRPC endpoint:

ModelInfer RPC on default port 8001

I/O Contract

Inputs

Input	Type	Description
Running server	Triton instance	A running Triton Inference Server with the ensemble model loaded
Input tensor data	numpy arrays	Tensor data matching the ensemble's declared input shapes and types
Ensemble model name	string	The name of the ensemble model as declared in its `config.pbtxt`

Outputs

Output	Type	Description
InferResult	`tritonclient.grpc.InferResult` or `tritonclient.http.InferResult`	Result object containing output tensors
Output tensor data	numpy arrays	Extracted via `result.as_numpy("OUTPUT_NAME")`

Usage Examples

gRPC client inference:

import tritonclient.grpc as grpcclient
import numpy as np

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Verify ensemble model is ready
assert client.is_model_ready("ensemble_add_sub")

# Prepare inputs
input0 = grpcclient.InferInput("INPUT0", [1, 16], "FP32")
input0.set_data_from_numpy(np.ones((1, 16), dtype=np.float32))

input1 = grpcclient.InferInput("INPUT1", [1, 16], "FP32")
input1.set_data_from_numpy(np.ones((1, 16), dtype=np.float32) * 2)

# Prepare outputs
output0 = grpcclient.InferRequestedOutput("OUTPUT0")
output1 = grpcclient.InferRequestedOutput("OUTPUT1")

# Execute inference
result = client.infer(
    model_name="ensemble_add_sub",
    inputs=[input0, input1],
    outputs=[output0, output1]
)

# Extract results
add_result = result.as_numpy("OUTPUT0")   # INPUT0 + INPUT1
sub_result = result.as_numpy("OUTPUT1")   # INPUT0 - INPUT1

print(f"Add result: {add_result}")
print(f"Sub result: {sub_result}")

HTTP client inference:

import tritonclient.http as httpclient
import numpy as np

client = httpclient.InferenceServerClient(url="localhost:8000")

input0 = httpclient.InferInput("INPUT0", [1, 16], "FP32")
input0.set_data_from_numpy(np.ones((1, 16), dtype=np.float32))

input1 = httpclient.InferInput("INPUT1", [1, 16], "FP32")
input1.set_data_from_numpy(np.ones((1, 16), dtype=np.float32) * 2)

output0 = httpclient.InferRequestedOutput("OUTPUT0")

# Request only one output (partial output request)
result = client.infer(
    model_name="ensemble_add_sub",
    inputs=[input0, input1],
    outputs=[output0]
)

add_result = result.as_numpy("OUTPUT0")
print(f"Add result: {add_result}")

Async gRPC inference with callback:

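import threading

import tritonclient.grpc as grpcclient
import numpy as np

# Signals the main thread once the callback has run
done = threading.Event()

def callback(result, error):
    # Called by the client when the response (or an error) arrives
    if error is not None:
        print(f"Error: {error}")
    else:
        output = result.as_numpy("OUTPUT0")
        print(f"Async result: {output}")
    done.set()

client = grpcclient.InferenceServerClient(url="localhost:8001")

input0 = grpcclient.InferInput("INPUT0", [1, 16], "FP32")
input0.set_data_from_numpy(np.ones((1, 16), dtype=np.float32))

input1 = grpcclient.InferInput("INPUT1", [1, 16], "FP32")
input1.set_data_from_numpy(np.ones((1, 16), dtype=np.float32) * 2)

# async_infer returns immediately; the result is delivered to the callback
client.async_infer(
    model_name="ensemble_add_sub",
    inputs=[input0, input1],
    callback=callback,
    outputs=[grpcclient.InferRequestedOutput("OUTPUT0")]
)

# Wait for the callback; without this the script may exit before the response arrives
done.wait(timeout=10)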

Related Pages

implements::Principle:Triton_inference_server_Server_Ensemble_Inference
