
Implementation:Triton inference server Server Ensemble Infer Request

From Leeroopedia
Field | Value
Implementation Name | Ensemble_Infer_Request
Implements | Principle:Triton_inference_server_Server_Ensemble_Inference
Domains | Model_Serving, Inference, Pipeline_Architecture
Status | Active
Last Updated | 2026-02-13 17:00 GMT

Overview

Concrete procedure for running inference against ensemble models with the tritonclient HTTP and gRPC clients. Ensemble inference uses the same client API as single-model inference; the only difference is that model_name names the ensemble model rather than one of its composing models.

Description

Ensemble inference requests are constructed identically to single-model inference requests. The client creates input tensors matching the ensemble's declared inputs, specifies desired outputs, and calls infer() with the ensemble model name. Triton transparently executes the internal DAG and returns the final output tensors.
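
The data flow described above can be pictured with a toy executor. This is only a conceptual sketch in plain Python, not Triton's scheduler code: the step layout (model_name, input_map, output_map) loosely mirrors an ensemble configuration, and the composing models "add" and "sub" with their tensor names A, B, SUM and DIFF are hypothetical stand-ins.

```python
# Toy illustration only: models the data flow of an ensemble, not Triton internals.

def run_ensemble(steps, models, request_inputs, requested_outputs):
    """Run every step once its inputs exist, then return the requested tensors."""
    tensors = dict(request_inputs)  # pool of ensemble-level tensors
    pending = list(steps)
    while pending:
        ready = [s for s in pending
                 if all(ens in tensors for ens in s["input_map"].values())]
        if not ready:
            raise RuntimeError("some step inputs can never be produced")
        for step in ready:
            # Rename ensemble tensors to the composing model's input names.
            model_inputs = {model_in: tensors[ens]
                            for model_in, ens in step["input_map"].items()}
            model_outputs = models[step["model_name"]](model_inputs)
            # Rename the composing model's outputs back to ensemble tensors.
            for model_out, ens in step["output_map"].items():
                tensors[ens] = model_outputs[model_out]
            pending.remove(step)
    return {name: tensors[name] for name in requested_outputs}


# Hypothetical composing models operating on plain lists of floats.
def add_model(inputs):
    return {"SUM": [a + b for a, b in zip(inputs["A"], inputs["B"])]}


def sub_model(inputs):
    return {"DIFF": [a - b for a, b in zip(inputs["A"], inputs["B"])]}


MODELS = {"add": add_model, "sub": sub_model}
STEPS = [
    {"model_name": "add",
     "input_map": {"A": "INPUT0", "B": "INPUT1"},
     "output_map": {"SUM": "OUTPUT0"}},
    {"model_name": "sub",
     "input_map": {"A": "INPUT0", "B": "INPUT1"},
     "output_map": {"DIFF": "OUTPUT1"}},
]

outputs = run_ensemble(
    STEPS, MODELS,
    {"INPUT0": [1.0, 1.0], "INPUT1": [2.0, 2.0]},
    ["OUTPUT0", "OUTPUT1"],
)
print(outputs)
```

The client never sees the intermediate names (A, B, SUM, DIFF); it only supplies the ensemble inputs and reads the ensemble outputs, which is why the request looks the same as a single-model request.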

Key behaviors:

  • Model name targeting — The model_name parameter must be the ensemble model name, NOT any composing model name
  • Input tensor creation — Inputs must match the ensemble's declared input tensor names, shapes, and data types
  • Output selection — Clients can request all or a subset of the ensemble's declared outputs
  • Protocol choice — Both gRPC (tritonclient.grpc) and HTTP (tritonclient.http) clients are supported
  • Result extraction — Results are extracted using result.as_numpy() for the requested output names
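
The input and output rules above can be checked on the client before a request is sent. The following is a minimal sketch under stated assumptions: check_request and the declaration dictionaries are hypothetical helpers, not part of tritonclient; the declared shapes are taken to be full shapes (including any batch dimension); and -1 is treated as a variable-size dimension.

```python
# Hypothetical declaration, mirroring the tensors used in this page's examples.
ENSEMBLE_INPUTS = {"INPUT0": ([1, 16], "FP32"), "INPUT1": ([1, 16], "FP32")}
ENSEMBLE_OUTPUTS = {"OUTPUT0", "OUTPUT1"}


def shape_matches(shape, declared):
    """True if ranks agree and every fixed declared dimension equals the actual one."""
    return len(shape) == len(declared) and all(
        d == -1 or d == s for s, d in zip(shape, declared)
    )


def check_request(inputs, output_names,
                  declared_inputs=ENSEMBLE_INPUTS,
                  declared_outputs=ENSEMBLE_OUTPUTS):
    """Return a list of problems; an empty list means the request looks consistent.

    inputs: dict of tensor name -> (shape, datatype) the client intends to send.
    output_names: names of the outputs the client intends to request.
    """
    problems = []
    for name in sorted(declared_inputs.keys() - inputs.keys()):
        problems.append(f"missing input {name}")
    for name, (shape, datatype) in inputs.items():
        if name not in declared_inputs:
            problems.append(f"unknown input {name}")
            continue
        want_shape, want_type = declared_inputs[name]
        if datatype != want_type:
            problems.append(f"{name}: datatype {datatype}, expected {want_type}")
        if not shape_matches(shape, want_shape):
            problems.append(f"{name}: shape {shape}, expected {want_shape}")
    for name in output_names:
        if name not in declared_outputs:
            problems.append(f"unknown output {name}")
    return problems


ok = check_request(
    {"INPUT0": ([1, 16], "FP32"), "INPUT1": ([1, 16], "FP32")},
    ["OUTPUT0"],  # requesting a subset of outputs is allowed
)
bad = check_request({"INPUT0": ([1, 8], "INT32")}, ["OUTPUT2"])
print(ok)
print(bad)
```

In practice the declared names, shapes and types would come from the ensemble's config.pbtxt or from the server's model metadata rather than from a hard-coded dictionary.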

Usage

This implementation is used when:

  • Sending inference requests to a deployed ensemble model
  • Building client applications that consume multi-model pipeline outputs
  • Running performance tests against ensemble endpoints
  • Integrating ensemble inference into larger application workflows

Code Reference

Source Location

  • src/http_server.cc:L3667-3795 — HandleInfer server-side implementation
  • qa/L0_simple_ensemble/ensemble_test.py:L99-176 — Client-side ensemble inference test

Signature

import tritonclient.grpc as grpcclient
import numpy as np

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Create inputs matching ensemble input specification
input0 = grpcclient.InferInput("INPUT0", [1, 16], "FP32")
input0.set_data_from_numpy(np.random.rand(1, 16).astype(np.float32))

input1 = grpcclient.InferInput("INPUT1", [1, 16], "FP32")
input1.set_data_from_numpy(np.random.rand(1, 16).astype(np.float32))

# Request outputs
output0 = grpcclient.InferRequestedOutput("OUTPUT0")
output1 = grpcclient.InferRequestedOutput("OUTPUT1")

# Infer against ensemble model name (not composing models)
result = client.infer(
    model_name="ensemble_add_sub",
    inputs=[input0, input1],
    outputs=[output0, output1]
)

# Get results
output0_data = result.as_numpy("OUTPUT0")
output1_data = result.as_numpy("OUTPUT1")

Import

import tritonclient.grpc as grpcclient
import numpy as np

Or for HTTP:

import tritonclient.http as httpclient
import numpy as np

Key Parameters

Parameter | Type | Description
model_name | string | Ensemble model name (NOT composing model names)
inputs | list[InferInput] | List of input tensors matching ensemble input declarations
outputs | list[InferRequestedOutput] | List of requested output tensors (can be a subset of ensemble outputs)
request_id | string (optional) | Unique identifier for the request
sequence_id | int or string (optional) | Sequence identifier for stateful ensemble pipelines; 0 or an empty string means the request is not part of a sequence
model_version | string (optional) | Specific version of the ensemble model to invoke; if omitted, the server chooses the version according to the model's version policy

HTTP endpoint (the /versions/<ver> segment is optional; port 8000 is used in the examples on this page):

POST /v2/models/<ensemble_name>/versions/<ver>/infer
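
For clients that call the HTTP endpoint directly, the request is a JSON document in the KServe v2 inference protocol format. The sketch below builds only the path and body with the standard library and does not contact a server; infer_path and infer_body are hypothetical helper names, and the tensor names and shapes are copied from the examples on this page.

```python
import json
from urllib.parse import quote


def infer_path(model_name, model_version=""):
    """Build the infer path; the version segment is added only when given."""
    path = f"/v2/models/{quote(model_name, safe='')}"
    if model_version:
        path += f"/versions/{quote(model_version, safe='')}"
    return path + "/infer"


def infer_body(inputs, output_names, request_id=""):
    """Serialize inputs and requested outputs as a v2 JSON request body.

    inputs: list of (name, shape, datatype, flat_data) tuples.
    """
    body = {
        "inputs": [
            {"name": name, "shape": shape, "datatype": datatype, "data": data}
            for name, shape, datatype, data in inputs
        ],
        "outputs": [{"name": name} for name in output_names],
    }
    if request_id:
        body["id"] = request_id
    return json.dumps(body)


path = infer_path("ensemble_add_sub", "1")
body = infer_body(
    [("INPUT0", [1, 16], "FP32", [1.0] * 16),
     ("INPUT1", [1, 16], "FP32", [2.0] * 16)],
    ["OUTPUT0"],
    request_id="demo-1",
)
print(path)
print(body[:80])
```

The tritonclient.http client assembles an equivalent request internally, so this form is mainly useful for debugging or for environments where the client library is not available.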

gRPC endpoint:

ModelInfer RPC on default port 8001

I/O Contract

Inputs

Input | Type | Description
Running server | Triton instance | A running Triton Inference Server with the ensemble model loaded
Input tensor data | numpy arrays | Tensor data matching the ensemble's declared input shapes and types
Ensemble model name | string | The name of the ensemble model as declared in its config.pbtxt

Outputs

Output | Type | Description
InferResult | tritonclient.grpc.InferResult or tritonclient.http.InferResult | Result object containing output tensors
Output tensor data | numpy arrays | Extracted via result.as_numpy("OUTPUT_NAME")
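
When several outputs are requested, it can be convenient to gather them into one dictionary and fail loudly if one is absent. The helper below is a sketch, not part of tritonclient: collect_outputs is a hypothetical name, and it assumes that as_numpy returns None for a tensor name that is not in the response. FakeResult is a stand-in used only so the example runs without a server.

```python
def collect_outputs(result, names):
    """Return {output name: array} for every requested name, or raise KeyError."""
    tensors = {}
    for name in names:
        array = result.as_numpy(name)
        if array is None:  # assumed behavior for a name missing from the response
            raise KeyError(f"output {name!r} is not in the response")
        tensors[name] = array
    return tensors


class FakeResult:
    """Hypothetical stand-in for an InferResult, for local demonstration only."""

    def __init__(self, data):
        self._data = data

    def as_numpy(self, name):
        return self._data.get(name)


fake = FakeResult({"OUTPUT0": [3.0, 3.0], "OUTPUT1": [-1.0, -1.0]})
tensors = collect_outputs(fake, ["OUTPUT0", "OUTPUT1"])
print(tensors)
```

With a real client, the object returned by client.infer() would be passed in place of FakeResult, together with the same output names that were listed in the request.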

Usage Examples

gRPC client inference:

import tritonclient.grpc as grpcclient
import numpy as np

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Verify ensemble model is ready
assert client.is_model_ready("ensemble_add_sub")

# Prepare inputs
input0 = grpcclient.InferInput("INPUT0", [1, 16], "FP32")
input0.set_data_from_numpy(np.ones((1, 16), dtype=np.float32))

input1 = grpcclient.InferInput("INPUT1", [1, 16], "FP32")
input1.set_data_from_numpy(np.ones((1, 16), dtype=np.float32) * 2)

# Prepare outputs
output0 = grpcclient.InferRequestedOutput("OUTPUT0")
output1 = grpcclient.InferRequestedOutput("OUTPUT1")

# Execute inference
result = client.infer(
    model_name="ensemble_add_sub",
    inputs=[input0, input1],
    outputs=[output0, output1]
)

# Extract results
add_result = result.as_numpy("OUTPUT0")   # INPUT0 + INPUT1
sub_result = result.as_numpy("OUTPUT1")   # INPUT0 - INPUT1

print(f"Add result: {add_result}")
print(f"Sub result: {sub_result}")

HTTP client inference:

import tritonclient.http as httpclient
import numpy as np

client = httpclient.InferenceServerClient(url="localhost:8000")

input0 = httpclient.InferInput("INPUT0", [1, 16], "FP32")
input0.set_data_from_numpy(np.ones((1, 16), dtype=np.float32))

input1 = httpclient.InferInput("INPUT1", [1, 16], "FP32")
input1.set_data_from_numpy(np.ones((1, 16), dtype=np.float32) * 2)

output0 = httpclient.InferRequestedOutput("OUTPUT0")

# Request only one output (partial output request)
result = client.infer(
    model_name="ensemble_add_sub",
    inputs=[input0, input1],
    outputs=[output0]
)

add_result = result.as_numpy("OUTPUT0")
print(f"Add result: {add_result}")

Async gRPC inference with callback:

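import threading

import tritonclient.grpc as grpcclient
import numpy as np

# Signals the main thread once the callback has run
done = threading.Event()

def callback(result, error):
    # Invoked by the client on a background thread when the response arrives
    try:
        if error:
            print(f"Error: {error}")
        else:
            output = result.as_numpy("OUTPUT0")
            print(f"Async result: {output}")
    finally:
        done.set()

client = grpcclient.InferenceServerClient(url="localhost:8001")

input0 = grpcclient.InferInput("INPUT0", [1, 16], "FP32")
input0.set_data_from_numpy(np.ones((1, 16), dtype=np.float32))

input1 = grpcclient.InferInput("INPUT1", [1, 16], "FP32")
input1.set_data_from_numpy(np.ones((1, 16), dtype=np.float32) * 2)

client.async_infer(
    model_name="ensemble_add_sub",
    inputs=[input0, input1],
    callback=callback,
    outputs=[grpcclient.InferRequestedOutput("OUTPUT0")]
)

# async_infer returns immediately; wait so the script does not exit before the callback runs
if not done.wait(timeout=10):
    print("Timed out waiting for the async response")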
