Principle:Triton inference server Server Ensemble Pipeline Design

Field	Value
Principle Name	Ensemble_Pipeline_Design
Knowledge Sources	Triton Server (https://github.com/triton-inference-server/server), Ensemble Models documentation (https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/ensemble_models.html)
Domains	Model_Serving, Pipeline_Architecture, MLOps
Status	Active
Last Updated	2026-02-13 17:00 GMT

Overview

A method for composing multiple models into a directed acyclic graph (DAG) in which intermediate tensors flow between models automatically. Because tensor routing is defined in the model configuration, multi-model inference needs no client-side orchestration.

Description

Ensemble pipeline design creates multi-model inference pipelines by defining tensor routing between composing models. Each step in the ensemble maps its inputs and outputs to named ensemble tensors, creating a DAG that Triton executes automatically. This eliminates the need for client-side orchestration of multi-model inference and reduces network round trips.

Three topology patterns are supported:

Simple (linear chain) — A sequence of models where the output of one feeds into the next: A → B → C
Sequence (shared intermediate tensors) — Models that share intermediate tensors: A → B, A → C, with shared tensor from A
Fan (parallel branches with merge) — Parallel branches that converge: A → [B, C] → D
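The three patterns can be written down as plain step lists to make the tensor routing concrete. The following is a minimal sketch, not Triton code: the model names A to D, the tensor names, and the edges helper are all hypothetical, and the dictionaries only mirror the model_name / input_map / output_map fields of an ensemble step.

```python
def edges(steps):
    """Return sorted (producer, consumer) model pairs implied by the tensor maps."""
    # output_map: composing model's output name -> ensemble tensor name
    producer_of = {
        tensor: step["model_name"]
        for step in steps
        for tensor in step["output_map"].values()
    }
    # input_map: composing model's input name -> ensemble tensor name
    return sorted({
        (producer_of[tensor], step["model_name"])
        for step in steps
        for tensor in step["input_map"].values()
        if tensor in producer_of  # ensemble inputs have no producing step
    })


def step(model_name, input_map, output_map):
    """Hypothetical helper that builds one step as a dictionary."""
    return {"model_name": model_name, "input_map": input_map, "output_map": output_map}


# Simple: A -> B -> C
SIMPLE = [
    step("A", {"IN": "INPUT"}, {"OUT": "t_a"}),
    step("B", {"IN": "t_a"}, {"OUT": "t_b"}),
    step("C", {"IN": "t_b"}, {"OUT": "OUTPUT"}),
]

# Sequence with a shared intermediate: A -> B and A -> C both read t_a
SHARED = [
    step("A", {"IN": "INPUT"}, {"OUT": "t_a"}),
    step("B", {"IN": "t_a"}, {"OUT": "OUTPUT_B"}),
    step("C", {"IN": "t_a"}, {"OUT": "OUTPUT_C"}),
]

# Fan: A -> [B, C] -> D
FAN = [
    step("A", {"IN": "INPUT"}, {"OUT": "t_a"}),
    step("B", {"IN": "t_a"}, {"OUT": "t_b"}),
    step("C", {"IN": "t_a"}, {"OUT": "t_c"}),
    step("D", {"IN0": "t_b", "IN1": "t_c"}, {"OUT": "OUTPUT"}),
]
```

Calling edges(FAN) lists the four producer/consumer pairs of the fan pattern; the same helper applied to SIMPLE yields a two-edge chain.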

Each step in the ensemble is defined by:

A model_name identifying the composing model
An input_map that maps ensemble tensor names to the composing model's input names
An output_map that maps the composing model's output names to ensemble tensor names
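To show how these three fields appear in an ensemble's config.pbtxt, here is a small sketch that renders an ensemble_scheduling block as text. The render_step and render_scheduling helpers are hypothetical conveniences, and the model and tensor names are illustrative only; the emitted layout follows the key/value map form used in Triton's ensemble documentation.

```python
def render_step(model_name, input_map, output_map, model_version=-1):
    """Render one ensemble step in protobuf text format."""
    lines = [
        "    {",
        f'      model_name: "{model_name}"',
        f"      model_version: {model_version}",
    ]
    for field, mapping in (("input_map", input_map), ("output_map", output_map)):
        for key, value in mapping.items():
            # key: the composing model's tensor name; value: the ensemble tensor name
            lines += [
                f"      {field} {{",
                f'        key: "{key}"',
                f'        value: "{value}"',
                "      }",
            ]
    lines.append("    }")
    return "\n".join(lines)


def render_scheduling(steps):
    """Render the whole ensemble_scheduling block from a list of step dicts."""
    body = ",\n".join(render_step(**s) for s in steps)
    return "ensemble_scheduling {\n  step [\n" + body + "\n  ]\n}"


# Illustrative two-step pipeline; all names are assumptions.
STEPS = [
    {
        "model_name": "preprocess",
        "input_map": {"RAW_IMAGE": "IMAGE"},
        "output_map": {"PREPROCESSED_OUTPUT": "preprocessed_image"},
    },
    {
        "model_name": "classify",
        "input_map": {"INPUT": "preprocessed_image"},
        "output_map": {"PROBABILITIES": "SCORES"},
    },
]

config_text = render_scheduling(STEPS)
```

The ensemble tensor preprocessed_image appears as a value in the first step's output_map and again as a value in the second step's input_map, which is what links the two steps.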

The ensemble model itself has no model files — it exists purely as a configuration that orchestrates other models.
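A configuration-only model can be laid out on disk as sketched below. This assumes the usual Triton repository layout of one directory per model containing config.pbtxt and a version subdirectory, which for an ensemble stays empty; the create_ensemble_dir helper and the name my_ensemble are hypothetical, and the configuration text is deliberately incomplete.

```python
import tempfile
from pathlib import Path


def create_ensemble_dir(repo_root, name, config_text, version="1"):
    """Create <repo_root>/<name>/config.pbtxt plus an empty version directory."""
    model_dir = Path(repo_root) / name
    # The version directory holds no model files for an ensemble.
    (model_dir / version).mkdir(parents=True, exist_ok=True)
    (model_dir / "config.pbtxt").write_text(config_text)
    return model_dir


# Minimal header only; a real config also declares input, output,
# and ensemble_scheduling.
CONFIG = 'name: "my_ensemble"\nplatform: "ensemble"\n'

with tempfile.TemporaryDirectory() as root:
    create_ensemble_dir(root, "my_ensemble", CONFIG)
    created = sorted(p.relative_to(root).as_posix() for p in Path(root).rglob("*"))
```

Here created lists the model directory, its empty version directory, and the configuration file, with no model artifact among them.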

Usage

Ensemble pipeline design is used when:

Multiple models must be chained together for a single inference result (e.g., preprocessing → inference → postprocessing)
Intermediate tensor transfer between models should avoid network round trips
The client should see a single logical model rather than multiple composing models
Custom DAG topologies are needed (fan-out, fan-in, shared intermediates)

Theoretical Basis

The ensemble pipeline is based on DAG scheduling:

Define nodes: Each composing model is a node in the graph
Define edges: Tensor mappings (input_map / output_map) create directed edges between nodes
Execution order: Triton derives the order from the tensor dependencies in the DAG; a step is issued once all of its input tensors are available
Parallel execution: Steps that do not depend on each other's outputs can execute in parallel
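The scheduling idea above can be illustrated with a small simulation that groups steps into waves, where every step in a wave has all of its input tensors available. This is only a sketch of dependency-driven ordering under simplified assumptions: Triton's scheduler issues steps asynchronously rather than in lockstep waves, and the execution_waves function and the model names are hypothetical.

```python
def execution_waves(steps, ensemble_inputs):
    """Group steps into waves; each wave only needs tensors that are already ready."""
    ready = set(ensemble_inputs)  # tensors available so far
    pending = list(steps)
    waves = []
    while pending:
        runnable, blocked = [], []
        for s in pending:
            needs = set(s["input_map"].values())
            (runnable if needs <= ready else blocked).append(s)
        if not runnable:
            names = sorted(s["model_name"] for s in blocked)
            raise ValueError(f"cycle or missing tensor among steps: {names}")
        waves.append(sorted(s["model_name"] for s in runnable))
        for s in runnable:
            ready.update(s["output_map"].values())
        pending = blocked
    return waves


# Fan pattern A -> [B, C] -> D with illustrative tensor names.
FAN_STEPS = [
    {"model_name": "D", "input_map": {"IN0": "t_b", "IN1": "t_c"}, "output_map": {"OUT": "OUTPUT"}},
    {"model_name": "B", "input_map": {"IN": "t_a"}, "output_map": {"OUT": "t_b"}},
    {"model_name": "C", "input_map": {"IN": "t_a"}, "output_map": {"OUT": "t_c"}},
    {"model_name": "A", "input_map": {"IN": "INPUT"}, "output_map": {"OUT": "t_a"}},
]

waves = execution_waves(FAN_STEPS, ["INPUT"])
```

The steps are deliberately listed out of order: the result depends only on which tensors each step reads and writes, not on the order in which the steps are declared.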

Each step has:

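input_map: Each entry's key is the composing model's input name; its value is the ensemble tensor that feeds that input
output_map: Each entry's key is the composing model's output name; its value is the ensemble tensor that receives that output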

Ensemble inputs feed into the first step(s) of the DAG, and ensemble outputs come from the last step(s). The scheduler resolves dependencies automatically and can execute independent steps concurrently.
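Before loading an ensemble, it can help to check that the tensor wiring is closed: every tensor a step reads comes from an ensemble input or another step, and every ensemble output is produced by some step. The checker below is a hypothetical pre-flight sketch, not a reproduction of Triton's own validation; the single-writer rule it applies is an assumption made for this example, and all names are illustrative.

```python
from collections import Counter


def check_ensemble(steps, ensemble_inputs, ensemble_outputs):
    """Return a list of wiring problems; an empty list means the checks passed."""
    problems = []
    produced = Counter(t for s in steps for t in s["output_map"].values())
    available = set(ensemble_inputs) | set(produced)

    # Assumed rule: each ensemble tensor has exactly one writer.
    for tensor, count in sorted(produced.items()):
        if count > 1:
            problems.append(f"tensor '{tensor}' is written by {count} steps")

    # Every tensor a step reads must be an ensemble input or a step output.
    for s in steps:
        for tensor in sorted(set(s["input_map"].values()) - available):
            problems.append(f"step '{s['model_name']}' reads unknown tensor '{tensor}'")

    # Every ensemble output must be written by some step.
    for tensor in sorted(set(ensemble_outputs) - set(produced)):
        problems.append(f"ensemble output '{tensor}' is never produced")
    return problems


GOOD = [
    {"model_name": "preprocess", "input_map": {"RAW": "IMAGE"}, "output_map": {"OUT": "preprocessed"}},
    {"model_name": "classify", "input_map": {"INPUT": "preprocessed"}, "output_map": {"PROBS": "SCORES"}},
]

# Same pipeline with a misspelled intermediate tensor name in the second step.
BAD = [
    GOOD[0],
    {"model_name": "classify", "input_map": {"INPUT": "preprocesed"}, "output_map": {"PROBS": "SCORES"}},
]
```

With ensemble input IMAGE and ensemble output SCORES, check_ensemble(GOOD, ...) returns an empty list, while the misspelled tensor in BAD is reported as an unknown tensor read by the classify step.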

Source: docs/user_guide/ensemble_models.md:L60-123, qa/common/gen_ensemble_model_utils.py:L35-305

Related Pages

Implementation:Triton_inference_server_Server_Ensemble_Scheduling_Schema
