Principle:Triton inference server Server Ensemble Pipeline Design
| Field | Value |
|---|---|
| Principle Name | Ensemble_Pipeline_Design |
| Knowledge Sources | Triton Server (https://github.com/triton-inference-server/server); Ensemble Models documentation (https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/ensemble_models.html) |
| Domains | Model_Serving, Pipeline_Architecture, MLOps |
| Status | Active |
| Last Updated | 2026-02-13 17:00 GMT |
Overview
Ensemble pipeline design is a method of composing multiple models into a directed acyclic graph (DAG) in which intermediate tensors flow between models automatically. Because tensor routing is defined in the model configuration, multi-model inference needs no client-side orchestration.
Description
An ensemble is built by defining tensor routing between composing models. Each step in the ensemble maps its inputs and outputs to named ensemble tensors, and together these mappings form a DAG that Triton executes on the server. The client sends one request and receives one response, so no client-side orchestration is needed and intermediate tensors never cross the network.
Three topology patterns are supported:
- Simple (linear chain): a sequence of models where the output of one feeds into the next: A → B → C
- Sequence (shared intermediate tensors): an intermediate tensor produced by one model is consumed by more than one downstream model: A → B, A → C, both reading the tensor produced by A
- Fan (parallel branches with merge): parallel branches that converge on a single model: A → [B, C] → D
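The three patterns above can be written down as step lists to make the wiring concrete. The following is a minimal sketch in plain Python, not Triton's API: the model names A to D, the tensor names, and the helper `model_edges` are hypothetical and chosen only for illustration.

```python
# Each step is (model_name, input_map, output_map).
# input_map:  composing-model input name  -> ensemble tensor name
# output_map: composing-model output name -> ensemble tensor name
# All model and tensor names below are illustrative assumptions.
SIMPLE = [
    ("A", {"IN": "request_in"}, {"OUT": "a_out"}),
    ("B", {"IN": "a_out"}, {"OUT": "b_out"}),
    ("C", {"IN": "b_out"}, {"OUT": "response_out"}),
]

SEQUENCE = [
    ("A", {"IN": "request_in"}, {"OUT": "a_out"}),
    ("B", {"IN": "a_out"}, {"OUT": "response_b"}),  # B and C both read a_out
    ("C", {"IN": "a_out"}, {"OUT": "response_c"}),
]

FAN = [
    ("A", {"IN": "request_in"}, {"OUT": "a_out"}),
    ("B", {"IN": "a_out"}, {"OUT": "b_out"}),
    ("C", {"IN": "a_out"}, {"OUT": "c_out"}),
    ("D", {"IN0": "b_out", "IN1": "c_out"}, {"OUT": "response_out"}),
]


def model_edges(steps):
    """Return sorted (producer, consumer) pairs implied by shared tensor names."""
    producer = {}
    for name, _, output_map in steps:
        for tensor in output_map.values():
            producer[tensor] = name
    edges = set()
    for name, input_map, _ in steps:
        for tensor in input_map.values():
            # Tensors without a producer are ensemble inputs, not edges.
            if tensor in producer:
                edges.add((producer[tensor], name))
    return sorted(edges)
```

Calling `model_edges(FAN)` yields the four edges A → B, A → C, B → D, and C → D, which is the fan-out and merge described in the list.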
Each step in the ensemble is defined by:
- A model_name identifying the composing model
- An input_map that maps ensemble tensor names to the composing model's input names
- An output_map that maps the composing model's output names to ensemble tensor names
The ensemble model itself has no model files; it exists purely as a configuration, declared with platform "ensemble", that orchestrates other models.
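To show how the three per-step fields fit together, here is a hedged sketch that renders the `ensemble_scheduling` part of a `config.pbtxt` from Python data. The helpers `render_map` and `render_ensemble_config`, the model names `preprocess` and `classifier`, and all tensor names are assumptions made for this example. The top-level `input` and `output` declarations, which a real ensemble configuration also needs, are omitted because their data types and shapes depend on the models.

```python
def render_map(field, mapping, indent):
    """Render one input_map/output_map block per (key, value) pair."""
    pad = " " * indent
    lines = []
    for key, value in mapping.items():
        lines.append(f"{pad}{field} {{")
        lines.append(f'{pad}  key: "{key}"')      # composing-model tensor name
        lines.append(f'{pad}  value: "{value}"')  # ensemble tensor name
        lines.append(f"{pad}}}")
    return lines


def render_ensemble_config(name, steps):
    """Build a partial config.pbtxt: name, platform, and ensemble_scheduling."""
    lines = [
        f'name: "{name}"',
        'platform: "ensemble"',
        "ensemble_scheduling {",
        "  step [",
    ]
    for index, (model_name, input_map, output_map) in enumerate(steps):
        lines.append("    {")
        lines.append(f'      model_name: "{model_name}"')
        # -1 asks for the latest available version of the composing model.
        lines.append("      model_version: -1")
        lines += render_map("input_map", input_map, 6)
        lines += render_map("output_map", output_map, 6)
        lines.append("    }" + ("," if index < len(steps) - 1 else ""))
    lines += ["  ]", "}"]
    return "\n".join(lines)


# Hypothetical two-step chain: preprocess -> classifier.
STEPS = [
    ("preprocess", {"RAW_IMAGE": "IMAGE"}, {"PREPROCESSED": "preprocessed_image"}),
    ("classifier", {"INPUT": "preprocessed_image"}, {"SCORES": "CLASSIFICATION"}),
]

config_text = render_ensemble_config("ensemble_model", STEPS)
print(config_text)
```

The two steps are linked only by the shared ensemble tensor name `preprocessed_image`; neither step refers to the other model directly.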
Usage
Ensemble pipeline design is used when:
- Multiple models must be chained together for a single inference result (e.g., preprocessing → inference → postprocessing)
- Intermediate tensor transfer between models should avoid network round trips
- The client should see a single logical model rather than multiple composing models
- Custom DAG topologies are needed (fan-out, fan-in, shared intermediates)
Theoretical Basis
The ensemble pipeline is based on DAG scheduling:
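- Define nodes: each composing model (one step) is a node in the graph
- Define edges: tensor mappings (input_map / output_map) create directed edges between nodes; an edge exists wherever one step's output tensor is another step's input
- Topological ordering: Triton derives execution order from the tensor dependencies in the DAG; a step is scheduled once all of its input tensors are available
- Parallel execution: independent branches execute in parallel where possible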
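The scheduling idea in the list above can be sketched as a small dependency resolver. This is a simplified model written for illustration, not Triton's scheduler: it groups steps into "waves" whose inputs are all available, whereas a real scheduler can dispatch each step individually as soon as its tensors arrive. The function name `execution_waves` and every model and tensor name are hypothetical.

```python
def execution_waves(steps, ensemble_inputs):
    """Group steps into waves; every step in a wave has all inputs available.

    steps: list of (model_name, input_map, output_map), where the map values
    are ensemble tensor names. Steps within one wave are mutually independent
    and could run concurrently.
    """
    available = set(ensemble_inputs)
    pending = list(steps)
    waves = []
    while pending:
        ready, waiting = [], []
        for step in pending:
            needed = set(step[1].values())
            (ready if needed <= available else waiting).append(step)
        if not ready:
            # Nothing can run: a tensor is never produced, or there is a cycle.
            names = sorted(step[0] for step in waiting)
            raise ValueError(f"cannot schedule steps: {names}")
        waves.append(sorted(step[0] for step in ready))
        for step in ready:
            available.update(step[2].values())
        pending = waiting
    return waves


# Hypothetical fan topology: A -> [B, C] -> D.
FAN_STEPS = [
    ("D", {"IN0": "b_out", "IN1": "c_out"}, {"OUT": "response_out"}),
    ("B", {"IN": "a_out"}, {"OUT": "b_out"}),
    ("C", {"IN": "a_out"}, {"OUT": "c_out"}),
    ("A", {"IN": "request_in"}, {"OUT": "a_out"}),
]

waves = execution_waves(FAN_STEPS, ["request_in"])
```

The steps are deliberately listed out of order to show that the result depends on tensor availability, not on list position. B and C land in the same wave because neither consumes the other's output.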
Each step has:
- input_map: routes an ensemble tensor (the map value) into a composing-model input (the map key)
- output_map: routes a composing-model output (the map key) into an ensemble tensor (the map value)
Ensemble inputs feed into the first step(s) of the DAG, and ensemble outputs come from the last step(s). The scheduler resolves dependencies automatically and can execute independent steps concurrently.
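The relationship between ensemble inputs, ensemble outputs, and step mappings can be checked offline before a configuration is deployed. The sketch below is an independent illustration under stated assumptions, not a reproduction of any check Triton performs; the function `check_wiring` and all tensor names are hypothetical.

```python
def check_wiring(ensemble_inputs, ensemble_outputs, steps):
    """Return a list of human-readable wiring problems (empty if none found).

    steps: list of (model_name, input_map, output_map), where the map values
    are ensemble tensor names.
    """
    consumed, produced = set(), set()
    for _, input_map, output_map in steps:
        consumed.update(input_map.values())
        produced.update(output_map.values())

    problems = []
    # A consumed tensor must come from the request or from some step.
    for tensor in sorted(consumed - produced - set(ensemble_inputs)):
        problems.append(f"tensor '{tensor}' is consumed but never produced")
    # Every ensemble output must be written by some step.
    for tensor in sorted(set(ensemble_outputs) - produced):
        problems.append(f"ensemble output '{tensor}' is not produced by any step")
    # An ensemble input that no step reads is likely a naming mistake.
    for tensor in sorted(set(ensemble_inputs) - consumed):
        problems.append(f"ensemble input '{tensor}' is not consumed by any step")
    return problems


# Hypothetical chain with a misspelled intermediate tensor in the second step.
BROKEN_STEPS = [
    ("preprocess", {"RAW_IMAGE": "IMAGE"}, {"OUT": "preprocessed_image"}),
    ("classifier", {"INPUT": "preprocesed_image"}, {"SCORES": "CLASSIFICATION"}),
]

problems = check_wiring(["IMAGE"], ["CLASSIFICATION"], BROKEN_STEPS)
```

Because steps are connected only through tensor names, a single typo silently removes an edge from the DAG; a check of this kind surfaces it as a consumed-but-never-produced tensor.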
Source: docs/user_guide/ensemble_models.md:L60-123, qa/common/gen_ensemble_model_utils.py:L35-305