Workflow: Triton Inference Server Ensemble Model Pipeline
| Knowledge Sources | |
|---|---|
| Domains | ML_Ops, Model_Serving, Inference, Model_Pipelines |
| Last Updated | 2026-02-13 17:00 GMT |
Overview
End-to-end process for creating a multi-model inference pipeline using Triton's ensemble scheduler, connecting preprocessing, inference, and postprocessing models into a single request flow.
Description
This workflow covers the creation and deployment of ensemble model pipelines on Triton Inference Server. An ensemble model chains multiple models together, routing input and output tensors between steps without requiring intermediate data to leave the server. This eliminates network overhead for multi-stage pipelines such as data preprocessing followed by model inference followed by postprocessing. The ensemble scheduler orchestrates execution order and tensor routing based on the ensemble configuration. Each component model can use a different backend (e.g., Python for pre/postprocessing, TensorRT for inference).
Usage
Execute this workflow when your inference pipeline requires multiple processing stages (e.g., image decoding, feature extraction, classification, result formatting) and you want to reduce latency by running the entire pipeline server-side rather than making multiple round-trip client requests. Common use cases include image classification with preprocessing, NLP pipelines with tokenization, and any multi-model inference chain.
Execution Steps
Step 1: Design the pipeline architecture
Define the sequence of models in the pipeline and map the input/output tensor connections between them. Identify which models handle preprocessing, inference, and postprocessing. Determine the data types and shapes of tensors flowing between each stage.
Key considerations:
- Each step in the ensemble corresponds to a separate model in the repository
- The ensemble-scope tensor name (the value) in one step's output_map must match the value in a later step's input_map
- The ensemble model itself does not perform computation; it only routes data
- Consider which backend is most appropriate for each stage
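The routing described above can be sketched as a simple dataflow. The model and tensor names here (preprocess, classifier, postprocess, INPUT_IMAGE, CLASSIFICATION) are illustrative assumptions for an image-classification pipeline, not names required by Triton:

```
ensemble input   INPUT_IMAGE
       │
       ▼
[preprocess]     INPUT_IMAGE -> preprocessed_image
       │
       ▼
[classifier]     preprocessed_image -> raw_logits
       │
       ▼
[postprocess]    raw_logits -> CLASSIFICATION
       │
       ▼
ensemble output  CLASSIFICATION
```

Only INPUT_IMAGE and CLASSIFICATION are visible to clients; preprocessed_image and raw_logits exist solely inside the ensemble.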
Step 2: Prepare individual component models
Create each component model (preprocessing, inference, postprocessing) as a standalone Triton model with its own directory, version, and config.pbtxt. Each model should work independently before being wired into the ensemble. Python backend models are commonly used for pre/postprocessing logic.
Key considerations:
- Each component model must have its own config.pbtxt with correct input/output definitions
- Version directories and model files must follow the standard Triton repository layout
- Test each component model individually to verify correctness before creating the ensemble
- Python backend models implement the TritonPythonModel interface with initialize, execute, and finalize methods
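As a concrete sketch of the per-request logic a Python-backend preprocessing model might contain, the function below converts a batch of HWC uint8 images to normalized NCHW float32. Inside a real `TritonPythonModel.execute`, the input array would be unpacked from the request via `pb_utils` tensors; the normalization constants are illustrative (ImageNet values), not anything Triton mandates:

```python
import numpy as np

# Illustrative normalization constants (ImageNet mean/std), assumed here.
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(batch: np.ndarray) -> np.ndarray:
    """(N, H, W, 3) uint8 -> (N, 3, H, W) float32, normalized.

    This is the kind of logic an execute() method would run after
    extracting its input tensor as a numpy array.
    """
    x = batch.astype(np.float32) / 255.0
    x = (x - MEAN) / STD                       # broadcast over channel axis
    return np.ascontiguousarray(x.transpose(0, 3, 1, 2))
```

The output tensor name and dtype returned from `execute` must match what the ensemble configuration declares for this step.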
Step 3: Create the ensemble model configuration
Write the ensemble model's config.pbtxt with platform set to "ensemble" and define the ensemble_scheduling block. This block contains an ordered list of steps, where each step specifies a component model name, input_map (mapping ensemble/previous step outputs to this step's inputs), and output_map (mapping this step's outputs to names used by subsequent steps or the ensemble output).
Key considerations:
- Set platform to "ensemble" (not a backend name)
- The ensemble's top-level inputs and outputs define the external API
- Internal tensor names (prefixed with underscore by convention) connect steps
- max_batch_size of the ensemble must be compatible with all component models
- Stateful model characteristics (sequence batching) propagate through the ensemble
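A minimal config.pbtxt following this structure might look like the sketch below, assuming a two-step pipeline with hypothetical models named preprocess and classifier. In each input_map/output_map entry, the key is the component model's own tensor name and the value is the ensemble-scope name used for routing:

```
name: "ensemble_pipeline"
platform: "ensemble"
max_batch_size: 8
input [
  { name: "INPUT_IMAGE", data_type: TYPE_UINT8, dims: [ -1, -1, 3 ] }
]
output [
  { name: "CLASSIFICATION", data_type: TYPE_FP32, dims: [ 1000 ] }
]
ensemble_scheduling {
  step [
    {
      model_name: "preprocess"
      model_version: -1
      input_map  { key: "INPUT_IMAGE" value: "INPUT_IMAGE" }
      output_map { key: "preprocessed_image" value: "preprocessed_image" }
    },
    {
      model_name: "classifier"
      model_version: -1
      input_map  { key: "input__0" value: "preprocessed_image" }
      output_map { key: "output__0" value: "CLASSIFICATION" }
    }
  ]
}
```

Because "preprocessed_image" appears as an output_map value in step one and an input_map value in step two, the scheduler routes that tensor between the models; "CLASSIFICATION" matches a declared ensemble output, so it is returned to the client.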
Step 4: Deploy the ensemble pipeline
Place all component models and the ensemble model in the model repository and launch the Triton server. The server loads each component model independently and then loads the ensemble model, which validates the tensor routing configuration.
Key considerations:
- All component models must load successfully before the ensemble can be marked READY
- Check server logs for any tensor shape or type mismatches between connected steps
- The ensemble model appears as a regular model to external clients
- Component models can also be called directly if needed for debugging
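Assuming the hypothetical model names used above, the repository would follow the standard Triton layout. The ensemble model conventionally keeps an empty version directory, since it has no model file of its own:

```
model_repository/
├── preprocess/
│   ├── 1/
│   │   └── model.py         # Python backend
│   └── config.pbtxt
├── classifier/
│   ├── 1/
│   │   └── model.plan       # e.g., TensorRT engine
│   └── config.pbtxt
└── ensemble_pipeline/
    ├── 1/                   # empty; the ensemble only routes tensors
    └── config.pbtxt
```

The server is then launched pointing at this directory, e.g. `tritonserver --model-repository=/path/to/model_repository`.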
Step 5: Send requests to the ensemble
Send inference requests to the ensemble model name. The client only needs to provide the ensemble's declared inputs and receives the ensemble's declared outputs. The internal routing between component models is handled transparently by the ensemble scheduler.
Key considerations:
- Client requests use the ensemble model name, not individual component names
- Input tensors must match the ensemble's top-level input specification
- The ensemble scheduler handles all intermediate tensor transfers server-side
- Both HTTP and gRPC protocols are supported for ensemble requests
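To make the request shape concrete, the sketch below builds the URL path and JSON body of a KServe v2 HTTP inference request using only the standard library. The model and tensor names are the assumed ones from earlier steps; in practice the `tritonclient` package wraps this protocol for you:

```python
import json

def build_infer_request(model_name, input_name, output_name, batch):
    """Return (url_path, json_body) for a KServe v2 inference request.

    `batch` is a nested Python list representing a UINT8 tensor; the body
    would be POSTed to http://<server>:8000<url_path>.
    """
    def shape(x):
        dims = []
        while isinstance(x, list):
            dims.append(len(x))
            x = x[0]
        return dims

    def flatten(x):
        if not isinstance(x, list):
            return [x]
        return [v for row in x for v in flatten(row)]

    body = {
        "inputs": [{
            "name": input_name,        # must match an ensemble-declared input
            "shape": shape(batch),
            "datatype": "UINT8",
            "data": flatten(batch),    # row-major flattening
        }],
        "outputs": [{"name": output_name}],
    }
    return f"/v2/models/{model_name}/infer", json.dumps(body)
```

Note that only the ensemble's name appears in the path; the component models are never referenced by the client.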
Step 6: Validate end-to-end correctness
Verify the ensemble produces correct results by comparing outputs against a reference implementation that runs each pipeline stage sequentially on the client side. Check that intermediate tensor routing preserves data integrity across all steps.
Key considerations:
- Compare ensemble output against manually chaining individual model calls
- Test with various input sizes and batch sizes
- Verify that error conditions in component models propagate correctly through the ensemble
- Monitor metrics to identify any bottleneck stages in the pipeline
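The comparison described above can be sketched as follows. The three stage functions are illustrative stand-ins for direct calls to the component models (which Triton also exposes individually), not real model implementations:

```python
import numpy as np

def preprocess(x):
    return x.astype(np.float32) / 255.0

def infer(x):
    return x.sum(axis=-1, keepdims=True)   # placeholder for the real model

def postprocess(x):
    return x / x.max()

def reference_pipeline(x):
    """Client-side chaining of the individual stages."""
    return postprocess(infer(preprocess(x)))

rng = np.random.default_rng(0)
batch = rng.integers(0, 256, size=(4, 8, 8, 3), dtype=np.uint8)

# In a real check, `ensemble_out` would come from a single request to the
# ensemble model; here the reference stands in to show the comparison.
ensemble_out = reference_pipeline(batch)
assert np.allclose(ensemble_out, reference_pipeline(batch), rtol=1e-5)
```

A tolerance-based comparison (rather than exact equality) is usually appropriate, since float32 arithmetic may differ slightly between backends.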