Implementation:Triton inference server Server TritonPythonModel BLS
| Field | Value |
|---|---|
| Implementation Name | TritonPythonModel_BLS |
| Implements | Principle:Triton_inference_server_Server_Component_Model_Preparation |
| Domains | Model_Serving, Python_Backend, Pipeline_Architecture |
| Status | Active |
| Last Updated | 2026-02-13 17:00 GMT |
Overview
Concrete Python backend interface for implementing custom model logic with Business Logic Scripting (BLS) in Triton. BLS models use the TritonPythonModel class and the triton_python_backend_utils module to process requests and optionally invoke other models in-process.
Description
The TritonPythonModel class is the required interface for all Python backend models in Triton. When used with BLS, the execute() method can create pb_utils.InferenceRequest objects to invoke other models deployed on the same Triton instance, enabling custom orchestration logic without network overhead.
Key capabilities:
- Synchronous BLS: Call inference_request.exec() to invoke another model and block until the result is available
- Asynchronous BLS: Define async def execute() and call await inference_request.async_exec() for non-blocking invocation
- Tensor manipulation: Use pb_utils.Tensor to create tensors and pb_utils.get_input_tensor_by_name() / pb_utils.get_output_tensor_by_name() to extract tensors from requests/responses
- Error handling: Use pb_utils.TritonError to propagate errors back to the client
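The error-handling capability in the list above can be sketched as a small wrapper that turns an exception into an error response. This is a minimal sketch under stated assumptions, not the backend's own code: `respond_or_error` and its `handler` argument are hypothetical names, and `pb_utils` is passed in as a parameter only so the function can be exercised outside Triton. Inside a real model you would use the imported module directly.

```python
def respond_or_error(pb_utils, requests, handler):
    """Build one response per request, converting exceptions into TritonError.

    pb_utils: the triton_python_backend_utils module (injected for testability).
    handler:  hypothetical callable mapping a request to a list of output tensors.
    """
    responses = []
    for request in requests:
        try:
            output_tensors = handler(request)
            responses.append(
                pb_utils.InferenceResponse(output_tensors=output_tensors)
            )
        except Exception as exc:
            # An error response carries no tensors; the client sees the message.
            responses.append(
                pb_utils.InferenceResponse(
                    output_tensors=[], error=pb_utils.TritonError(str(exc))
                )
            )
    return responses
```

Keeping one response per request, even on failure, preserves the one-to-one ordering between the requests list and the returned list.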
Usage
This implementation is used when:
- Creating preprocessing or postprocessing models in Python for an ensemble
- Implementing custom business logic that orchestrates multiple model calls
- Building models that require data transformation between inference steps
- Wrapping external service calls or database lookups within an inference pipeline
Code Reference
Source Location
- docs/user_guide/bls.md:L48-97 - Synchronous BLS interface
- docs/user_guide/bls.md:L107-153 - Asynchronous BLS interface
Signature
import triton_python_backend_utils as pb_utils
import numpy as np


class TritonPythonModel:
    def initialize(self, args):
        """Called once at model load. args contains model config."""
        pass

    def execute(self, requests):
        """Process each inference request. Can invoke other models via BLS."""
        responses = []
        for request in requests:
            input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT")
            # BLS: invoke another model
            inference_request = pb_utils.InferenceRequest(
                model_name="other_model",
                requested_output_names=["OUTPUT"],
                inputs=[pb_utils.Tensor("INPUT", input_tensor.as_numpy())]
            )
            inference_response = inference_request.exec()
            output = pb_utils.get_output_tensor_by_name(inference_response, "OUTPUT")
            responses.append(pb_utils.InferenceResponse(output_tensors=[output]))
        return responses

    def finalize(self):
        """Called once at model unload."""
        pass
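The signature above reads the output tensor without checking whether the BLS call succeeded. A hedged sketch of a checked call follows. `run_bls` is a hypothetical helper name, and `pb_utils` is injected as a parameter purely so the logic can be run outside Triton. The sketch relies on `has_error()`, `error().message()`, and `pb_utils.TritonModelException` from the Python backend utilities.

```python
def run_bls(pb_utils, model_name, inputs, output_name):
    """Invoke another model synchronously and return one named output tensor."""
    inference_request = pb_utils.InferenceRequest(
        model_name=model_name,
        requested_output_names=[output_name],
        inputs=inputs,
    )
    inference_response = inference_request.exec()
    if inference_response.has_error():
        # Surface the downstream failure instead of reading a missing tensor.
        raise pb_utils.TritonModelException(inference_response.error().message())
    return pb_utils.get_output_tensor_by_name(inference_response, output_name)
```

Inside execute(), a call such as run_bls(pb_utils, "other_model", [tensor], "OUTPUT") would replace the exec() and get_output_tensor_by_name() pair.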
Import
import triton_python_backend_utils as pb_utils
import numpy as np
Key Parameters
BLS InferenceRequest parameters:
| Parameter | Type | Description |
|---|---|---|
| model_name | string | Name of the target model to invoke |
| requested_output_names | list[str] | List of output tensor names to request from the target model |
| inputs | list[pb_utils.Tensor] | List of input tensors to send to the target model |
| timeout | int (optional) | Timeout in microseconds for the BLS request |
| model_version | int (optional) | Specific version of the target model to invoke |
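Because the timeout is expressed in microseconds, a unit conversion is an easy place to slip. The sketch below assembles the keyword arguments from the table. `bls_request_kwargs` and its `timeout_seconds` argument are hypothetical names introduced for illustration; only the resulting keys come from the table above.

```python
def bls_request_kwargs(model_name, output_names, inputs,
                       timeout_seconds=None, model_version=None):
    """Assemble keyword arguments for pb_utils.InferenceRequest."""
    kwargs = {
        "model_name": model_name,
        "requested_output_names": list(output_names),
        "inputs": list(inputs),
    }
    if timeout_seconds is not None:
        # The parameter table gives the timeout unit as microseconds.
        kwargs["timeout"] = int(round(timeout_seconds * 1_000_000))
    if model_version is not None:
        kwargs["model_version"] = int(model_version)
    return kwargs
```

Within a model, the result would be unpacked as pb_utils.InferenceRequest(**bls_request_kwargs("other_model", ["OUTPUT"], [tensor], timeout_seconds=2.5)).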
I/O Contract
Inputs
| Input | Type | Description |
|---|---|---|
| Model file | model.py | Python file implementing the TritonPythonModel class, placed at model_name/1/model.py |
| Model config | config.pbtxt | Configuration file specifying backend, inputs, outputs, and instance settings |
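The two inputs map onto a directory layout in the model repository. The following sketch creates that layout with the standard library. `scaffold_python_model` is a hypothetical helper, the placeholder model body is illustrative only, and placing config.pbtxt directly under the model directory follows the usual Triton repository convention rather than anything stated in the table.

```python
from pathlib import Path


def scaffold_python_model(repo_root, model_name, config_text, version="1"):
    """Create <repo_root>/<model_name>/config.pbtxt and <version>/model.py."""
    model_dir = Path(repo_root) / model_name
    version_dir = model_dir / version
    version_dir.mkdir(parents=True, exist_ok=True)
    (model_dir / "config.pbtxt").write_text(config_text)
    model_file = version_dir / "model.py"
    if not model_file.exists():
        # Placeholder only; replace with a real TritonPythonModel implementation.
        model_file.write_text(
            "class TritonPythonModel:\n"
            "    def execute(self, requests):\n"
            "        raise NotImplementedError\n"
        )
    return model_file
```

An existing model.py is left untouched, so the helper can be rerun to refresh only the configuration file.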
Outputs
| Output | Type | Description |
|---|---|---|
| Deployed component model | Triton model | A loaded model ready to serve inference requests or participate in an ensemble |
Usage Examples
Synchronous preprocessing model (plain Python backend logic; this example makes no BLS call):
import triton_python_backend_utils as pb_utils
import numpy as np


class TritonPythonModel:
    def initialize(self, args):
        self.mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
        self.std = np.array([0.229, 0.224, 0.225], dtype=np.float32)

    def execute(self, requests):
        responses = []
        for request in requests:
            raw_input = pb_utils.get_input_tensor_by_name(request, "RAW_IMAGE")
            image = raw_input.as_numpy().astype(np.float32) / 255.0
            normalized = (image - self.mean) / self.std
            output_tensor = pb_utils.Tensor("PROCESSED_IMAGE", normalized)
            responses.append(pb_utils.InferenceResponse(output_tensors=[output_tensor]))
        return responses

    def finalize(self):
        pass
Asynchronous BLS model invoking another model:
import triton_python_backend_utils as pb_utils
import numpy as np


class TritonPythonModel:
    def initialize(self, args):
        pass

    async def execute(self, requests):
        responses = []
        for request in requests:
            input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT")
            inference_request = pb_utils.InferenceRequest(
                model_name="downstream_model",
                requested_output_names=["OUTPUT"],
                inputs=[pb_utils.Tensor("INPUT", input_tensor.as_numpy())]
            )
            inference_response = await inference_request.async_exec()
            output = pb_utils.get_output_tensor_by_name(inference_response, "OUTPUT")
            responses.append(pb_utils.InferenceResponse(output_tensors=[output]))
        return responses

    def finalize(self):
        pass
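The asynchronous example awaits each downstream call inside the loop, so the calls still run one after another. When the requests are independent, they can be started together and awaited as a group. The sketch below shows that pattern with asyncio.gather. `exec_concurrently` is a hypothetical helper name, and it assumes only that each object exposes the async_exec() coroutine used above.

```python
import asyncio


async def exec_concurrently(inference_requests):
    """Start every BLS request, then wait for all responses together."""
    pending = [request.async_exec() for request in inference_requests]
    # gather() returns the responses in the same order as the requests.
    return await asyncio.gather(*pending)
```

Inside async def execute(), this would be used as inference_responses = await exec_concurrently(inference_requests), after which each response is converted into a pb_utils.InferenceResponse as before.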
Component model config.pbtxt:
name: "preprocess"
backend: "python"
max_batch_size: 8
input [
  { name: "RAW_IMAGE", data_type: TYPE_UINT8, dims: [ 224, 224, 3 ] }
]
output [
  { name: "PROCESSED_IMAGE", data_type: TYPE_FP32, dims: [ 224, 224, 3 ] }
]
instance_group [
  { count: 1, kind: KIND_CPU }
]
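At load time this configuration reaches initialize() as a JSON string under args["model_config"]. A minimal sketch of reading an output's data type from it follows. `output_data_type` is a hypothetical helper written with the standard library so it runs outside Triton, and the sample JSON in the usage note is hand-written to mirror the config above rather than captured from a server. Inside a model, pb_utils.get_output_config_by_name() serves the same lookup.

```python
import json


def output_data_type(args, output_name):
    """Look up an output's data_type string from the initialize() args."""
    model_config = json.loads(args["model_config"])
    for output in model_config.get("output", []):
        if output["name"] == output_name:
            return output["data_type"]
    raise KeyError(f"output {output_name!r} not found in model config")
```

For example, with args = {"model_config": '{"output": [{"name": "PROCESSED_IMAGE", "data_type": "TYPE_FP32"}]}'}, the call output_data_type(args, "PROCESSED_IMAGE") yields the string "TYPE_FP32".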