Implementation:Triton inference server Server TritonPythonModel BLS

Field	Value
Implementation Name	TritonPythonModel_BLS
Implements	Principle:Triton_inference_server_Server_Component_Model_Preparation
Domains	Model_Serving, Python_Backend, Pipeline_Architecture
Status	Active
Last Updated	2026-02-13 17:00 GMT

Overview

Concrete Python backend interface for implementing custom model logic with Business Logic Scripting (BLS) in Triton. BLS models use the TritonPythonModel class and the triton_python_backend_utils module to process requests and optionally invoke other models in-process.

Description

The TritonPythonModel class is the required interface for all Python backend models in Triton. When used with BLS, the execute() method can create pb_utils.InferenceRequest objects to invoke other models deployed on the same Triton instance, enabling custom orchestration logic without network overhead.

Key capabilities:

Synchronous BLS: call inference_request.exec() to invoke another model and block until the result is available
Asynchronous BLS: define async def execute() and call await inference_request.async_exec() so that waiting on the other model does not block
Tensor manipulation: use pb_utils.Tensor to create tensors, pb_utils.get_input_tensor_by_name() to extract tensors from requests, and pb_utils.get_output_tensor_by_name() to extract tensors from BLS responses
Error handling: check inference_response.has_error() after a BLS call and attach a pb_utils.TritonError to the InferenceResponse to propagate the failure back to the client

Usage

This implementation is used when:

Creating preprocessing or postprocessing models in Python for an ensemble
Implementing custom business logic that orchestrates multiple model calls
Building models that require data transformation between inference steps
Wrapping external service calls or database lookups within an inference pipeline

Code Reference

Source Location

docs/user_guide/bls.md:L48-97 — Synchronous BLS interface
docs/user_guide/bls.md:L107-153 — Asynchronous BLS interface

Signature

import triton_python_backend_utils as pb_utils
import numpy as np

class TritonPythonModel:
    def initialize(self, args):
        """Called once at model load. args is a dict; args["model_config"] holds the model config as a JSON string."""
        pass

    def execute(self, requests):
        """Process each inference request. Can invoke other models via BLS."""
        responses = []
        for request in requests:
            input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT")

            # BLS: invoke another model
            inference_request = pb_utils.InferenceRequest(
                model_name="other_model",
                requested_output_names=["OUTPUT"],
                inputs=[pb_utils.Tensor("INPUT", input_tensor.as_numpy())]
            )
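            inference_response = inference_request.exec()

            # A failed BLS call carries an error instead of output tensors
            if inference_response.has_error():
                error = pb_utils.TritonError(inference_response.error().message())
                responses.append(pb_utils.InferenceResponse(output_tensors=[], error=error))
                continue

            output = pb_utils.get_output_tensor_by_name(inference_response, "OUTPUT")
            responses.append(pb_utils.InferenceResponse(output_tensors=[output]))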
        return responses

    def finalize(self):
        """Called once at model unload."""
        pass

Import

import triton_python_backend_utils as pb_utils
import numpy as np

Key Parameters

BLS InferenceRequest parameters:

Parameter	Type	Description
model_name	string	Name of the target model to invoke
requested_output_names	list[str]	List of output tensor names to request from the target model
inputs	list[pb_utils.Tensor]	List of input tensors to send to the target model
timeout	int (optional)	Timeout in microseconds for the BLS request
model_version	int (optional)	Specific version of the target model to invoke
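
Because timeout is expressed in microseconds, it can help to make the unit explicit at the call site. The following is a minimal sketch, not part of the Triton API: bls_request_kwargs is a hypothetical helper that only assembles the keyword arguments listed in the table, so that a model could pass them on as pb_utils.InferenceRequest(**kwargs). The seconds-based timeout_s argument is an assumption made for illustration.

```python
def bls_request_kwargs(model_name, requested_output_names, inputs,
                       timeout_s=None, model_version=None):
    """Build keyword arguments for pb_utils.InferenceRequest (hypothetical helper)."""
    kwargs = {
        "model_name": model_name,
        "requested_output_names": list(requested_output_names),
        "inputs": list(inputs),
    }
    if timeout_s is not None:
        # The table lists timeout in microseconds; convert from seconds here.
        kwargs["timeout"] = int(round(timeout_s * 1_000_000))
    if model_version is not None:
        kwargs["model_version"] = int(model_version)
    return kwargs


# inputs would normally hold pb_utils.Tensor objects; left empty for illustration.
kwargs = bls_request_kwargs("other_model", ["OUTPUT"], [], timeout_s=0.5, model_version=2)
print(kwargs["timeout"])  # 500000
```

Optional parameters are left out of the dictionary when not supplied, so the target call falls back to its own defaults.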

I/O Contract

Inputs

Input	Type	Description
Model file	model.py	Python file implementing the TritonPythonModel class, placed in a numeric version directory, for example model_name/1/model.py
Model config	config.pbtxt	Configuration file specifying backend, inputs, outputs, and instance settings, placed at model_name/config.pbtxt

Outputs

Output	Type	Description
Deployed component model	Triton model	A loaded model ready to serve inference requests or participate in an ensemble
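
The file placement described under Inputs can be sketched as a small script. This is an illustrative helper only: write_model_repo, the temporary directory, and the placeholder file contents are assumptions, and the real model.py and config.pbtxt would come from the examples on this page.

```python
import tempfile
from pathlib import Path


def write_model_repo(repo_root, model_name, model_py, config_pbtxt, version="1"):
    """Create <repo_root>/<model_name>/config.pbtxt and <version>/model.py."""
    model_dir = Path(repo_root) / model_name
    version_dir = model_dir / version
    version_dir.mkdir(parents=True, exist_ok=True)
    (version_dir / "model.py").write_text(model_py)
    (model_dir / "config.pbtxt").write_text(config_pbtxt)
    return model_dir


with tempfile.TemporaryDirectory() as repo_root:
    write_model_repo(
        repo_root,
        "preprocess",
        model_py="class TritonPythonModel:\n    pass\n",  # placeholder body
        config_pbtxt='name: "preprocess"\nbackend: "python"\n',  # placeholder config
    )
    # Collect the files that were written, relative to the repository root.
    layout = sorted(
        p.relative_to(repo_root).as_posix()
        for p in Path(repo_root).rglob("*")
        if p.is_file()
    )

print(layout)  # ['preprocess/1/model.py', 'preprocess/config.pbtxt']
```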

Usage Examples

Preprocessing model (plain execute(), no BLS call):

import triton_python_backend_utils as pb_utils
import numpy as np

class TritonPythonModel:
    def initialize(self, args):
        # Per-channel mean and std (the commonly used ImageNet RGB statistics)
        self.mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
        self.std = np.array([0.229, 0.224, 0.225], dtype=np.float32)

    def execute(self, requests):
        responses = []
        for request in requests:
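            raw_input = pb_utils.get_input_tensor_by_name(request, "RAW_IMAGE")
            # Scale uint8 pixels from [0, 255] to float32 values in [0, 1]
            image = raw_input.as_numpy().astype(np.float32) / 255.0
            # mean and std broadcast over the trailing channel axis
            normalized = (image - self.mean) / self.std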

            output_tensor = pb_utils.Tensor("PROCESSED_IMAGE", normalized)
            responses.append(pb_utils.InferenceResponse(output_tensors=[output_tensor]))
        return responses

    def finalize(self):
        pass
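
The normalization in the preprocessing model relies on NumPy broadcasting: when batching is enabled, the input array carries a leading batch dimension, and the three-element mean and std apply along the last (channel) axis. A minimal check of that arithmetic outside Triton, using a small hypothetical batch rather than real 224x224 images:

```python
import numpy as np

mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
std = np.array([0.229, 0.224, 0.225], dtype=np.float32)

# Hypothetical batch: 2 all-white images of 4x4 pixels with 3 channels (NHWC).
batch = np.full((2, 4, 4, 3), 255, dtype=np.uint8)

image = batch.astype(np.float32) / 255.0
normalized = (image - mean) / std  # mean/std broadcast over the last axis

print(normalized.shape, normalized.dtype)  # (2, 4, 4, 3) float32
```

The output keeps the input shape and is float32, which matches the TYPE_FP32 output declared for PROCESSED_IMAGE.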

Asynchronous BLS model invoking another model:

import triton_python_backend_utils as pb_utils
import numpy as np

class TritonPythonModel:
    def initialize(self, args):
        pass

    async def execute(self, requests):
        responses = []
        for request in requests:
            input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT")

            inference_request = pb_utils.InferenceRequest(
                model_name="downstream_model",
                requested_output_names=["OUTPUT"],
                inputs=[pb_utils.Tensor("INPUT", input_tensor.as_numpy())]
            )
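            inference_response = await inference_request.async_exec()

            # A failed BLS call carries an error instead of output tensors
            if inference_response.has_error():
                error = pb_utils.TritonError(inference_response.error().message())
                responses.append(pb_utils.InferenceResponse(output_tensors=[], error=error))
                continue

            output = pb_utils.get_output_tensor_by_name(inference_response, "OUTPUT")
            responses.append(pb_utils.InferenceResponse(output_tensors=[output]))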
        return responses

    def finalize(self):
        pass
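
The asynchronous example awaits each async_exec() call before issuing the next one. When several independent models must be invoked, the awaitables can instead be collected and awaited together with asyncio.gather. The sketch below shows only that pattern and runs outside Triton: FakeInferenceRequest is a hypothetical stand-in that mimics nothing but async_exec(), and the model names and delays are made up.

```python
import asyncio


class FakeInferenceRequest:
    """Stand-in for pb_utils.InferenceRequest; mimics only async_exec()."""

    def __init__(self, model_name, delay_s):
        self.model_name = model_name
        self.delay_s = delay_s

    async def async_exec(self):
        await asyncio.sleep(self.delay_s)  # simulates the downstream model's latency
        return f"{self.model_name}: done"


async def fan_out(inference_requests):
    # Collect the awaitables first, then let asyncio.gather run them concurrently.
    pending = [req.async_exec() for req in inference_requests]
    return await asyncio.gather(*pending)


results = asyncio.run(
    fan_out([FakeInferenceRequest("model_a", 0.02), FakeInferenceRequest("model_b", 0.01)])
)
print(results)  # ['model_a: done', 'model_b: done']
```

asyncio.gather returns results in the order the awaitables were passed, not the order in which they finish, so each response can be matched back to its request by position.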

Component model config.pbtxt:

name: "preprocess"
backend: "python"
max_batch_size: 8

input [
  { name: "RAW_IMAGE", data_type: TYPE_UINT8, dims: [ 224, 224, 3 ] }
]
output [
  { name: "PROCESSED_IMAGE", data_type: TYPE_FP32, dims: [ 224, 224, 3 ] }
]

instance_group [
  { count: 1, kind: KIND_CPU }
]
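
At load time, this configuration reaches initialize() as a JSON string in args["model_config"]. The sketch below shows one way to read an output's declared type and shape from it using only the standard library. The args dictionary is a hand-written, hypothetical rendering of the config above (the exact JSON encoding, such as whether dims arrive as numbers or strings, is an assumption), and output_spec is an illustrative helper. Inside Triton, pb_utils.get_output_config_by_name offers a similar lookup.

```python
import json

# Hypothetical JSON rendering of the config above, as initialize() might receive it.
args = {
    "model_config": json.dumps({
        "name": "preprocess",
        "backend": "python",
        "max_batch_size": 8,
        "input": [
            {"name": "RAW_IMAGE", "data_type": "TYPE_UINT8", "dims": ["224", "224", "3"]}
        ],
        "output": [
            {"name": "PROCESSED_IMAGE", "data_type": "TYPE_FP32", "dims": ["224", "224", "3"]}
        ],
    })
}


def output_spec(model_config_json, name):
    """Return (data_type, dims) for the named output, or raise KeyError."""
    config = json.loads(model_config_json)
    for output in config.get("output", []):
        if output["name"] == name:
            # int() accepts both 224 and "224", so either JSON encoding works.
            return output["data_type"], [int(d) for d in output["dims"]]
    raise KeyError(name)


print(output_spec(args["model_config"], "PROCESSED_IMAGE"))
# ('TYPE_FP32', [224, 224, 3])
```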

Related Pages

implements::Principle:Triton_inference_server_Server_Component_Model_Preparation
