Environment: BentoML Triton Inference Server
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, ML_Serving |
| Last Updated | 2026-02-13 16:00 GMT |
Overview
Optional NVIDIA Triton Inference Server integration for high-performance model serving within BentoML; requires the `tritonserver` binary on PATH and `tritonclient` >= 2.29.0.
Description
BentoML integrates with NVIDIA Triton Inference Server as an alternative runner backend. The Triton integration allows models to be served through Triton's optimized inference pipeline while being orchestrated by BentoML's service layer. The `tritonserver` binary must be available on the system PATH. This is typically achieved by using the official NVIDIA Triton container image as a base image. The Python client library `tritonclient` is required for communication between BentoML and the Triton server instances.
Usage
Use this environment when serving models through NVIDIA Triton Inference Server for optimized inference with features like dynamic batching, model ensembles, and multi-framework support. Required when using `bentoml.triton` integration or when a runner is configured as a Triton runner.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux | Triton is Linux-only; macOS excluded for tritonclient[all] |
| Hardware | NVIDIA GPU (recommended) | Triton supports CPU mode but is optimized for GPU |
| Binary | `tritonserver` on PATH | Use NVIDIA NGC container image as base |
Dependencies
System Packages
- `tritonserver` binary (from NVIDIA NGC container or manual install)
Python Packages
- `tritonclient` >= 2.29.0
- `tritonclient[all]` (on Linux; excluded on macOS via `sys_platform != 'darwin'`)
Credentials
No specific credentials required. Access to NVIDIA NGC container registry may require an NGC API key for pulling the Triton container image.
Quick Install
```shell
# Install the tritonclient Python package
pip install "bentoml[triton]"

# The tritonserver binary comes from the NGC container image:
docker pull nvcr.io/nvidia/tritonserver:24.01-py3
```
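When containerizing, the recommended way to satisfy the binary requirement is to build the Bento on top of the NGC Triton image. A minimal `bentofile.yaml` sketch (the image tag is an example; pick one whose Triton version matches your installed `tritonclient`):

```yaml
docker:
  base_image: nvcr.io/nvidia/tritonserver:24.01-py3
python:
  packages:
    - bentoml[triton]
```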
Code Evidence
Triton binary detection from `serving.py:196-202`:
```python
import shutil

def find_triton_binary():
    binary = shutil.which("tritonserver")
    if binary is None:
        raise RuntimeError(
            "'tritonserver' is not found on PATH. Make sure to include the compiled "
            "binary in PATH to proceed.\nIf you are running this inside a container, "
            "make sure to use the official Triton container image as a 'base_image'. "
            "See https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver."
        )
    return binary
```
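The same `shutil.which` pattern generalizes to checking for any required binary. A small runnable sketch (the helper name `require_binary` is ours, not BentoML's):

```python
import shutil

def require_binary(name: str) -> str:
    """Return the absolute path of `name`, or raise with a helpful message."""
    path = shutil.which(name)
    if path is None:
        raise RuntimeError(f"'{name}' is not found on PATH.")
    return path

# 'sh' exists on any POSIX system; a made-up name does not.
print(require_binary("sh"))
try:
    require_binary("definitely-not-a-real-binary")
except RuntimeError as exc:
    print(exc)
```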
Optional dependency declaration from `pyproject.toml:93`:
```toml
triton = ["tritonclient>=2.29.0", "tritonclient[all]; sys_platform != 'darwin'"]
```
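The `sys_platform != 'darwin'` environment marker can be evaluated directly with the `packaging` library, which is how pip decides whether to install the extra. A sketch, assuming `packaging` is available (it normally ships alongside pip):

```python
from packaging.markers import Marker

marker = Marker("sys_platform != 'darwin'")

# Evaluate against explicit environments rather than the current interpreter.
print(marker.evaluate({"sys_platform": "linux"}))   # True: extra installed on Linux
print(marker.evaluate({"sys_platform": "darwin"}))  # False: skipped on macOS
```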
Triton runner integration in serving from `serving.py:402-422`:
```python
else:
    # Make sure that the tritonserver uses the correct protocol
    runner_bind_map[runner.name] = runner.protocol_address
    cli_args = runner.cli_args + [
        f"--http-port={runner.protocol_address.split(':')[-1]}"
        if runner.tritonserver_type == "http"
        else f"--grpc-port={runner.protocol_address.split(':')[-1]}"
    ]
    watchers.append(
        create_watcher(
            name=f"tritonserver_{runner.name}",
            cmd=find_triton_binary(),
            args=cli_args,
            use_sockets=False,
            working_dir=working_dir,
            numprocesses=1,
            env=env,
        )
    )
```
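The port-flag selection in that excerpt can be sketched in isolation. Here `protocol_address` and `tritonserver_type` stand in for the runner attributes; the helper name is ours:

```python
def triton_port_flag(protocol_address: str, tritonserver_type: str) -> str:
    """Build the tritonserver CLI flag that pins it to the runner's port."""
    port = protocol_address.split(":")[-1]
    flag = "--http-port" if tritonserver_type == "http" else "--grpc-port"
    return f"{flag}={port}"

print(triton_port_flag("127.0.0.1:8000", "http"))   # --http-port=8000
print(triton_port_flag("127.0.0.1:8001", "grpc"))   # --grpc-port=8001
```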
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `RuntimeError: 'tritonserver' is not found on PATH` | Triton binary not installed or not in PATH | Use NVIDIA NGC Triton container as base image, or install tritonserver manually |
| `ImportError: tritonclient` | tritonclient package not installed | `pip install "bentoml[triton]"` |
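A defensive import check along these lines can turn the bare `ImportError` into an actionable message. This is a sketch, not BentoML's actual error handling:

```python
def check_tritonclient() -> str:
    """Report whether tritonclient is importable, with a fix hint if not."""
    try:
        import tritonclient  # noqa: F401
    except ImportError:
        return 'tritonclient is missing; run: pip install "bentoml[triton]"'
    return "tritonclient is available"

print(check_tritonclient())
```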
Compatibility Notes
- macOS: `tritonclient[all]` is excluded on macOS (`sys_platform != 'darwin'`). Only the base `tritonclient` is available.
- Container usage: The recommended approach is to use the official NVIDIA Triton container image as the `base_image` in your BentoML Image configuration.
- Protocol: Triton runners can use either HTTP or gRPC protocol for communication with the BentoML API server.