Implementation:Bentoml BentoML Framework ONNX
| Knowledge Sources | |
|---|---|
| Domains | ML Framework, Model Interoperability, Model Serialization |
| Last Updated | 2026-02-13 15:00 GMT |
Overview
The bentoml.onnx module provides BentoML integration for ONNX (Open Neural Network Exchange) models, so that they can be saved, loaded, and served through ONNX Runtime inference sessions.
Description
This module implements the BentoML framework adapter for ONNX models. Models are saved as .onnx protobuf files and loaded using onnxruntime.InferenceSession for high-performance inference.
ModelOptions (attrs dataclass) includes:
- input_specs: Dictionary mapping method names to input specifications (extracted from the ONNX graph).
- output_specs: Dictionary mapping method names to output specifications.
- providers: ONNX Runtime execution providers (e.g., CUDAExecutionProvider, CPUExecutionProvider).
- session_options: ONNX Runtime session configuration.
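The shape of these options can be pictured with a small stand-in class. This is only a sketch: it uses the standard-library dataclasses module rather than attrs, the class name OnnxOptionsSketch is hypothetical, and the layout of each spec entry (a dict with "name" and "type" keys) is an assumption made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class OnnxOptionsSketch:
    """Hypothetical stand-in for the module's attrs-based options class."""

    # Method name -> per-input specs read from the ONNX graph (entry layout assumed).
    input_specs: dict[str, list[dict[str, Any]]] = field(default_factory=dict)
    # Method name -> per-output specs read from the ONNX graph (entry layout assumed).
    output_specs: dict[str, list[dict[str, Any]]] = field(default_factory=dict)
    # Execution providers in priority order; None means "fall back to the default".
    providers: list[str] | None = None
    # Would hold an onnxruntime.SessionOptions object; typed loosely here.
    session_options: Any = None


opts = OnnxOptionsSketch(
    input_specs={"run": [{"name": "x", "type": "tensor(float)"}]},
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
```

Listing the GPU provider first and the CPU provider second mirrors the usual ONNX Runtime convention of ordering providers by priority.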
Key implementation details:
- save_model(): Saves an onnx.ModelProto to the model store. Automatically extracts input/output specifications from the ONNX graph. Only the run method name is allowed in signatures since ONNX Runtime uses InferenceSession.run().
- load_model(): Creates an ort.InferenceSession with configurable providers. Defaults to CPUExecutionProvider if no providers are specified.
- get_runnable(): Creates an ONNXRunnable that handles GPU/CPU provider selection, CPU parallelization configuration (intra/inter op thread counts), automatic input type casting via generated casting functions, and output tuple handling for multi-output models. Includes backward compatibility for v1 API models.
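The two defaulting rules described for save_model() and load_model() can be sketched in plain Python. The helper names validate_signatures and resolve_providers are hypothetical and the error message is invented; the sketch only mirrors the behaviour stated above and is not the module's actual code.

```python
from __future__ import annotations

DEFAULT_SIGNATURES = {"run": {"batchable": False}}
DEFAULT_PROVIDERS = ["CPUExecutionProvider"]


def validate_signatures(signatures: dict | None) -> dict:
    """Return the signatures to store, rejecting any method name other than "run"."""
    if signatures is None:
        return dict(DEFAULT_SIGNATURES)
    unexpected = sorted(set(signatures) - {"run"})
    if unexpected:
        raise ValueError(f'only "run" is allowed in signatures, got {unexpected}')
    return signatures


def resolve_providers(providers: list[str] | None) -> list[str]:
    """Fall back to the CPU provider when the caller supplies none."""
    if providers:
        return list(providers)
    return list(DEFAULT_PROVIDERS)
```

For example, validate_signatures({"predict": {}}) raises ValueError in this sketch, while resolve_providers(None) returns the CPU-only list.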
Usage
Use this module to save and serve ONNX-format models (exported from PyTorch, TensorFlow, scikit-learn, etc.) within BentoML services. It is well suited to cross-framework model deployment.
Code Reference
Source Location
- Repository: Bentoml_BentoML
- File: src/bentoml/_internal/frameworks/onnx.py
- Lines: 1-444
Signature
def get(tag_like: str | Tag) -> bentoml.Model: ...
def load_model(
    bento_model: str | Tag | bentoml.Model,
    *,
    providers: ProvidersType | None = None,
    session_options: ort.SessionOptions | None = None,
) -> ort.InferenceSession: ...
def save_model(
    name: Tag | str,
    model: onnx.ModelProto,
    *,
    signatures: dict | None = None,
    labels: dict[str, str] | None = None,
    custom_objects: dict[str, Any] | None = None,
    external_modules: List[ModuleType] | None = None,
    metadata: dict[str, Any] | None = None,
) -> bentoml.Model: ...
def get_runnable(bento_model: bentoml.Model) -> type[bentoml.legacy.Runnable]: ...
Import
import bentoml
# Via public API
model = bentoml.onnx.save_model(...)
session = bentoml.onnx.load_model(...)
I/O Contract
Inputs
save_model()
| Name | Type | Required | Description |
|---|---|---|---|
| name | Tag or str | Yes | Name/tag for the model in the BentoML store |
| model | onnx.ModelProto | Yes | The ONNX model protobuf to save |
| signatures | dict or None | No | Inference method signatures (default: {"run": {"batchable": False}}). Only "run" is allowed. |
| labels | dict[str, str] or None | No | User-defined labels for model management |
| custom_objects | dict[str, Any] or None | No | Additional objects to serialize |
| external_modules | List[ModuleType] or None | No | Additional Python modules to save alongside |
| metadata | dict[str, Any] or None | No | Custom metadata for the model |
load_model()
| Name | Type | Required | Description |
|---|---|---|---|
| bento_model | str, Tag, or Model | Yes | Tag or Model instance to load from the store |
| providers | ProvidersType or None | No | ONNX Runtime execution providers (default: ["CPUExecutionProvider"]) |
| session_options | ort.SessionOptions or None | No | ONNX Runtime session configuration |
Outputs
| Method | Return Type | Description |
|---|---|---|
| save_model() | bentoml.Model | A BentoML Model containing the saved ONNX model |
| load_model() | ort.InferenceSession | An ONNX Runtime inference session |
| get() | bentoml.Model | The BentoML Model reference from the store |
| get_runnable() | type[Runnable] | An ONNXRunnable class with automatic input casting and provider selection |
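The input casting and output handling attributed to the runnable can be illustrated with a toy wrapper. Everything here is a simplified assumption: FakeSession stands in for an inference session, the CASTERS table and spec layout are hypothetical, values are flat Python lists rather than arrays, and returning a bare value for single-output models is one plausible reading of "output tuple handling", not confirmed behaviour.

```python
from __future__ import annotations

from typing import Any, Callable

# Hypothetical mapping from ONNX type strings to Python casting functions.
CASTERS: dict[str, Callable[[Any], Any]] = {
    "tensor(float)": float,
    "tensor(int64)": int,
}


def make_caster(spec: dict[str, str]) -> Callable[[list], list]:
    """Generate a casting function for one input spec."""
    cast = CASTERS[spec["type"]]

    def casting(values: list) -> list:
        return [cast(v) for v in values]

    return casting


class FakeSession:
    """Stand-in session: output i is the input "x" scaled by (i + 1)."""

    def __init__(self, n_outputs: int) -> None:
        self.n_outputs = n_outputs

    def run(self, output_names: Any, feeds: dict[str, list]) -> list[list]:
        x = feeds["x"]
        return [[v * (i + 1) for v in x] for i in range(self.n_outputs)]


def run_with_casting(session: FakeSession, input_specs: list[dict[str, str]], *args: list):
    """Cast positional inputs per spec, run, and return a tuple only for multiple outputs."""
    feeds = {
        spec["name"]: make_caster(spec)(arg)
        for spec, arg in zip(input_specs, args)
    }
    outs = session.run(None, feeds)
    return tuple(outs) if len(outs) > 1 else outs[0]


specs = [{"name": "x", "type": "tensor(float)"}]
single = run_with_casting(FakeSession(1), specs, [1, 2])
multi = run_with_casting(FakeSession(2), specs, [1, 2])
```

In this sketch the integer inputs are cast to floats before reaching the session, single holds one list, and multi holds a tuple with one list per output.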
Usage Examples
import bentoml
import numpy as np
import onnx
import torch
import torch.nn as nn

# Export a PyTorch model to ONNX
class SimpleModel(nn.Module):
    def __init__(self, D_in, H, D_out):
        super().__init__()
        self.linear1 = nn.Linear(D_in, H)
        self.linear2 = nn.Linear(H, D_out)

    def forward(self, x):
        # Linear -> ReLU (clamp at zero) -> Linear
        return self.linear2(self.linear1(x).clamp(min=0))

model = SimpleModel(1000, 100, 1)
x = torch.randn(64, 1000)
# Mark the batch dimension as dynamic so the exported graph accepts
# batch sizes other than the 64 used for tracing
torch.onnx.export(
    model, x, "/tmp/model.onnx",
    input_names=["x"], output_names=["output"],
    dynamic_axes={"x": {0: "batch_size"}, "output": {0: "batch_size"}},
)

# Save ONNX model to BentoML
onnx_model = onnx.load("/tmp/model.onnx")
bento_model = bentoml.onnx.save_model(
    "onnx_model", onnx_model,
    signatures={"run": {"batchable": True}},
)

# Load and run inference (passing None as output names returns all outputs)
sess = bentoml.onnx.load_model("onnx_model:latest")
result = sess.run(None, {"x": np.random.randn(1, 1000).astype(np.float32)})