Workflow: BentoML Service Definition and Local Serving
| Knowledge Sources | |
|---|---|
| Domains | ML_Serving, API_Development, Model_Inference |
| Last Updated | 2026-02-13 15:00 GMT |
Overview
End-to-end process for defining a BentoML Service class that wraps an ML model and serving it locally as a production-grade HTTP API.
Description
This workflow covers the foundational BentoML use case: transforming a trained ML model into a deployable inference API using the Python-native Service abstraction. It starts with installing BentoML and the ML framework dependencies your model requires, proceeds through defining a Service class with the @bentoml.service decorator and annotating inference methods with @bentoml.api, and concludes with running the service locally via bentoml serve. The resulting server exposes REST API endpoints with automatic request validation, OpenAPI documentation, and a Swagger UI.
Key capabilities covered:
- Service class definition with resource configuration
- API endpoint declaration with type-safe input/output
- HuggingFace model loading via HuggingFaceModel
- Local HTTP serving with auto-reload support
- Client-side invocation via SyncHTTPClient
Usage
Execute this workflow when you have a trained ML model (from any framework such as PyTorch, TensorFlow, scikit-learn, or Transformers) and need to expose it as a REST API for development, testing, or local inference. This is the starting point for all BentoML projects and the prerequisite for packaging, containerization, and cloud deployment.
Execution Steps
Step 1: Environment Setup
Install BentoML and the ML framework dependencies required by your model. BentoML requires Python 3.9 or higher. Create and activate a virtual environment, then install BentoML alongside your model's dependencies (e.g., torch, transformers, scikit-learn).
Key considerations:
- Use a virtual environment for dependency isolation
- BentoML supports Python 3.9+
- Install ML framework packages that your model requires for inference
Step 2: Define the Service Class
Create a service.py file containing a Python class decorated with @bentoml.service. The decorator transforms a regular Python class into a BentoML Service with lifecycle management, resource configuration, and serving capabilities. Configure service-level settings such as resource requirements (CPU, memory, GPU), traffic timeouts, and worker counts through decorator parameters.
Key considerations:
- The class name determines the service name (auto-lowercased for the endpoint)
- Resource and traffic configurations are specified as decorator keyword arguments
- The image parameter can define the runtime environment (Python version, packages) for later containerization
- Each Service class manages its own lifecycle and state
Step 3: Load the Model
Within the Service class constructor (__init__), load the ML model into memory. BentoML provides bentoml.models.HuggingFaceModel for loading models from HuggingFace Hub and bentoml.models.BentoModel for loading from the local Model Store. The model reference must be declared as a class-level attribute so BentoML can track it as a dependency.
Key considerations:
- Declare model references at the class level (not inside __init__) for proper dependency tracking
- HuggingFaceModel returns the downloaded model path as a string
- BentoModel loads from the local Model Store using tag-based versioning
- Models are loaded once during service initialization and reused across requests
Step 4: Define API Endpoints
Annotate inference methods with the @bentoml.api decorator. Each decorated method becomes an HTTP endpoint. BentoML uses Python type hints to automatically generate request/response schemas, input validation, and OpenAPI documentation. Configure batching behavior, custom routes, and input/output specifications through decorator parameters.
Key considerations:
- Method parameters become the request schema (uses Python type hints)
- Return type annotations define the response format
- The batchable flag enables adaptive batching for throughput optimization
- Custom routes can override the default method-name-based URL path
- Both synchronous and asynchronous (async/await) methods are supported
Step 5: Serve Locally
Run bentoml serve from the directory containing the service.py file. This starts a production-grade HTTP server (based on Uvicorn/Starlette) that exposes all defined API endpoints. The server includes a built-in Swagger UI for interactive testing, health check endpoints, and optional hot-reload during development.
Key considerations:
- Default address is http://localhost:3000
- Use --reload flag for auto-reload during development
- The server exposes /docs for Swagger UI and /healthz for health checks
- Specify the service module and class using the bentoml serve module:ClassName syntax
- Development mode can be enabled with the --development flag
Step 6: Test the API
Invoke the running service using the built-in BentoML client, curl commands, or the Swagger UI. BentoML provides SyncHTTPClient and AsyncHTTPClient for programmatic access. The client automatically maps API method names to HTTP endpoints and handles serialization.
Key considerations:
- SyncHTTPClient provides synchronous blocking calls
- AsyncHTTPClient provides async non-blocking calls
- The Swagger UI at the service URL provides interactive testing
- Standard HTTP tools (curl, requests, httpx) work directly with the REST API