Workflow: BentoML Service Definition and Local Serving
| Knowledge Sources | |
|---|---|
| Domains | ML_Serving, API_Development, Model_Inference |
| Last Updated | 2026-02-13 15:00 GMT |
Overview
End-to-end process for defining a BentoML Service class that wraps an ML model and serving it locally as a production-grade HTTP API.
Description
This workflow covers the foundational BentoML use case: transforming a trained ML model into a deployable inference API using the Python-native Service abstraction. It starts with installing BentoML and the ML framework dependencies your model requires, proceeds through defining a Service class with the @bentoml.service decorator and annotating inference methods with @bentoml.api, and concludes with running the service locally via bentoml serve. The resulting server exposes REST API endpoints with automatic request validation, OpenAPI documentation, and a Swagger UI.
Key capabilities covered:
- Service class definition with resource configuration
- API endpoint declaration with type-safe input/output
- HuggingFace model loading via HuggingFaceModel
- Local HTTP serving with auto-reload support
- Client-side invocation via SyncHTTPClient
Usage
Execute this workflow when you have a trained ML model (from any framework such as PyTorch, TensorFlow, scikit-learn, or Transformers) and need to expose it as a REST API for development, testing, or local inference. This is the starting point for all BentoML projects and the prerequisite for packaging, containerization, and cloud deployment.
Execution Steps
Step 1: Environment Setup
Install BentoML and the ML framework dependencies required by your model. BentoML requires Python 3.9 or higher. Create and activate a virtual environment, then install BentoML alongside your model's dependencies (e.g., torch, transformers, scikit-learn).
Key considerations:
- Use a virtual environment for dependency isolation
- BentoML supports Python 3.9+
- Install ML framework packages that your model requires for inference
Step 2: Define the Service Class
Create a service.py file containing a Python class decorated with @bentoml.service. The decorator transforms a regular Python class into a BentoML Service with lifecycle management, resource configuration, and serving capabilities. Configure service-level settings such as resource requirements (CPU, memory, GPU), traffic timeouts, and worker counts through decorator parameters.
Key considerations:
- The class name determines the service name (auto-lowercased for the endpoint)
- Resource and traffic configurations are specified as decorator keyword arguments
- The image parameter can define the runtime environment (Python version, packages) for later containerization
- Each Service class manages its own lifecycle and state
Step 3: Load the Model
Within the Service class constructor (__init__), load the ML model into memory. BentoML provides bentoml.models.HuggingFaceModel for loading models from HuggingFace Hub and bentoml.models.BentoModel for loading from the local Model Store. The model reference must be declared as a class-level attribute so BentoML can track it as a dependency.
Key considerations:
- Declare model references at the class level (not inside __init__) for proper dependency tracking
- HuggingFaceModel returns the downloaded model path as a string
- BentoModel loads from the local Model Store using tag-based versioning
- Models are loaded once during service initialization and reused across requests
Step 4: Define API Endpoints
Annotate inference methods with the @bentoml.api decorator. Each decorated method becomes an HTTP endpoint. BentoML uses Python type hints to automatically generate request/response schemas, input validation, and OpenAPI documentation. Configure batching behavior, custom routes, and input/output specifications through decorator parameters.
Key considerations:
- Method parameters become the request schema (uses Python type hints)
- Return type annotations define the response format
- The batchable flag enables adaptive batching for throughput optimization
- Custom routes can override the default method-name-based URL path
- Both synchronous and asynchronous (async/await) methods are supported
Step 5: Serve Locally
Run bentoml serve from the directory containing the service.py file. This starts a production-grade HTTP server (based on Uvicorn/Starlette) that exposes all defined API endpoints. The server includes a built-in Swagger UI for interactive testing, health check endpoints, and optional hot-reload during development.
Key considerations:
- Default address is http://localhost:3000
- Use --reload flag for auto-reload during development
- The server exposes /docs for Swagger UI and /healthz for health checks
- Specify the service module and class using the bentoml serve module:ClassName syntax
- Development mode can be enabled with the --development flag
Step 6: Test the API
Invoke the running service using the built-in BentoML client, curl commands, or the Swagger UI. BentoML provides SyncHTTPClient and AsyncHTTPClient for programmatic access. The client automatically maps API method names to HTTP endpoints and handles serialization.
Key considerations:
- SyncHTTPClient provides synchronous blocking calls
- AsyncHTTPClient provides async non-blocking calls
- The Swagger UI at the service URL provides interactive testing
- Standard HTTP tools (curl, requests, httpx) work directly with the REST API