Principle: BentoML HTTP Production Serving
| Knowledge Sources | |
|---|---|
| Domains | |
| Last Updated | 2026-02-13 15:00 GMT |
Overview
A design pattern for running a BentoML service as a multi-process production HTTP server. Production serving uses a process supervisor (Circus) to manage multiple Uvicorn worker processes behind a shared socket, providing fault tolerance, graceful restarts, and resource isolation.
Description
BentoML's production serving architecture is built around a multi-process model where a supervisor process (powered by the Circus library) orchestrates one or more Uvicorn ASGI worker processes. This design addresses the fundamental challenges of production ML serving:
- Fault tolerance -- if a worker process crashes (e.g., due to an out-of-memory error during inference), the supervisor automatically restarts it without affecting other workers or dropping the shared listening socket.
- Concurrency -- multiple worker processes can handle requests in parallel, utilizing multiple CPU cores. Each worker runs its own copy of the model and event loop.
- Resource isolation -- when a service has dependencies on other services, each service in the dependency graph runs in its own worker pool with dedicated resources.
- Graceful lifecycle -- the supervisor handles SIGTERM/SIGINT signals, draining in-flight requests before shutting down workers.
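These lifecycle guarantees can be sketched with the Python standard library alone. The following is an illustrative toy, not BentoML's actual Circus integration: `worker` and `supervise` are hypothetical names, the crash is simulated, and a real worker would run an HTTP server rather than exit. It assumes a POSIX system (the `fork` start method).

```python
import multiprocessing as mp

_ctx = mp.get_context("fork")  # fork keeps the sketch self-contained (POSIX-only)

def worker(crashes):
    # Hypothetical worker: simulates an OOM-style crash on its first run,
    # then exits cleanly once restarted (a real worker would serve HTTP).
    with crashes.get_lock():
        if crashes.value == 0:
            crashes.value = 1
            raise SystemExit(1)  # simulated crash during inference

def supervise(max_restarts=3):
    # Minimal supervisor loop in the spirit of Circus: spawn a worker,
    # wait on it, and restart it whenever it exits with a failure code.
    crashes = _ctx.Value("i", 0)
    restarts = 0
    while restarts <= max_restarts:
        p = _ctx.Process(target=worker, args=(crashes,))
        p.start()
        p.join()
        if p.exitcode == 0:
            return restarts  # worker exited cleanly; no restart needed
        restarts += 1
    return restarts
```

The key property is that the crash is contained in the child process: the supervisor observes a non-zero exit code and respawns, while other workers (not shown) would keep serving.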
The production server supports:
- SSL termination via `ssl_certfile` and `ssl_keyfile` parameters.
- Hot reload for development mode, automatically restarting workers when source files change.
- Configurable worker counts via `api_workers` (defaulting to 1, typically set to the number of CPU cores for CPU-bound workloads).
The `bentoml serve` CLI command is the primary user-facing entry point; internally it calls the `serve_http_production` function.
Usage
Use the production HTTP server when:
- Deploying a BentoML service to handle real traffic (as opposed to development/testing).
- You need multi-process serving for throughput and fault tolerance.
- You need SSL, custom port/host binding, or multi-service orchestration.
Typical invocation from the CLI:
```bash
# Single service
bentoml serve service:MyService --port 3000 --api-workers 4

# Development mode with hot reload
bentoml serve service:MyService --reload --development
```
Theoretical Basis
The production serving pattern applies the prefork worker model common in production web servers (e.g., Gunicorn, Nginx): a supervisor process forks multiple worker processes that share a listening socket via SO_REUSEPORT or file descriptor inheritance.
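The socket-sharing half of that model can be shown concretely with the standard library. This is a simplified single-process sketch, not BentoML code, and it assumes a POSIX system that supports `SO_REUSEPORT`; in the real prefork model each socket would live in a separate worker process.

```python
import socket

def bind_shared_pair(host="127.0.0.1"):
    # Two listening sockets on the same host:port, as two prefork workers
    # would hold; both must set SO_REUSEPORT *before* bind().
    a = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    b = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    for s in (a, b):
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    a.bind((host, 0))        # kernel picks a free port
    port = a.getsockname()[1]
    b.bind((host, port))     # same port: allowed only because of SO_REUSEPORT
    a.listen()
    b.listen()
    return a, b, port
```

With both sockets listening, the kernel load-balances incoming connections between them. The alternative mentioned above, also common with supervisors, is binding once in the parent and passing the file descriptor to workers.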
The abstract architecture is as follows:
```
PRODUCTION_SERVING(service_module, config):
    SUPERVISOR (Circus Arbiter):
        BIND socket on host:port (with optional SSL)
        FOR each service S in dependency_graph(service_module):
            SPAWN worker_pool(S, count=api_workers):
                EACH WORKER:
                    1. Import service module
                    2. Initialize Service[T] (run __init__, load models)
                    3. Start Uvicorn ASGI server on shared socket
                    4. Serve HTTP requests until SIGTERM
        MONITOR workers:
            IF worker crashes -> RESTART worker
            IF SIGTERM received -> GRACEFUL_DRAIN all workers -> EXIT

    REQUEST FLOW:
        Client -> TCP socket -> OS distributes to available worker
               -> Uvicorn -> ASGI app -> Service.method() -> Response
```
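The last hop of the request flow (ASGI app dispatching into a service method) can be sketched as a bare ASGI callable. Here `Classifier` is a hypothetical stand-in for a real service class, and in production Uvicorn, not a test harness, would be the caller of `app`:

```python
import asyncio

class Classifier:
    # Hypothetical service: each worker process holds its own copy.
    def predict(self, text):
        return "positive" if "good" in text else "negative"

service = Classifier()

async def app(scope, receive, send):
    # Minimal ASGI callable: the server invokes this once per HTTP
    # request, and we route the request body to a service method.
    assert scope["type"] == "http"
    event = await receive()  # read the http.request event
    label = service.predict(event.get("body", b"").decode())
    await send({"type": "http.response.start", "status": 200,
                "headers": [(b"content-type", b"text/plain")]})
    await send({"type": "http.response.body", "body": label.encode()})
```

Because each worker constructs its own `service` instance at import time, model state is per-process, which is what makes the crash-isolation and restart behavior above safe.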
Key theoretical properties:
- Process-level isolation -- each worker is a separate OS process with its own memory space; a crash in one worker cannot corrupt another.
- Horizontal scaling -- increasing `api_workers` linearly increases throughput for CPU-bound workloads (up to available cores).
- Zero-downtime restarts -- the supervisor can restart workers one at a time while others continue serving.