Principle: BentoML HTTP Production Serving
| Knowledge Sources | |
|---|---|
| Domains | |
| Last Updated | 2026-02-13 15:00 GMT |
Overview
A design pattern for running a BentoML service as a multi-process production HTTP server. Production serving uses a process supervisor (Circus) to manage multiple Uvicorn worker processes behind a shared socket, providing fault tolerance, graceful restarts, and resource isolation.
Description
BentoML's production serving architecture is built around a multi-process model where a supervisor process (powered by the Circus library) orchestrates one or more Uvicorn ASGI worker processes. This design addresses the fundamental challenges of production ML serving:
- Fault tolerance -- if a worker process crashes (e.g., due to an out-of-memory error during inference), the supervisor automatically restarts it without affecting other workers or dropping the shared listening socket.
- Concurrency -- multiple worker processes can handle requests in parallel, utilizing multiple CPU cores. Each worker runs its own copy of the model and event loop.
- Resource isolation -- when a service has dependencies on other services, each service in the dependency graph runs in its own worker pool with dedicated resources.
- Graceful lifecycle -- the supervisor handles SIGTERM/SIGINT signals, draining in-flight requests before shutting down workers.
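These lifecycle guarantees can be sketched with the Python standard library alone. The following is an illustrative toy, not BentoML's actual Circus integration: `worker` and `supervise` are hypothetical names, the crash is simulated, and a real worker would run an HTTP server rather than exit. It assumes a POSIX system (the `fork` start method).

```python
import multiprocessing as mp

_ctx = mp.get_context("fork")  # fork keeps the sketch self-contained (POSIX-only)

def worker(crashes):
    # Hypothetical worker: simulates an OOM-style crash on its first run,
    # then exits cleanly once restarted (a real worker would serve HTTP).
    with crashes.get_lock():
        if crashes.value == 0:
            crashes.value = 1
            raise SystemExit(1)  # simulated crash during inference

def supervise(max_restarts=3):
    # Minimal supervisor loop in the spirit of Circus: spawn a worker,
    # wait on it, and restart it whenever it exits with a failure code.
    crashes = _ctx.Value("i", 0)
    restarts = 0
    while restarts <= max_restarts:
        p = _ctx.Process(target=worker, args=(crashes,))
        p.start()
        p.join()
        if p.exitcode == 0:
            return restarts  # worker exited cleanly; no restart needed
        restarts += 1
    return restarts
```

The key property is that the crash is contained in the child process: the supervisor observes a non-zero exit code and respawns, while other workers (not shown) would keep serving.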
The production server supports:
- SSL termination via `ssl_certfile` and `ssl_keyfile` parameters.
- Hot reload for development mode, automatically restarting workers when source files change.
- Configurable worker counts via `api_workers` (defaulting to 1, typically set to the number of CPU cores for CPU-bound workloads).
The `bentoml serve` CLI command is the primary user-facing entry point; internally it calls the `serve_http_production` function.
Usage
Use the production HTTP server when:
- Deploying a BentoML service to handle real traffic (as opposed to development/testing).
- You need multi-process serving for throughput and fault tolerance.
- You need SSL, custom port/host binding, or multi-service orchestration.
Typical invocation from the CLI:
```bash
# Single service
bentoml serve service:MyService --port 3000 --api-workers 4

# Development mode with hot reload
bentoml serve service:MyService --reload --development
```
Theoretical Basis
The production serving pattern applies the prefork worker model common in production web servers (e.g., Gunicorn, Nginx): a supervisor process forks multiple worker processes that share a listening socket via SO_REUSEPORT or file descriptor inheritance.
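The socket-sharing half of that model can be shown concretely with the standard library. This is a simplified single-process sketch, not BentoML code, and it assumes a POSIX system that supports `SO_REUSEPORT`; in the real prefork model each socket would live in a separate worker process.

```python
import socket

def bind_shared_pair(host="127.0.0.1"):
    # Two listening sockets on the same host:port, as two prefork workers
    # would hold; both must set SO_REUSEPORT *before* bind().
    a = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    b = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    for s in (a, b):
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    a.bind((host, 0))        # kernel picks a free port
    port = a.getsockname()[1]
    b.bind((host, port))     # same port: allowed only because of SO_REUSEPORT
    a.listen()
    b.listen()
    return a, b, port
```

With both sockets listening, the kernel load-balances incoming connections between them. The alternative mentioned above, also common with supervisors, is binding once in the parent and passing the file descriptor to workers.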
The abstract architecture is as follows:
```
PRODUCTION_SERVING(service_module, config):
    SUPERVISOR (Circus Arbiter):
        BIND socket on host:port (with optional SSL)
        FOR each service S in dependency_graph(service_module):
            SPAWN worker_pool(S, count=api_workers):
                EACH WORKER:
                    1. Import service module
                    2. Initialize Service[T] (run __init__, load models)
                    3. Start Uvicorn ASGI server on shared socket
                    4. Serve HTTP requests until SIGTERM
        MONITOR workers:
            IF worker crashes -> RESTART worker
            IF SIGTERM received -> GRACEFUL_DRAIN all workers -> EXIT

    REQUEST FLOW:
        Client -> TCP socket -> OS distributes to available worker
               -> Uvicorn -> ASGI app -> Service.method() -> Response
```
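The last hop of the request flow (ASGI app dispatching into a service method) can be sketched as a bare ASGI callable. Here `Classifier` is a hypothetical stand-in for a real service class, and in production Uvicorn, not a test harness, would be the caller of `app`:

```python
import asyncio

class Classifier:
    # Hypothetical service: each worker process holds its own copy.
    def predict(self, text):
        return "positive" if "good" in text else "negative"

service = Classifier()

async def app(scope, receive, send):
    # Minimal ASGI callable: the server invokes this once per HTTP
    # request, and we route the request body to a service method.
    assert scope["type"] == "http"
    event = await receive()  # read the http.request event
    label = service.predict(event.get("body", b"").decode())
    await send({"type": "http.response.start", "status": 200,
                "headers": [(b"content-type", b"text/plain")]})
    await send({"type": "http.response.body", "body": label.encode()})
```

Because each worker constructs its own `service` instance at import time, model state is per-process, which is what makes the crash-isolation and restart behavior above safe.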
Key theoretical properties:
- Process-level isolation -- each worker is a separate OS process with its own memory space; a crash in one worker cannot corrupt another.
- Horizontal scaling -- increasing `api_workers` linearly increases throughput for CPU-bound workloads (up to available cores).
- Zero-downtime restarts -- the supervisor can restart workers one at a time while others continue serving.