Implementation:Bentoml BentoML Serve Multi Service
Overview
Serve Multi Service extends BentoML's single-service serve_http_production to orchestrate all services in a dependency graph as separate worker processes. It uses all_services() for recursive dependency discovery and Circus as the process supervisor, enabling end-to-end local testing of multi-model compositions.
Code Reference
Import
# CLI usage (primary):
# bentoml serve service:Pipeline
# Python API:
from bentoml.serving import serve_http_production
Source Locations
- src/bentoml/serving.py:L310-556 -- serve_http_production function with multi-service orchestration
- src/_bentoml_sdk/service/factory.py:L251-273 -- all_services() method for dependency graph discovery
all_services() Method
The all_services() method on a service class recursively discovers all services in the dependency graph.
# From src/_bentoml_sdk/service/factory.py:L251-273
class Service:
    def all_services(self) -> dict[str, Service]:
        """Recursively discover all services in the dependency graph.

        Returns a dict mapping service name to service instance,
        including self and all transitive dependencies.
        """
        services = {self.name: self}
        for dep in self.dependencies.values():
            dep_service = dep.on
            if dep_service is not None and dep_service.name not in services:
                services.update(dep_service.all_services())
        return services
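The traversal can be tried without BentoML installed. The following is a minimal sketch, not the library's code: `ToyService` is a hypothetical stand-in class, and unlike the excerpt above it threads a shared `seen` dict through the recursion so that a service reachable by several paths (or through a cycle) is visited only once.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ToyService:
    """Hypothetical stand-in for a service; not the real BentoML class."""

    name: str
    dependencies: list[ToyService] = field(default_factory=list)

    def all_services(
        self, seen: dict[str, ToyService] | None = None
    ) -> dict[str, ToyService]:
        # Share one dict across the whole traversal so shared or cyclic
        # dependencies are recorded exactly once.
        seen = {} if seen is None else seen
        if self.name in seen:
            return seen
        seen[self.name] = self
        for dep in self.dependencies:
            dep.all_services(seen)
        return seen


# A small graph: Pipeline depends on three services, and InferenceModel
# also depends on Preprocessor (a shared dependency).
pre = ToyService("Preprocessor")
model = ToyService("InferenceModel", [pre])
post = ToyService("Postprocessor")
pipeline = ToyService("Pipeline", [pre, model, post])

discovered = pipeline.all_services()
names = list(discovered)  # depth-first discovery order, entry service first
```

The result maps every service name to a single instance, which is the shape the serving code needs in order to start one worker set per service.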
Multi-Service Orchestration Flow
When bentoml serve is invoked with a composed service, the following steps occur:
- Entry service loading: The specified module path is loaded and the entry service class is resolved.
- Dependency discovery: svc.all_services() is called to recursively discover all services in the dependency graph.
- Bind map creation: A runner_bind_map is created that maps each service name to a local address (Unix domain socket path or TCP port).
- Worker process spawning: For each service in the graph, Circus spawns separate worker processes with the appropriate resource configuration.
- Inter-process communication setup: Each worker is configured with the bind map so it can discover and connect to its dependencies.
- Entry service exposure: Only the entry service's HTTP port is exposed externally; dependent services communicate via internal IPC channels.
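The bind map step can be illustrated with a small sketch. The function `build_bind_map`, the `unix://` and `tcp://` address formats, the socket directory, and the base port are all assumptions made for illustration; the actual address scheme used by serve_http_production may differ.

```python
def build_bind_map(service_names, socket_dir, use_unix_sockets=True, base_port=9000):
    """Map each dependent service name to a local address (hypothetical format)."""
    bind_map = {}
    # Sort the names so the assignment is deterministic across runs.
    for offset, name in enumerate(sorted(service_names)):
        if use_unix_sockets:
            bind_map[name] = f"unix://{socket_dir}/{name}.sock"
        else:
            # Fall back to consecutive loopback TCP ports.
            bind_map[name] = f"tcp://127.0.0.1:{base_port + offset}"
    return bind_map


dependents = ["Preprocessor", "InferenceModel", "Postprocessor"]
bind_map = build_bind_map(dependents, socket_dir="/tmp/bento-demo")
tcp_map = build_bind_map(dependents, socket_dir="", use_unix_sockets=False)
```

Each worker would receive the full map, so a service can look up the address of any dependency by name instead of hard-coding it.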
Key Differences from Single-Service Mode
| Aspect | Single Service | Multi Service |
|---|---|---|
| Process count | One set of workers for one service | Separate worker set per service in the graph |
| Discovery | No discovery needed | all_services() traverses full dependency graph |
| Communication | In-process only | Inter-process via Unix domain sockets or TCP |
| Service binding | Single HTTP port | HTTP port for entry service + internal IPC for dependencies |
| Process supervision | Circus manages one service | Circus manages all services as a process tree |
CLI Usage
# Serve a composed pipeline -- all dependent services are automatically discovered and started
bentoml serve service:Pipeline
# With explicit host and port
bentoml serve service:Pipeline --host 0.0.0.0 --port 3000
# With development mode (auto-reload on code changes)
bentoml serve service:Pipeline --reload
Python API Usage
from bentoml.serving import serve_http_production
# Programmatically start the multi-service server
serve_http_production(
    "service:Pipeline",
    host="0.0.0.0",
    port=3000,
)
Inputs and Outputs
Inputs:
- Entry service module path (e.g., "service:Pipeline") -- the top-level composed service
- Optional: host, port, number of workers, reload flag
Outputs:
- Multi-process server with:
  - Separate worker processes for each service in the dependency graph
  - Inter-process communication channels between dependent services
  - A single externally-accessible HTTP endpoint for the entry service
  - Circus-supervised process tree for health monitoring and restart
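Because only the entry service's HTTP port is exposed, a local end-to-end test typically has to wait for that one port before sending requests. The helper below is a hedged sketch using only the standard library; `wait_for_port` is a hypothetical name and is not part of BentoML.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to (host, port) succeeds, else False."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Not listening yet (or refused); retry until the deadline.
            time.sleep(interval)
    return False


# Example (assumes `bentoml serve service:Pipeline --port 3000` was started
# in another process):
#     ready = wait_for_port("127.0.0.1", 3000)
```

An open port only shows that the entry service is accepting connections; it does not prove that every dependent service has finished loading, so a test may still want to retry its first request.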
Architecture Diagram
bentoml serve service:Pipeline
|
v
+-- Circus Supervisor --+
| |
| +-- Pipeline (HTTP :3000)
| | depends on:
| | Preprocessor (IPC)
| | InferenceModel (IPC)
| | Postprocessor (IPC)
| |
| +-- Preprocessor Workers
| | (Unix socket /tmp/bento_preprocess.sock)
| |
| +-- InferenceModel Workers
| | (Unix socket /tmp/bento_model.sock)
| |
| +-- Postprocessor Workers
| (Unix socket /tmp/bento_postprocess.sock)
|
+------------------------+
Source Files
- src/bentoml/serving.py:L310-556 -- Main serving orchestration with multi-service support
- src/_bentoml_sdk/service/factory.py:L251-273 -- all_services() dependency graph discovery
Relationship to Principle
This implementation realizes the Composed Pipeline Testing principle by providing a local multi-process serving mode that mirrors the distributed production topology, enabling end-to-end testing of composed pipelines on a single machine.
Principle:Bentoml_BentoML_Composed_Pipeline_Testing