Implementation:Bentoml BentoML Serve Multi Service

Overview

Serve Multi Service extends BentoML's single-service serve_http_production to orchestrate all services in a dependency graph as separate worker processes. It uses all_services() for recursive dependency discovery and Circus as the process supervisor, enabling end-to-end local testing of multi-model compositions.

Code Reference

Import

# CLI usage (primary):
# bentoml serve service:Pipeline

# Python API:
from bentoml.serving import serve_http_production

Source Locations

src/bentoml/serving.py:L310-556 -- serve_http_production function with multi-service orchestration
src/_bentoml_sdk/service/factory.py:L251-273 -- all_services() method for dependency graph discovery

`all_services()` Method

The `all_services()` method recursively discovers every service in the dependency graph, starting from the service it is called on.

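# From src/_bentoml_sdk/service/factory.py:L251-273
from __future__ import annotations  # lets the return annotation name Service inside its own body


class Service:
    def all_services(self) -> dict[str, Service]:
        """Recursively discover all services in the dependency graph.

        Returns a dict mapping service name to service instance,
        including self and all transitive dependencies.
        """
        # Start with this service, then merge in each dependency's own graph.
        services = {self.name: self}
        for dep in self.dependencies.values():
            dep_service = dep.on  # the service this dependency points at, if any
            # Skip unresolved dependencies and services already collected.
            if dep_service is not None and dep_service.name not in services:
                services.update(dep_service.all_services())
        return services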
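
To make the traversal concrete, the sketch below runs the same recursion on a toy three-service graph in which one dependency is shared. `ToyService` and `ToyDependency` are hypothetical stand-ins that only mimic the `name`, `dependencies`, and `on` attributes used above; they are not BentoML classes.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyDependency:
    """Hypothetical stand-in for a dependency descriptor; `on` is the target service."""
    on: Optional["ToyService"] = None


@dataclass
class ToyService:
    """Hypothetical stand-in exposing only `name` and `dependencies`."""
    name: str
    dependencies: "dict[str, ToyDependency]" = field(default_factory=dict)

    def all_services(self) -> "dict[str, ToyService]":
        # Same recursion as the excerpt above: self first, then each unseen dependency.
        services = {self.name: self}
        for dep in self.dependencies.values():
            dep_service = dep.on
            if dep_service is not None and dep_service.name not in services:
                services.update(dep_service.all_services())
        return services


# Pipeline depends on Preprocessor and InferenceModel;
# InferenceModel also depends on the same Preprocessor instance.
pre = ToyService("Preprocessor")
model = ToyService("InferenceModel", {"pre": ToyDependency(pre)})
pipeline = ToyService(
    "Pipeline", {"pre": ToyDependency(pre), "model": ToyDependency(model)}
)

discovered = pipeline.all_services()
print(sorted(discovered))
```

The shared `Preprocessor` appears once in the result because the dict is keyed by service name.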

Multi-Service Orchestration Flow

When bentoml serve is invoked with a composed service, the following steps occur:

1. Entry service loading: The specified module path is loaded and the entry service class is resolved.
2. Dependency discovery: svc.all_services() is called to recursively discover all services in the dependency graph.
3. Bind map creation: A runner_bind_map is created that maps each service name to a local address (Unix domain socket path or TCP port).
4. Worker process spawning: For each service in the graph, Circus spawns separate worker processes with the appropriate resource configuration.
5. Inter-process communication setup: Each worker is configured with the bind map so it can discover and connect to its dependencies.
6. Entry service exposure: Only the entry service's HTTP port is exposed externally; dependent services communicate via internal IPC channels.
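
The bind-map and exposure steps above can be sketched in plain Python. This is an illustration under assumptions, not the code in serving.py: `build_bind_map`, `plan_workers`, the `unix://` and `tcp://` address formats, and the socket directory are all hypothetical, and only the name `runner_bind_map` comes from the text above.

```python
import os
import tempfile


def build_bind_map(service_names, entry_name, socket_dir):
    """Map each non-entry service name to a Unix socket address (hypothetical format)."""
    return {
        name: "unix://" + os.path.join(socket_dir, f"{name}.sock")
        for name in service_names
        if name != entry_name
    }


def plan_workers(service_names, entry_name, host, port, socket_dir):
    """Return one launch plan per service; only the entry service gets an HTTP bind."""
    runner_bind_map = build_bind_map(service_names, entry_name, socket_dir)
    plans = []
    for name in service_names:
        if name == entry_name:
            bind = f"tcp://{host}:{port}"  # externally reachable HTTP endpoint
        else:
            bind = runner_bind_map[name]  # internal IPC address
        # Every worker receives the full map so it can locate its dependencies.
        plans.append({"service": name, "bind": bind, "runner_map": runner_bind_map})
    return plans


# No files are created here; the directory is only used to build example paths.
socket_dir = os.path.join(tempfile.gettempdir(), "bento-sketch")
plans = plan_workers(
    ["Pipeline", "Preprocessor", "InferenceModel"],
    "Pipeline",
    "0.0.0.0",
    3000,
    socket_dir,
)
for plan in plans:
    print(plan["service"], "->", plan["bind"])
```

In the real flow a process supervisor would turn each plan into a watched worker command; the sketch stops at the planning stage to stay independent of Circus.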

Key Differences from Single-Service Mode

| Aspect | Single Service | Multi Service |
| --- | --- | --- |
| Process count | One set of workers for one service | Separate worker set per service in the graph |
| Discovery | No discovery needed | `all_services()` traverses full dependency graph |
| Communication | In-process only | Inter-process via Unix domain sockets or TCP |
| Service binding | Single HTTP port | HTTP port for entry service + internal IPC for dependencies |
| Process supervision | Circus manages one service | Circus manages all services as a process tree |

CLI Usage

# Serve a composed pipeline -- all dependent services are automatically discovered and started
bentoml serve service:Pipeline

# With explicit host and port
bentoml serve service:Pipeline --host 0.0.0.0 --port 3000

# With development mode (auto-reload on code changes)
bentoml serve service:Pipeline --reload

Python API Usage

from bentoml.serving import serve_http_production

# Programmatically start the multi-service server
serve_http_production(
    "service:Pipeline",
    host="0.0.0.0",
    port=3000,
)

Inputs and Outputs

Inputs:

- Entry service module path (e.g., "service:Pipeline") -- the top-level composed service
- Optional: host, port, number of workers, reload flag

Outputs:

Multi-process server with:
- Separate worker processes for each service in the dependency graph
- Inter-process communication channels between dependent services
- A single externally-accessible HTTP endpoint for the entry service
- Circus-supervised process tree for health monitoring and restart

Architecture Diagram

bentoml serve service:Pipeline
        |
        v
  +-- Circus Supervisor --+
  |                        |
  |  +-- Pipeline (HTTP :3000)
  |  |     depends on:
  |  |       Preprocessor (IPC)
  |  |       InferenceModel (IPC)
  |  |       Postprocessor (IPC)
  |  |
  |  +-- Preprocessor Workers
  |  |     (Unix socket /tmp/bento_preprocess.sock)
  |  |
  |  +-- InferenceModel Workers
  |  |     (Unix socket /tmp/bento_model.sock)
  |  |
  |  +-- Postprocessor Workers
  |        (Unix socket /tmp/bento_postprocess.sock)
  |
  +------------------------+
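
For readers unfamiliar with Unix domain sockets, the sketch below shows the transport used by the internal edges in the diagram: a dependency listens on a filesystem path and the caller connects to that path instead of a TCP port. It uses only the Python standard library, runs both sides in one process for brevity, and exchanges raw bytes; the socket file name and the upper-casing "service" are hypothetical, and the sketch does not reproduce the protocol BentoML speaks over the socket. It assumes a platform where `socket.AF_UNIX` is available (Linux or macOS).

```python
import os
import socket
import tempfile
import threading


def serve_once(server_sock):
    """Accept one connection, read a request, and reply with its upper-cased form."""
    conn, _ = server_sock.accept()
    with conn:
        data = conn.recv(1024)
        conn.sendall(data.upper())


with tempfile.TemporaryDirectory() as tmp:
    sock_path = os.path.join(tmp, "preprocess.sock")

    # Dependency side: bind and listen on a filesystem path rather than a TCP port.
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(sock_path)
    server.listen(1)
    worker = threading.Thread(target=serve_once, args=(server,))
    worker.start()

    # Caller side: connect to the path it would have looked up in the bind map.
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
        client.connect(sock_path)
        client.sendall(b"hello")
        reply = client.recv(1024)

    worker.join()
    server.close()

print(reply)
```

Because the address is a path on the local filesystem, nothing outside the machine can reach it, which matches the diagram's split between the exposed HTTP port and the internal channels.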

Relationship to Principle

This implementation realizes the Composed Pipeline Testing principle by providing a local multi-process serving mode that mirrors the distributed production topology, enabling end-to-end testing of composed pipelines on a single machine.

Principle:Bentoml_BentoML_Composed_Pipeline_Testing

Metadata

Knowledge Sources

2026-02-13 15:00 GMT
