Workflow: BentoML Multi-Model Composition
| Knowledge Sources | |
|---|---|
| Domains | ML_Serving, Model_Composition, Distributed_Systems |
| Last Updated | 2026-02-13 15:00 GMT |
Overview
End-to-end process for composing multiple ML models into a unified inference pipeline using BentoML's distributed service architecture and dependency injection system.
Description
This workflow covers the construction of multi-model AI applications where several models collaborate to produce a combined result. BentoML's bentoml.depends() mechanism enables services to declare dependencies on other services, allowing automatic orchestration of complex inference graphs. Models can execute sequentially (pipeline), concurrently (ensemble), or in a hybrid pattern. Each service can be independently configured with its own resource requirements (CPU/GPU/memory), scaling policies, and worker counts.
Key capabilities covered:
- Multi-service definition with independent resource allocation
- Dependency injection via bentoml.depends()
- Sequential (pipeline) model execution
- Concurrent (ensemble) model execution using asyncio.gather
- Inference graph patterns combining parallel and sequential steps
- Automatic service discovery and inter-service communication
Usage
Execute this workflow when your AI application requires multiple models working together, such as preprocessing pipelines, ensemble predictions, RAG systems, AI agents, or multi-modal applications. This is also appropriate when different models have different hardware requirements (e.g., one model needs a GPU while another runs on CPU) and need independent scaling.
Execution Steps
Step 1: Design the Service Architecture
Plan the multi-model system by identifying which models are involved, their dependencies, resource requirements, and execution patterns. Determine whether models should run in the same service (shared hardware) or as separate services (independent scaling). Map out the data flow between models to identify sequential and parallel execution opportunities.
Key considerations:
- Models on the same hardware with similar resources can share a service
- Models requiring different GPU types or independent scaling need separate services
- Identify which steps can run concurrently to maximize throughput
- Diamond-shaped dependencies (multiple services depending on a shared service) are supported
- Each service runs as a separate container in deployment
Step 2: Define Individual Model Services
Create a BentoML Service class for each independently scalable model component. Each service loads its own model, defines its own API endpoints, and specifies its own resource requirements. Services are self-contained units that can be developed and tested independently.
Key considerations:
- Each service gets its own @bentoml.service decorator with resource configuration
- Model loading happens in the constructor (__init__)
- API methods define the interface that other services will call
- Services can be in the same file or different modules
- Each service is tested individually before composition
Step 3: Wire Dependencies with bentoml.depends
In the orchestrator service, declare dependencies on other services using bentoml.depends(ServiceClass) as class-level attributes. This creates a direct communication channel between services. BentoML handles service discovery, request routing, and payload serialization automatically.
Key considerations:
- bentoml.depends() accepts a service class, deployment name, or URL
- Dependencies are declared as class attributes, not in __init__
- The dependency object behaves like a local instance of the service
- Calling methods on dependencies triggers inter-service communication
- External deployments (already running) can be referenced by URL or deployment name
Step 4: Implement Execution Patterns
Build the orchestration logic in the main service's API methods. For sequential execution, call dependent services one after another, passing outputs as inputs. For concurrent execution, use asyncio.gather with the .to_async property to run multiple service calls in parallel. Complex inference graphs combine both patterns.
What happens:
- Sequential: output_a = service_a.process(input) then service_b.process(output_a)
- Concurrent: result_a, result_b = await asyncio.gather(svc_a.to_async.run(x), svc_b.to_async.run(x))
- The .to_async property converts synchronous methods to asynchronous for non-blocking execution
- Results from parallel operations can be aggregated, filtered, or chained further
Step 5: Configure Distributed Deployment
Prepare a deployment configuration that specifies per-service settings including instance types, scaling policies, and environment variables. For BentoCloud deployments, use a YAML configuration file that maps each service to its desired resources. For Docker/Kubernetes, each service runs in its own container.
Key considerations:
- A YAML config file maps each service name to its deployment settings
- Each service can have different instance types, replica counts, and environment variables
- Deploy with bentoml deploy -f config.yaml
- On BentoCloud, services auto-discover each other within the same deployment
- Locally, bentoml serve runs all services in a single process for development
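A deployment config following this layout might look like the fragment below. The service names match the earlier examples, while the instance types, replica counts, and environment variable are placeholder values to adapt to your account and workload.

```yaml
name: multi-model-app
services:
  SentimentModel:
    instance_type: gpu.t4.1      # placeholder GPU instance type
    scaling:
      min_replicas: 1
      max_replicas: 3
  Orchestrator:
    instance_type: cpu.2         # placeholder CPU instance type
    envs:
      - name: LOG_LEVEL
        value: INFO
```

Each top-level entry under services maps one service class to its own resources, so the GPU-bound model and the CPU-bound orchestrator scale independently.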
Step 6: Test the Composed Pipeline
Verify the full pipeline works correctly by testing both locally (with bentoml serve) and in the target deployment environment. Test individual service endpoints and the composed endpoint to ensure correct data flow, error propagation, and performance characteristics.
Key considerations:
- Local serving runs all services in one process for easy debugging
- Test edge cases like timeout propagation and error handling across services
- Monitor inter-service latency in production
- The BentoML client works the same way for both single and composed services