Workflow: BentoML Multi-Model Composition
| Knowledge Sources | |
|---|---|
| Domains | ML_Serving, Model_Composition, Distributed_Systems |
| Last Updated | 2026-02-13 15:00 GMT |
Overview
End-to-end process for composing multiple ML models into a unified inference pipeline using BentoML's distributed service architecture and dependency injection system.
Description
This workflow covers the construction of multi-model AI applications where several models collaborate to produce a combined result. BentoML's bentoml.depends() mechanism enables services to declare dependencies on other services, allowing automatic orchestration of complex inference graphs. Models can execute sequentially (pipeline), concurrently (ensemble), or in a hybrid pattern. Each service can be independently configured with its own resource requirements (CPU/GPU/memory), scaling policies, and worker counts.
Key capabilities covered:
- Multi-service definition with independent resource allocation
- Dependency injection via bentoml.depends()
- Sequential (pipeline) model execution
- Concurrent (ensemble) model execution using asyncio.gather
- Inference graph patterns combining parallel and sequential steps
- Automatic service discovery and inter-service communication
Usage
Execute this workflow when your AI application requires multiple models working together, such as preprocessing pipelines, ensemble predictions, RAG systems, AI agents, or multi-modal applications. This is also appropriate when different models have different hardware requirements (e.g., one model needs a GPU while another runs on CPU) and need independent scaling.
Execution Steps
Step 1: Design the Service Architecture
Plan the multi-model system by identifying which models are involved, their dependencies, resource requirements, and execution patterns. Determine whether models should run in the same service (shared hardware) or as separate services (independent scaling). Map out the data flow between models to identify sequential and parallel execution opportunities.
Key considerations:
- Models on the same hardware with similar resources can share a service
- Models requiring different GPU types or independent scaling need separate services
- Identify which steps can run concurrently to maximize throughput
- Diamond-shaped dependencies (multiple services depending on a shared service) are supported
- Each service runs as a separate container in deployment
Step 2: Define Individual Model Services
Create a BentoML Service class for each independently scalable model component. Each service loads its own model, defines its own API endpoints, and specifies its own resource requirements. Services are self-contained units that can be developed and tested independently.
Key considerations:
- Each service gets its own @bentoml.service decorator with resource configuration
- Model loading happens in the constructor (__init__)
- API methods define the interface that other services will call
- Services can be in the same file or different modules
- Each service is tested individually before composition
Step 3: Wire Dependencies with bentoml.depends
In the orchestrator service, declare dependencies on other services using bentoml.depends(ServiceClass) as class-level attributes. This creates a direct communication channel between services. BentoML handles service discovery, request routing, and payload serialization automatically.
Key considerations:
- bentoml.depends() accepts a service class, deployment name, or URL
- Dependencies are declared as class attributes, not in __init__
- The dependency object behaves like a local instance of the service
- Calling methods on dependencies triggers inter-service communication
- External deployments (already running) can be referenced by URL or deployment name
Step 4: Implement Execution Patterns
Build the orchestration logic in the main service's API methods. For sequential execution, call dependent services one after another, passing outputs as inputs. For concurrent execution, use asyncio.gather with the .to_async property to run multiple service calls in parallel. Complex inference graphs combine both patterns.
What happens:
- Sequential: output_a = service_a.process(input) then service_b.process(output_a)
- Concurrent: result_a, result_b = await asyncio.gather(svc_a.to_async.run(x), svc_b.to_async.run(x))
- The .to_async property converts synchronous methods to asynchronous for non-blocking execution
- Results from parallel operations can be aggregated, filtered, or chained further
Step 5: Configure Distributed Deployment
Prepare a deployment configuration that specifies per-service settings including instance types, scaling policies, and environment variables. For BentoCloud deployments, use a YAML configuration file that maps each service to its desired resources. For Docker/Kubernetes, each service runs in its own container.
Key considerations:
- A YAML config file maps each service name to its deployment settings
- Each service can have different instance types, replica counts, and environment variables
- Deploy with bentoml deploy -f config.yaml
- On BentoCloud, services auto-discover each other within the same deployment
- Locally, bentoml serve runs all services in a single process for development
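A deployment config following this layout might look like the fragment below. The service names match the earlier examples, while the instance types, replica counts, and environment variable are placeholder values to adapt to your account and workload.

```yaml
name: multi-model-app
services:
  SentimentModel:
    instance_type: gpu.t4.1      # placeholder GPU instance type
    scaling:
      min_replicas: 1
      max_replicas: 3
  Orchestrator:
    instance_type: cpu.2         # placeholder CPU instance type
    envs:
      - name: LOG_LEVEL
        value: INFO
```

Each top-level entry under services maps one service class to its own resources, so the GPU-bound model and the CPU-bound orchestrator scale independently.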
Step 6: Test the Composed Pipeline
Verify the full pipeline works correctly by testing both locally (with bentoml serve) and in the target deployment environment. Test individual service endpoints and the composed endpoint to ensure correct data flow, error propagation, and performance characteristics.
Key considerations:
- Local serving runs all services in one process for easy debugging
- Test edge cases like timeout propagation and error handling across services
- Monitor inter-service latency in production
- The BentoML client works the same way for both single and composed services