
Workflow:BentoML Multi-Model Composition

From Leeroopedia
Knowledge Sources
Domains ML_Serving, Model_Composition, Distributed_Systems
Last Updated 2026-02-13 15:00 GMT

Overview

End-to-end process for composing multiple ML models into a unified inference pipeline using BentoML's distributed service architecture and dependency injection system.

Description

This workflow covers the construction of multi-model AI applications where several models collaborate to produce a combined result. BentoML's bentoml.depends() mechanism enables services to declare dependencies on other services, allowing automatic orchestration of complex inference graphs. Models can execute sequentially (pipeline), concurrently (ensemble), or in a hybrid pattern. Each service can be independently configured with its own resource requirements (CPU/GPU/memory), scaling policies, and worker counts.

Key capabilities covered:

  • Multi-service definition with independent resource allocation
  • Dependency injection via bentoml.depends()
  • Sequential (pipeline) model execution
  • Concurrent (ensemble) model execution using asyncio.gather
  • Inference graph patterns combining parallel and sequential steps
  • Automatic service discovery and inter-service communication

Usage

Execute this workflow when your AI application requires multiple models working together, such as preprocessing pipelines, ensemble predictions, RAG systems, AI agents, or multi-modal applications. This is also appropriate when different models have different hardware requirements (e.g., one model needs a GPU while another runs on CPU) and need independent scaling.

Execution Steps

Step 1: Design the Service Architecture

Plan the multi-model system by identifying which models are involved, their dependencies, resource requirements, and execution patterns. Determine whether models should run in the same service (shared hardware) or as separate services (independent scaling). Map out the data flow between models to identify sequential and parallel execution opportunities.

Key considerations:

  • Models on the same hardware with similar resources can share a service
  • Models requiring different GPU types or independent scaling need separate services
  • Identify which steps can run concurrently to maximize throughput
  • Diamond-shaped dependencies (multiple services depending on a shared service) are supported
  • Each service runs as a separate container in deployment

Step 2: Define Individual Model Services

Create a BentoML Service class for each independently scalable model component. Each service loads its own model, defines its own API endpoints, and specifies its own resource requirements. Services are self-contained units that can be developed and tested independently.

Key considerations:

  • Each service gets its own @bentoml.service decorator with resource configuration
  • Model loading happens in the constructor (__init__)
  • API methods define the interface that other services will call
  • Services can be in the same file or different modules
  • Each service is tested individually before composition

Step 3: Wire Dependencies with bentoml.depends

In the orchestrator service, declare dependencies on other services using bentoml.depends(ServiceClass) as class-level attributes. This creates a direct communication channel between services. BentoML handles service discovery, request routing, and payload serialization automatically.

Key considerations:

  • bentoml.depends() accepts a service class, deployment name, or URL
  • Dependencies are declared as class attributes, not in __init__
  • The dependency object behaves like a local instance of the service
  • Calling methods on dependencies triggers inter-service communication
  • External deployments (already running) can be referenced by URL or deployment name

Step 4: Implement Execution Patterns

Build the orchestration logic in the main service's API methods. For sequential execution, call dependent services one after another, passing outputs as inputs. For concurrent execution, use asyncio.gather with the .to_async property to run multiple service calls in parallel. Complex inference graphs combine both patterns.

What happens:

  • Sequential: output_a = service_a.process(input) then service_b.process(output_a)
  • Concurrent: result_a, result_b = await asyncio.gather(svc_a.to_async.run(x), svc_b.to_async.run(x))
  • The .to_async property converts synchronous methods to asynchronous for non-blocking execution
  • Results from parallel operations can be aggregated, filtered, or chained further
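The two patterns can be illustrated with plain asyncio coroutines standing in for service calls (in a real BentoML service, each stub below would be a dependency call such as self.model_a.to_async.run(x)):

```python
import asyncio

# Stand-ins for remote model services; the sleeps simulate inference latency.
async def model_a(x: int) -> int:
    await asyncio.sleep(0.01)
    return x + 1

async def model_b(x: int) -> int:
    await asyncio.sleep(0.01)
    return x * 2

async def model_c(a: int, b: int) -> int:
    await asyncio.sleep(0.01)
    return a + b

async def pipeline(x: int) -> int:
    # Sequential: model_b consumes model_a's output.
    a_out = await model_a(x)
    return await model_b(a_out)

async def ensemble(x: int) -> int:
    # Concurrent: model_a and model_b run in parallel,
    # model_c aggregates (a diamond-shaped inference graph).
    a_out, b_out = await asyncio.gather(model_a(x), model_b(x))
    return await model_c(a_out, b_out)

print(asyncio.run(pipeline(3)))  # (3 + 1) * 2 = 8
print(asyncio.run(ensemble(3)))  # (3 + 1) + (3 * 2) = 10
```

The concurrent version completes in roughly the latency of the slowest branch plus the aggregation step, rather than the sum of all branches.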

Step 5: Configure Distributed Deployment

Prepare a deployment configuration that specifies per-service settings including instance types, scaling policies, and environment variables. For BentoCloud deployments, use a YAML configuration file that maps each service to its desired resources. For Docker/Kubernetes, each service runs in its own container.

Key considerations:

  • A YAML config file maps each service name to its deployment settings
  • Each service can have different instance types, replica counts, and environment variables
  • Deploy with bentoml deploy -f config.yaml
  • On BentoCloud, services auto-discover each other within the same deployment
  • Locally, bentoml serve runs all services in a single process for development
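A sketch of such a configuration file for BentoCloud (service names, instance types, and exact keys are illustrative and may vary by account and SDK version):

```yaml
name: multi-model-pipeline
bento: .
services:
  Orchestrator:
    instance_type: cpu.2
    scaling:
      min_replicas: 1
      max_replicas: 3
  Summarizer:
    instance_type: gpu.t4.1
    scaling:
      min_replicas: 1
      max_replicas: 5
    envs:
      - name: HF_TOKEN
        value: "<token>"
```

Each top-level entry under services maps a service class name to its own instance type, replica range, and environment, so the GPU-backed service can scale independently of the CPU orchestrator.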

Step 6: Test the Composed Pipeline

Verify the full pipeline works correctly by testing both locally (with bentoml serve) and in the target deployment environment. Test individual service endpoints and the composed endpoint to ensure correct data flow, error propagation, and performance characteristics.

Key considerations:

  • Local serving runs all services in one process for easy debugging
  • Test edge cases like timeout propagation and error handling across services
  • Monitor inter-service latency in production
  • The BentoML client works the same way for both single and composed services

Execution Diagram

GitHub URL

Workflow Repository