Principle:Bentoml BentoML Service Architecture Design
Overview
Service Architecture Design addresses the challenge of building complex ML applications in which multiple models must work together. Instead of exposing a single monolithic inference endpoint, this principle advocates decomposing a multi-model inference pipeline into a service dependency graph, where each model is an independently deployable service connected to the others through well-defined interfaces.
Detailed Explanation
Complex ML applications rarely consist of a single model. Real-world systems often involve:
- Preprocessing services that transform raw inputs (e.g., image resizing, text tokenization)
- Primary inference models that perform the core prediction task
- Post-processing services that format, filter, or enrich model outputs
- Ensemble services that combine predictions from multiple models
Service-oriented architecture provides a principled way to manage this complexity. Treating each model or processing step as an independent service lets teams develop, provision, and scale each component separately; the specific advantages are listed under Design Principles.
Core Composition Patterns
| Pattern | Description | Use Case |
|---|---|---|
| Sequential Pipeline | Services execute in order: A -> B -> C | Text extraction -> NLP analysis -> Summarization |
| Parallel Ensemble | Multiple services process the same input: A -> [B, C] -> D | Running multiple model variants and aggregating results |
| Mixed DAG | Combination of sequential and parallel flows | Preprocessing -> [Model A, Model B] -> Post-processing -> Ranking |
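The three patterns in the table can be sketched in plain Python with asyncio, independent of any serving framework. This is a minimal illustration only: the stage functions (preprocess, length_model, vowel_model, average, label) are hypothetical stand-ins for real services, and the "models" are trivial arithmetic so that the data flow is easy to follow.

```python
import asyncio

# Hypothetical stand-in stages; in a real system each would be its own service.
async def preprocess(text: str) -> str:
    return text.strip().lower()

async def length_model(text: str) -> float:
    return float(len(text))

async def vowel_model(text: str) -> float:
    return float(sum(ch in "aeiou" for ch in text))

async def average(scores: list[float]) -> float:
    return sum(scores) / len(scores)

async def label(score: float) -> str:
    return "high" if score >= 4 else "low"

async def sequential(text: str) -> str:
    # Sequential Pipeline: A -> B -> C, each stage awaits the previous one.
    cleaned = await preprocess(text)
    score = await length_model(cleaned)
    return await label(score)

async def parallel_ensemble(text: str) -> float:
    # Parallel Ensemble: the same input goes to both models concurrently,
    # then a final stage aggregates their outputs.
    scores = await asyncio.gather(length_model(text), vowel_model(text))
    return await average(list(scores))

async def mixed_dag(text: str) -> str:
    # Mixed DAG: sequential preprocessing, a parallel fan-out, then
    # sequential post-processing and labeling.
    cleaned = await preprocess(text)
    scores = await asyncio.gather(length_model(cleaned), vowel_model(cleaned))
    return await label(await average(list(scores)))

if __name__ == "__main__":
    print(asyncio.run(sequential("  Banana ")))
    print(asyncio.run(parallel_ensemble("banana")))
    print(asyncio.run(mixed_dag("  Banana ")))
```

The fan-out in parallel_ensemble and mixed_dag uses asyncio.gather so that independent stages run concurrently; the sequential stages simply await one another in order.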
Design Principles
- Single Responsibility: Each service wraps exactly one model or processing step. This enables independent development, testing, and versioning of each component.
- Explicit Dependencies: Services declare their dependencies as part of their class definition. The framework resolves these dependencies at runtime, either in-process or across processes.
- Independent Resource Allocation: Different models have vastly different resource requirements. A preprocessing service may need only CPU, while an inference service requires GPU. Service decomposition enables per-service resource configuration.
- Fault Isolation: When one service in a pipeline fails, the failure is contained to that service. Other services remain available, and the composing service can catch the error and decide how to respond.
- Independent Scaling: In production, each service can scale independently based on its throughput characteristics. A lightweight preprocessing service may need fewer replicas than a compute-heavy inference service.
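As one illustration of fault isolation handled at the composition level, the sketch below shows an ensemble that keeps serving when one member fails. It is plain Python rather than framework code, and healthy_model, failing_model, and ensemble are hypothetical names; the policy of averaging whatever succeeded is an assumption chosen for the example, not a recommendation from the source.

```python
import asyncio

# Hypothetical ensemble members; one always fails to simulate an outage.
async def healthy_model(x: int) -> int:
    return x * 2

async def failing_model(x: int) -> int:
    raise RuntimeError("model backend unavailable")

async def ensemble(x: int) -> dict:
    # return_exceptions=True makes gather return exceptions as values
    # instead of raising on the first failure, so the composing layer
    # can decide what to do with partial results.
    results = await asyncio.gather(
        healthy_model(x), failing_model(x), return_exceptions=True
    )
    ok = [r for r in results if not isinstance(r, Exception)]
    failed = len(results) - len(ok)
    if not ok:
        # Nothing succeeded, so there is no partial result to fall back on.
        raise RuntimeError("all ensemble members failed")
    return {"prediction": sum(ok) / len(ok), "failed": failed}

if __name__ == "__main__":
    print(asyncio.run(ensemble(21)))
```

The composing function is the only place that knows about the failure policy, which keeps the individual members simple and independently testable.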
When to Use Composition vs. Single Service
Composition is appropriate when:
- Models have different resource requirements (e.g., CPU vs. GPU)
- Components need to scale independently
- Teams want to deploy and version models independently
- The pipeline has parallelizable stages
A single service is sufficient when:
- The pipeline is simple and linear
- All models share the same resource profile
- Latency overhead from inter-service communication is unacceptable
- The models are always deployed together
Relationship to Implementation
This principle is implemented through BentoML's service composition system: each model is wrapped in a class decorated with @bentoml.service, and an entry service uses bentoml.depends() to wire those classes into a dependency graph.
Implementation:Bentoml_BentoML_Service_Composition_Pattern