Implementation:Bentoml BentoML Per Service Config
Overview
Per Service Config is the YAML-based configuration pattern for specifying per-service resources, scaling, and environment settings in BentoML distributed deployments. Each service in a multi-service composition can be independently configured with its own instance type, autoscaling parameters, resource overrides, and environment variables.
This is a Pattern Doc documenting the YAML configuration structure for per-service deployment settings.
Interface Specification
Source Location
src/bentoml/_internal/cloud/deployment.py:L256-297
YAML Configuration Structure
The deployment configuration uses a services: section where each key is a service class name and the value contains that service's configuration.
services:
  Preprocessing:
    instance_type: cpu.small
    scaling:
      min_replicas: 1
      max_replicas: 5
  InferenceModel:
    instance_type: gpu.a100
    scaling:
      min_replicas: 1
      max_replicas: 3
    config_overrides:
      resources:
        gpu: 1
  Postprocessing:
    instance_type: cpu.small
    scaling:
      min_replicas: 1
      max_replicas: 5
Configuration Parameters Per Service
| Parameter | Type | Description |
|---|---|---|
| instance_type | string | The hardware profile for this service (e.g., cpu.small, cpu.large, gpu.t4, gpu.a100). Determines the underlying machine type. |
| scaling.min_replicas | integer | Minimum number of replicas. The service will never scale below this count, ensuring baseline availability. |
| scaling.max_replicas | integer | Maximum number of replicas. The autoscaler will not exceed this count, providing cost control. |
| config_overrides | object | Fine-grained resource configuration that overrides defaults from the instance type. |
| config_overrides.resources | object | Resource requests including gpu (integer), memory (string, e.g., "16Gi"), and cpu (string or integer). |
| envs | list | Environment variables specific to this service. Each entry has name and value keys. |
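The constraints in the table can be checked before a deployment is submitted. The following is a minimal sketch, not BentoML's own validation: the helper name validate_service_config is hypothetical, it operates on one service's settings already loaded into a plain dict, and it assumes that a minimum replica count of zero is acceptable.

```python
from typing import Any


def validate_service_config(name: str, cfg: dict[str, Any]) -> list[str]:
    """Return human-readable problems found in one service's settings.

    Hypothetical helper for illustration; BentoML performs its own checks.
    """
    problems: list[str] = []

    # instance_type, when present, should be a string such as "cpu.small".
    instance_type = cfg.get("instance_type")
    if instance_type is not None and not isinstance(instance_type, str):
        problems.append(f"{name}: instance_type must be a string")

    # Replica bounds should be non-negative integers with min <= max.
    scaling = cfg.get("scaling", {})
    lo = scaling.get("min_replicas")
    hi = scaling.get("max_replicas")
    for key, value in (("min_replicas", lo), ("max_replicas", hi)):
        if value is None:
            continue
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            problems.append(f"{name}: scaling.{key} must be a non-negative integer")
    if isinstance(lo, int) and isinstance(hi, int) and lo > hi:
        problems.append(f"{name}: min_replicas ({lo}) exceeds max_replicas ({hi})")

    # Each envs entry needs both a name and a value key.
    for i, env in enumerate(cfg.get("envs", [])):
        if not isinstance(env, dict) or "name" not in env or "value" not in env:
            problems.append(f"{name}: envs[{i}] needs both name and value keys")

    return problems


ok = validate_service_config(
    "Preprocessing",
    {"instance_type": "cpu.small", "scaling": {"min_replicas": 1, "max_replicas": 5}},
)
bad = validate_service_config(
    "InferenceModel",
    {"scaling": {"min_replicas": 4, "max_replicas": 2}, "envs": [{"name": "MODEL_NAME"}]},
)
for line in bad:
    print(line)
```

Returning a list of problems rather than raising on the first one lets a caller report every misconfigured service in a single pass.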
Example Implementations
Full Deployment Configuration
# bentoml_deployment.yaml
name: my-ml-pipeline
bento: ./
access_authorization: false
services:
  TextPreprocessor:
    instance_type: cpu.medium
    scaling:
      min_replicas: 2
      max_replicas: 10
    envs:
      - name: TOKENIZER_PATH
        value: /models/tokenizer
      - name: MAX_SEQUENCE_LENGTH
        value: "512"
  LLMInference:
    instance_type: gpu.a100
    scaling:
      min_replicas: 1
      max_replicas: 5
    config_overrides:
      resources:
        gpu: 1
        memory: "32Gi"
    envs:
      - name: MODEL_NAME
        value: "llama-7b"
      - name: PRECISION
        value: "float16"
  ResponseFormatter:
    instance_type: cpu.small
    scaling:
      min_replicas: 1
      max_replicas: 3
  SafetyFilter:
    instance_type: cpu.medium
    scaling:
      min_replicas: 1
      max_replicas: 5
    envs:
      - name: SAFETY_THRESHOLD
        value: "0.85"
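Because max_replicas caps each service independently, it can help to total the replica bounds per instance type when estimating the footprint of the whole deployment. The sketch below is illustrative only: the scaling values are copied by hand from the configuration above into a plain dict, and replica_bounds is a hypothetical helper, not part of BentoML.

```python
from collections import defaultdict

# Instance types and scaling bounds copied from the configuration above.
services = {
    "TextPreprocessor": {
        "instance_type": "cpu.medium",
        "scaling": {"min_replicas": 2, "max_replicas": 10},
    },
    "LLMInference": {
        "instance_type": "gpu.a100",
        "scaling": {"min_replicas": 1, "max_replicas": 5},
    },
    "ResponseFormatter": {
        "instance_type": "cpu.small",
        "scaling": {"min_replicas": 1, "max_replicas": 3},
    },
    "SafetyFilter": {
        "instance_type": "cpu.medium",
        "scaling": {"min_replicas": 1, "max_replicas": 5},
    },
}


def replica_bounds(services: dict) -> dict[str, tuple[int, int]]:
    """Sum min and max replicas per instance type (hypothetical helper)."""
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])
    for cfg in services.values():
        bucket = totals[cfg["instance_type"]]
        bucket[0] += cfg["scaling"]["min_replicas"]
        bucket[1] += cfg["scaling"]["max_replicas"]
    return {instance_type: (lo, hi) for instance_type, (lo, hi) in totals.items()}


bounds = replica_bounds(services)
for instance_type, (lo, hi) in sorted(bounds.items()):
    print(f"{instance_type}: {lo} to {hi} replicas")
```

For this configuration the GPU tier stays between 1 and 5 replicas, while the two cpu.medium services together range from 3 to 15.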
Corresponding Service Code
import bentoml


@bentoml.service(resources={"cpu": "2", "memory": "4Gi"})
class TextPreprocessor:
    @bentoml.api
    def preprocess(self, text: str) -> list[int]:
        ...


@bentoml.service(resources={"gpu": 1, "memory": "32Gi"})
class LLMInference:
    @bentoml.api
    def generate(self, tokens: list[int]) -> str:
        ...


@bentoml.service(resources={"cpu": "1"})
class ResponseFormatter:
    @bentoml.api
    def format(self, raw_response: str) -> dict:
        ...


@bentoml.service(resources={"cpu": "2"})
class SafetyFilter:
    @bentoml.api
    def check(self, response: str) -> dict:
        ...


@bentoml.service
class Pipeline:
    # Each dependency is deployed as its own service and configured
    # under the matching class name in the services: section.
    preprocessor = bentoml.depends(TextPreprocessor)
    llm = bentoml.depends(LLMInference)
    formatter = bentoml.depends(ResponseFormatter)
    safety = bentoml.depends(SafetyFilter)

    @bentoml.api
    async def generate(self, text: str) -> dict:
        tokens = await self.preprocessor.to_async.preprocess(text)
        raw_response = await self.llm.to_async.generate(tokens)
        formatted = await self.formatter.to_async.format(raw_response)
        checked = await self.safety.to_async.check(raw_response)
        return {**formatted, "safety": checked}
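In the pipeline above, the formatting and safety calls both take raw_response and do not depend on each other, so they could be awaited concurrently instead of one after the other. The sketch below shows the idea with asyncio.gather and is an optional variation, not the pattern's required form: format_response and check_safety are hypothetical local stand-ins for the remote calls, so the example runs without BentoML.

```python
import asyncio


# Hypothetical stand-in for the remote ResponseFormatter.format call.
async def format_response(raw: str) -> dict:
    await asyncio.sleep(0.01)  # simulate network latency
    return {"text": raw.strip()}


# Hypothetical stand-in for the remote SafetyFilter.check call.
async def check_safety(raw: str) -> dict:
    await asyncio.sleep(0.01)  # simulate network latency
    return {"safe": True}


async def finish(raw: str) -> dict:
    # Both coroutines only need the raw response, so run them together.
    formatted, checked = await asyncio.gather(
        format_response(raw),
        check_safety(raw),
    )
    return {**formatted, "safety": checked}


result = asyncio.run(finish("  hello  "))
print(result)
```

Whether this is worthwhile depends on the relative latency of the two services; when one call is far slower than the other, the saving is small.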
Deploying with the Configuration
# Deploy to BentoCloud with per-service config
bentoml deploy -f bentoml_deployment.yaml
Resource Allocation Strategy
| Service Type | Recommended Instance | Scaling Strategy | Rationale |
|---|---|---|---|
| Preprocessing (CPU-bound) | cpu.small to cpu.medium | Higher max replicas | Fast per-request, scale horizontally |
| Model Inference (GPU-bound) | gpu.t4 to gpu.a100 | Lower max replicas, batching | Expensive hardware, optimize utilization |
| Post-processing (CPU-bound) | cpu.small | Moderate replicas | Lightweight, rarely the bottleneck |
| Ensemble Aggregator | cpu.small | Match inference scaling | Must handle combined output volume |
Source Files
src/bentoml/_internal/cloud/deployment.py:L256-297 (deployment configuration parsing)
Relationship to Principle
This configuration pattern implements the Distributed Deployment Configuration principle by providing a YAML-based mechanism for independently configuring resources, scaling, and environment settings for each service in a BentoML composition.
Principle:Bentoml_BentoML_Distributed_Deployment_Configuration
Metadata
- ML_Serving
- Service_Composition
- Distributed_Systems
- Cloud_Deployment
- Infrastructure_Configuration
- Multi_Model_Composition