Implementation:Bentoml BentoML Per Service Config

Overview

Per Service Config is the YAML-based configuration pattern for specifying per-service resources, scaling, and environment settings in BentoML distributed deployments. Each service in a multi-service composition can be independently configured with its own instance type, autoscaling parameters, resource overrides, and environment variables.

This page is a pattern document: it describes the YAML structure used for per-service deployment settings.

Interface Specification

Source Location

src/bentoml/_internal/cloud/deployment.py:L256-297

YAML Configuration Structure

The deployment configuration uses a services: section where each key is a service class name and the value contains that service's configuration.

services:
  Preprocessing:
    instance_type: cpu.small
    scaling:
      min_replicas: 1
      max_replicas: 5
  InferenceModel:
    instance_type: gpu.a100
    scaling:
      min_replicas: 1
      max_replicas: 3
    config_overrides:
      resources:
        gpu: 1
  Postprocessing:
    instance_type: cpu.small
    scaling:
      min_replicas: 1
      max_replicas: 5

Configuration Parameters Per Service

Parameter	Type	Description
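`instance_type`	`string`	The hardware profile for this service (e.g., `cpu.small`, `cpu.large`, `gpu.t4`, `gpu.a100`). Determines the underlying machine type. The names shown here are illustrative; the set of valid instance types depends on the target cluster.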
`scaling.min_replicas`	`integer`	Minimum number of replicas. The service will never scale below this count, ensuring baseline availability.
`scaling.max_replicas`	`integer`	Maximum number of replicas. The autoscaler will not exceed this count, providing cost control.
`config_overrides`	`object`	Fine-grained resource configuration that overrides defaults from the instance type.
`config_overrides.resources`	`object`	Resource requests including `gpu` (integer), `memory` (string, e.g., `"16Gi"`), and `cpu` (string or integer).
`envs`	`list`	Environment variables specific to this service. Each entry has `name` and `value` keys.
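
The types and replica bounds in the table above can be checked before a deployment is submitted. The following is a minimal sketch, assuming the `services:` section has already been parsed into a plain Python dict; `validate_service` and `validate_services` are hypothetical helpers written for this page, not part of BentoML.

```python
# Plain-dict mirror of a `services:` section. In practice this would come
# from parsing the YAML file; it is written inline here to stay self-contained.
services = {
    "Preprocessing": {
        "instance_type": "cpu.small",
        "scaling": {"min_replicas": 1, "max_replicas": 5},
    },
    "InferenceModel": {
        "instance_type": "gpu.a100",
        "scaling": {"min_replicas": 1, "max_replicas": 3},
        "config_overrides": {"resources": {"gpu": 1}},
    },
}


def _is_count(value) -> bool:
    """True for non-negative integers (bool is excluded on purpose)."""
    return isinstance(value, int) and not isinstance(value, bool) and value >= 0


def validate_service(name: str, cfg: dict) -> list[str]:
    """Return human-readable problems for one service (hypothetical helper)."""
    problems = []
    if not isinstance(cfg.get("instance_type"), str):
        problems.append(f"{name}: instance_type must be a string")

    scaling = cfg.get("scaling", {})
    lo = scaling.get("min_replicas")
    hi = scaling.get("max_replicas")
    for key, value in (("min_replicas", lo), ("max_replicas", hi)):
        if value is not None and not _is_count(value):
            problems.append(f"{name}: scaling.{key} must be a non-negative integer")
    if _is_count(lo) and _is_count(hi) and lo > hi:
        problems.append(f"{name}: min_replicas ({lo}) exceeds max_replicas ({hi})")

    overrides = cfg.get("config_overrides", {})
    if not isinstance(overrides.get("resources", {}), dict):
        problems.append(f"{name}: config_overrides.resources must be an object")
    return problems


def validate_services(all_services: dict) -> list[str]:
    """Collect problems across every service in the section."""
    return [p for name, cfg in all_services.items() for p in validate_service(name, cfg)]


print(validate_services(services))  # an empty list means no problems were found
```

The checks only cover the fields listed in the table; anything the platform itself enforces (for example, whether an instance type exists) is out of scope for this sketch.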

Example Implementations

Full Deployment Configuration

# bentoml_deployment.yaml
name: my-ml-pipeline
bento: ./
access_authorization: false
services:
  TextPreprocessor:
    instance_type: cpu.medium
    scaling:
      min_replicas: 2
      max_replicas: 10
    envs:
      - name: TOKENIZER_PATH
        value: /models/tokenizer
      - name: MAX_SEQUENCE_LENGTH
        value: "512"

  LLMInference:
    instance_type: gpu.a100
    scaling:
      min_replicas: 1
      max_replicas: 5
    config_overrides:
      resources:
        gpu: 1
        memory: "32Gi"
    envs:
      - name: MODEL_NAME
        value: "llama-7b"
      - name: PRECISION
        value: "float16"

  ResponseFormatter:
    instance_type: cpu.small
    scaling:
      min_replicas: 1
      max_replicas: 3

  SafetyFilter:
    instance_type: cpu.medium
    scaling:
      min_replicas: 1
      max_replicas: 5
    envs:
      - name: SAFETY_THRESHOLD
        value: "0.85"
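
Numeric-looking values such as `"512"` and `"0.85"` are quoted in the configuration above, so YAML keeps them as strings. The sketch below flattens an `envs` list into a mapping and rejects non-string values. It assumes that values should be strings (environment variables are strings at the process level); `envs_to_dict` is a hypothetical helper, not a BentoML function.

```python
def envs_to_dict(envs: list[dict]) -> dict[str, str]:
    """Flatten [{'name': ..., 'value': ...}, ...] into {name: value}."""
    result: dict[str, str] = {}
    for entry in envs:
        name, value = entry["name"], entry["value"]
        if not isinstance(value, str):
            # An unquoted 512 in YAML would arrive here as an int.
            raise TypeError(f"{name}: value must be a string, got {type(value).__name__}")
        if name in result:
            raise ValueError(f"{name}: defined more than once")
        result[name] = value
    return result


# The envs list of TextPreprocessor, as it would look after YAML parsing.
text_preprocessor_envs = [
    {"name": "TOKENIZER_PATH", "value": "/models/tokenizer"},
    {"name": "MAX_SEQUENCE_LENGTH", "value": "512"},
]

print(envs_to_dict(text_preprocessor_envs))
# {'TOKENIZER_PATH': '/models/tokenizer', 'MAX_SEQUENCE_LENGTH': '512'}
```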

Corresponding Service Code

import bentoml

@bentoml.service(resources={"cpu": "2", "memory": "4Gi"})
class TextPreprocessor:
    @bentoml.api
    def preprocess(self, text: str) -> list[int]:
        ...

@bentoml.service(resources={"gpu": 1, "memory": "32Gi"})
class LLMInference:
    @bentoml.api
    def generate(self, tokens: list[int]) -> str:
        ...

@bentoml.service(resources={"cpu": "1"})
class ResponseFormatter:
    @bentoml.api
    def format(self, raw_response: str) -> dict:
        ...

@bentoml.service(resources={"cpu": "2"})
class SafetyFilter:
    @bentoml.api
    def check(self, response: str) -> dict:
        ...

@bentoml.service
class Pipeline:
    preprocessor = bentoml.depends(TextPreprocessor)
    llm = bentoml.depends(LLMInference)
    formatter = bentoml.depends(ResponseFormatter)
    safety = bentoml.depends(SafetyFilter)

    @bentoml.api
    async def generate(self, text: str) -> dict:
        tokens = await self.preprocessor.to_async.preprocess(text)
        raw_response = await self.llm.to_async.generate(tokens)
        formatted = await self.formatter.to_async.format(raw_response)
        checked = await self.safety.to_async.check(raw_response)
        return {**formatted, "safety": checked}
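
Because each key under `services:` is a service class name, a typo in the YAML silently detaches a block of settings from its service. The following sketch compares the configured keys against the classes defined in code. Plain stand-in classes replace the decorated services so the snippet runs without BentoML installed, and `check_service_names` is a hypothetical helper.

```python
# Stand-ins for the decorated service classes shown above.
class TextPreprocessor: ...
class LLMInference: ...
class ResponseFormatter: ...
class SafetyFilter: ...
class Pipeline: ...


def check_service_names(config_keys, service_classes) -> dict[str, list[str]]:
    """Report names that appear on only one side (config vs. code)."""
    known = {cls.__name__ for cls in service_classes}
    configured = set(config_keys)
    return {
        "unknown_in_config": sorted(configured - known),
        "missing_from_config": sorted(known - configured),
    }


# Keys of the `services:` section in the full deployment configuration.
config_keys = ["TextPreprocessor", "LLMInference", "ResponseFormatter", "SafetyFilter"]

report = check_service_names(
    config_keys,
    [TextPreprocessor, LLMInference, ResponseFormatter, SafetyFilter, Pipeline],
)
print(report)
# {'unknown_in_config': [], 'missing_from_config': ['Pipeline']}
```

The report shows that `Pipeline` has no entry in the example configuration. Whether a service without an entry is acceptable is a deployment decision that this check only surfaces.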

Deploying with the Configuration

# Deploy to BentoCloud with per-service config
bentoml deploy --config-file bentoml_deployment.yaml
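
For scripted deployments (for example from CI), the same command can be assembled in Python. This is a minimal sketch that assumes the `bentoml` CLI is on `PATH` and uses its `--config-file` option (short form `-f`); `build_deploy_command` is a hypothetical helper, and the actual call is left commented out so the snippet has no side effects.

```python
import shlex


def build_deploy_command(config_path: str) -> list[str]:
    """Return the argv list for deploying with a configuration file."""
    return ["bentoml", "deploy", "--config-file", config_path]


cmd = build_deploy_command("bentoml_deployment.yaml")
print(shlex.join(cmd))
# bentoml deploy --config-file bentoml_deployment.yaml

# To actually run the deployment (requires a BentoCloud login):
# import subprocess
# subprocess.run(cmd, check=True)
```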

Resource Allocation Strategy

Service Type	Recommended Instance	Scaling Strategy	Rationale
Preprocessing (CPU-bound)	`cpu.small` to `cpu.medium`	Higher max replicas	Fast per-request, scale horizontally
Model Inference (GPU-bound)	`gpu.t4` to `gpu.a100`	Lower max replicas, batching	Expensive hardware, optimize utilization
Post-processing (CPU-bound)	`cpu.small`	Moderate replicas	Lightweight, rarely the bottleneck
Ensemble Aggregator	`cpu.small`	Match inference scaling	Must handle combined output volume
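
To see what a scaling strategy implies for capacity, the replica bounds can be summed per instance type. The sketch below does this for the four services of the full deployment configuration, reduced to the two fields that matter here; `replica_bounds` is a hypothetical helper, and the result says nothing about price, only about how many machines of each type the autoscaler may hold.

```python
from collections import defaultdict

# Instance type and scaling bounds from the full deployment configuration.
services = {
    "TextPreprocessor": {"instance_type": "cpu.medium",
                         "scaling": {"min_replicas": 2, "max_replicas": 10}},
    "LLMInference": {"instance_type": "gpu.a100",
                     "scaling": {"min_replicas": 1, "max_replicas": 5}},
    "ResponseFormatter": {"instance_type": "cpu.small",
                          "scaling": {"min_replicas": 1, "max_replicas": 3}},
    "SafetyFilter": {"instance_type": "cpu.medium",
                     "scaling": {"min_replicas": 1, "max_replicas": 5}},
}


def replica_bounds(all_services: dict) -> dict[str, tuple[int, int]]:
    """Sum (min_replicas, max_replicas) across services, per instance type."""
    totals = defaultdict(lambda: [0, 0])
    for cfg in all_services.values():
        entry = totals[cfg["instance_type"]]
        entry[0] += cfg["scaling"]["min_replicas"]
        entry[1] += cfg["scaling"]["max_replicas"]
    return {instance: (lo, hi) for instance, (lo, hi) in totals.items()}


for instance, (lo, hi) in sorted(replica_bounds(services).items()):
    print(f"{instance}: {lo} to {hi} replicas")
# cpu.medium: 3 to 15 replicas
# cpu.small: 1 to 3 replicas
# gpu.a100: 1 to 5 replicas
```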

Source Files

src/bentoml/_internal/cloud/deployment.py:L256-297 -- Deployment configuration parsing

Relationship to Principle

This configuration pattern implements the Distributed Deployment Configuration principle by providing a YAML-based mechanism for independently configuring resources, scaling, and environment settings for each service in a BentoML composition.

Principle:Bentoml_BentoML_Distributed_Deployment_Configuration

Metadata

Knowledge Sources

2026-02-13 15:00 GMT
