Implementation:Bentoml BentoML Per Service Config

From Leeroopedia

Overview

Per Service Config is the YAML-based configuration pattern for specifying per-service resources, scaling, and environment settings in BentoML distributed deployments. Each service in a multi-service composition can be independently configured with its own instance type, autoscaling parameters, resource overrides, and environment variables.

This page is a Pattern Doc: it describes the YAML configuration structure for per-service deployment settings.

Interface Specification

Source Location

src/bentoml/_internal/cloud/deployment.py:L256-297

YAML Configuration Structure

The deployment configuration uses a services: section where each key is a service class name and the value contains that service's configuration.

services:
  Preprocessing:
    instance_type: cpu.small
    scaling:
      min_replicas: 1
      max_replicas: 5
  InferenceModel:
    instance_type: gpu.a100
    scaling:
      min_replicas: 1
      max_replicas: 3
    config_overrides:
      resources:
        gpu: 1
  Postprocessing:
    instance_type: cpu.small
    scaling:
      min_replicas: 1
      max_replicas: 5

Configuration Parameters Per Service

| Parameter | Type | Description |
|---|---|---|
| instance_type | string | The hardware profile for this service (e.g., cpu.small, cpu.large, gpu.t4, gpu.a100). Determines the underlying machine type. |
| scaling.min_replicas | integer | Minimum number of replicas. The service will never scale below this count, ensuring baseline availability. |
| scaling.max_replicas | integer | Maximum number of replicas. The autoscaler will not exceed this count, providing cost control. |
| config_overrides | object | Fine-grained resource configuration that overrides defaults from the instance type. |
| config_overrides.resources | object | Resource requests including gpu (integer), memory (string, e.g., "16Gi"), and cpu (string or integer). |
| envs | list | Environment variables specific to this service. Each entry has name and value keys. |
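
The parameter rules above can be checked before a deployment is submitted. The following is a minimal sketch, not part of BentoML: `validate_service_config` is a hypothetical helper, it assumes the `services:` section has already been parsed into a plain Python dict, and it assumes a replica count of zero is acceptable.

```python
from typing import Any


def validate_service_config(name: str, cfg: dict[str, Any]) -> list[str]:
    """Return human-readable problems found in one service's config (hypothetical helper)."""
    problems: list[str] = []

    instance_type = cfg.get("instance_type")
    if instance_type is not None and not isinstance(instance_type, str):
        problems.append(f"{name}: instance_type must be a string")

    scaling = cfg.get("scaling", {})
    lo = scaling.get("min_replicas")
    hi = scaling.get("max_replicas")
    for key, value in (("min_replicas", lo), ("max_replicas", hi)):
        # bool is a subclass of int in Python, so it is rejected explicitly.
        if value is not None and (
            isinstance(value, bool) or not isinstance(value, int) or value < 0
        ):
            problems.append(f"{name}: scaling.{key} must be a non-negative integer")
    if isinstance(lo, int) and isinstance(hi, int) and lo > hi:
        problems.append(f"{name}: min_replicas ({lo}) exceeds max_replicas ({hi})")

    for env in cfg.get("envs", []):
        if not isinstance(env, dict) or "name" not in env or "value" not in env:
            problems.append(f"{name}: each envs entry needs 'name' and 'value' keys")

    return problems


# Two entries of a `services:` section as a dict; the second is deliberately invalid.
services = {
    "Preprocessing": {
        "instance_type": "cpu.small",
        "scaling": {"min_replicas": 1, "max_replicas": 5},
    },
    "InferenceModel": {
        "instance_type": "gpu.a100",
        "scaling": {"min_replicas": 4, "max_replicas": 3},
    },
}

all_problems = [
    problem
    for name, cfg in services.items()
    for problem in validate_service_config(name, cfg)
]
print(all_problems)
```

Returning a list of problems rather than raising on the first one lets every misconfigured service be reported in a single pass.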

Example Implementations

Full Deployment Configuration

# bentoml_deployment.yaml
name: my-ml-pipeline
bento: ./
access_authorization: false
services:
  TextPreprocessor:
    instance_type: cpu.medium
    scaling:
      min_replicas: 2
      max_replicas: 10
    envs:
      - name: TOKENIZER_PATH
        value: /models/tokenizer
      - name: MAX_SEQUENCE_LENGTH
        value: "512"

  LLMInference:
    instance_type: gpu.a100
    scaling:
      min_replicas: 1
      max_replicas: 5
    config_overrides:
      resources:
        gpu: 1
        memory: "32Gi"
    envs:
      - name: MODEL_NAME
        value: "llama-7b"
      - name: PRECISION
        value: "float16"

  ResponseFormatter:
    instance_type: cpu.small
    scaling:
      min_replicas: 1
      max_replicas: 3

  SafetyFilter:
    instance_type: cpu.medium
    scaling:
      min_replicas: 1
      max_replicas: 5
    envs:
      - name: SAFETY_THRESHOLD
        value: "0.85"
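
Numeric settings such as MAX_SEQUENCE_LENGTH and SAFETY_THRESHOLD are quoted in the configuration above because environment variables are always strings. A minimal sketch of how a service might consume them follows; `envs_to_mapping` and `read_int_env` are hypothetical helpers introduced for illustration, and the default of 256 is an arbitrary assumption.

```python
import os
from collections.abc import Mapping


def envs_to_mapping(envs: list[dict[str, str]]) -> dict[str, str]:
    """Convert a list of {name, value} entries into a name -> value mapping."""
    return {entry["name"]: entry["value"] for entry in envs}


def read_int_env(name: str, default: int, environ: Mapping[str, str] = os.environ) -> int:
    """Read an integer setting, falling back to `default` when the variable is unset."""
    raw = environ.get(name)
    return default if raw is None else int(raw)


# The envs list of TextPreprocessor, as it would look after parsing the YAML.
text_preprocessor_envs = [
    {"name": "TOKENIZER_PATH", "value": "/models/tokenizer"},
    {"name": "MAX_SEQUENCE_LENGTH", "value": "512"},
]

env = envs_to_mapping(text_preprocessor_envs)
max_len = read_int_env("MAX_SEQUENCE_LENGTH", 256, env)
print(max_len)
```

Inside a deployed service the third argument would be omitted so that the value is read from `os.environ`.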

Corresponding Service Code

import bentoml

@bentoml.service(resources={"cpu": "2", "memory": "4Gi"})
class TextPreprocessor:
    @bentoml.api
    def preprocess(self, text: str) -> list[int]:
        ...

@bentoml.service(resources={"gpu": 1, "memory": "32Gi"})
class LLMInference:
    @bentoml.api
    def generate(self, tokens: list[int]) -> str:
        ...

@bentoml.service(resources={"cpu": "1"})
class ResponseFormatter:
    @bentoml.api
    def format(self, raw_response: str) -> dict:
        ...

@bentoml.service(resources={"cpu": "2"})
class SafetyFilter:
    @bentoml.api
    def check(self, response: str) -> dict:
        ...

@bentoml.service
class Pipeline:
    preprocessor = bentoml.depends(TextPreprocessor)
    llm = bentoml.depends(LLMInference)
    formatter = bentoml.depends(ResponseFormatter)
    safety = bentoml.depends(SafetyFilter)

    @bentoml.api
    async def generate(self, text: str) -> dict:
        tokens = await self.preprocessor.to_async.preprocess(text)
        raw_response = await self.llm.to_async.generate(tokens)
        formatted = await self.formatter.to_async.format(raw_response)
        checked = await self.safety.to_async.check(raw_response)
        return {**formatted, "safety": checked}
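
Because each key under services: is a service class name, a misspelled key will not correspond to any class in the code. The check below is a minimal sketch under stated assumptions: the class names are listed by hand rather than discovered from BentoML, `find_unknown_services` is a hypothetical helper, and the misspelled key is deliberate.

```python
import difflib


def find_unknown_services(config_keys, service_names):
    """Map each config key with no matching service class to its closest known name, or None."""
    known = list(service_names)
    suggestions = {}
    for key in config_keys:
        if key not in known:
            close = difflib.get_close_matches(key, known, n=1)
            suggestions[key] = close[0] if close else None
    return suggestions


# Class names from the service code above, listed by hand.
service_names = [
    "TextPreprocessor",
    "LLMInference",
    "ResponseFormatter",
    "SafetyFilter",
    "Pipeline",
]

# Keys of a `services:` section containing one deliberate typo.
config_keys = ["TextPreprocessor", "LLMInference", "ResponseFormatter", "SaftyFilter"]

unknown = find_unknown_services(config_keys, service_names)
print(unknown)
```

The check only flags config keys that match no class. A class without a config entry, such as Pipeline in the example above, is not reported.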

Deploying with the Configuration

# Deploy to BentoCloud with per-service config
bentoml deploy -f bentoml_deployment.yaml

Resource Allocation Strategy

| Service Type | Recommended Instance | Scaling Strategy | Rationale |
|---|---|---|---|
| Preprocessing (CPU-bound) | cpu.small to cpu.medium | Higher max replicas | Fast per-request, scale horizontally |
| Model Inference (GPU-bound) | gpu.t4 to gpu.a100 | Lower max replicas, batching | Expensive hardware, optimize utilization |
| Post-processing (CPU-bound) | cpu.small | Moderate replicas | Lightweight, rarely the bottleneck |
| Ensemble Aggregator | cpu.small | Match inference scaling | Must handle combined output volume |
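
Since min_replicas sets the baseline and max_replicas caps cost, summing them across services gives a rough floor and ceiling for the whole deployment. The sketch below applies this to the scaling values from the full deployment configuration; `replica_bounds` is a hypothetical helper, and treating any instance type that starts with "gpu." as GPU hardware is an assumption based on the naming used on this page.

```python
def replica_bounds(services, instance_prefix=""):
    """Sum min and max replicas over services whose instance_type starts with the prefix."""
    selected = [
        cfg
        for cfg in services.values()
        if cfg["instance_type"].startswith(instance_prefix)
    ]
    lowest = sum(cfg["scaling"]["min_replicas"] for cfg in selected)
    highest = sum(cfg["scaling"]["max_replicas"] for cfg in selected)
    return lowest, highest


# Instance types and scaling values copied from the full deployment configuration.
services = {
    "TextPreprocessor": {
        "instance_type": "cpu.medium",
        "scaling": {"min_replicas": 2, "max_replicas": 10},
    },
    "LLMInference": {
        "instance_type": "gpu.a100",
        "scaling": {"min_replicas": 1, "max_replicas": 5},
    },
    "ResponseFormatter": {
        "instance_type": "cpu.small",
        "scaling": {"min_replicas": 1, "max_replicas": 3},
    },
    "SafetyFilter": {
        "instance_type": "cpu.medium",
        "scaling": {"min_replicas": 1, "max_replicas": 5},
    },
}

total = replica_bounds(services)             # all services
gpu_only = replica_bounds(services, "gpu.")  # GPU-backed services only
print(total, gpu_only)
```

Separating the GPU total from the overall total makes it easier to see how much of the worst-case footprint sits on the expensive hardware.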

Source Files

  • src/bentoml/_internal/cloud/deployment.py:L256-297 -- Deployment configuration parsing

Relationship to Principle

This configuration pattern implements the Distributed Deployment Configuration principle by providing a YAML-based mechanism for independently configuring resources, scaling, and environment settings for each service in a BentoML composition.

Principle:Bentoml_BentoML_Distributed_Deployment_Configuration

Metadata

Knowledge Sources

2026-02-13 15:00 GMT
