Implementation: BentoML Ray Integration
| Knowledge Sources | |
|---|---|
| Domains | Ray, Deployment, Distributed Computing |
| Last Updated | 2026-02-13 15:00 GMT |
Overview
Provides integration between BentoML services and Ray Serve, enabling deployment of BentoML bentos and runners as Ray Serve deployments.
Description
This module bridges BentoML with Ray Serve by converting BentoML services and runners into Ray Serve deployment objects. The main entry point is the deployment() function, which accepts a BentoML service (or Bento tag) and produces a bound Ray Serve deployment. Internally, each BentoML runner is wrapped in a RunnerDeployment class that is registered as a separate Ray Serve deployment, while the BentoML API server is wrapped in a BentoDeployment class that routes HTTP requests through BentoML's HTTPAppFactory. The module supports Ray Serve's native batching via the @serve.batch decorator for runner methods that have batchable=True. Configuration for both the service deployment and individual runner deployments can be specified through dictionaries. The helper function get_bento_runtime_env() is a placeholder for creating Ray RuntimeEnv from Bento environment configs.
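The per-runner wiring described above can be sketched in plain Python. This is a simplified illustration, not BentoML's actual implementation; resolve_runner_configs is a hypothetical helper name showing how each runner is paired with its entry from runners_deployment_config_map:

```python
from __future__ import annotations

from typing import Any


def resolve_runner_configs(
    runner_names: list[str],
    runners_deployment_config_map: dict[str, dict[str, Any]] | None = None,
) -> dict[str, dict[str, Any]]:
    # Each BentoML runner becomes its own Ray Serve deployment; runners
    # without an explicit entry fall back to an empty config so that
    # Ray Serve's defaults apply.
    config_map = runners_deployment_config_map or {}
    return {name: config_map.get(name, {}) for name in runner_names}


# Example: only one of two runners has explicit Ray deployment config.
configs = resolve_runner_configs(
    ["iris_clf", "preprocessor"],
    {"iris_clf": {"num_replicas": 2, "ray_actor_options": {"num_cpus": 2}}},
)
```

Keeping one config entry per runner is what lets each runner scale independently of the API server deployment.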
Usage
Use this module when you want to deploy a BentoML service on a Ray cluster using Ray Serve. It allows you to leverage Ray's distributed computing capabilities for scaling BentoML runners independently.
Code Reference
Source Location
- Repository: Bentoml_BentoML
- File: src/bentoml/_internal/ray/__init__.py
- Lines: 1-219
Signature
def deployment(
    target: str | Tag | bentoml.Bento | bentoml.legacy.Service,
    service_deployment_config: dict[str, t.Any] | None = None,
    runners_deployment_config_map: dict[str, dict[str, t.Any]] | None = None,
    enable_batching: bool = False,
    batching_config: dict[str, dict[str, dict[str, float | int]]] | None = None,
) -> Deployment: ...

def get_bento_runtime_env(bento_tag: str | Tag) -> RuntimeEnv: ...

def _get_runner_deployment(
    svc: bentoml.legacy.Service,
    runner_name: str,
    runner_deployment_config: dict[str, t.Any],
    enable_batching: bool,
    batching_config: dict[str, dict[str, float | int]],
) -> Deployment: ...

def _get_service_deployment(svc: bentoml.legacy.Service, **kwargs: t.Any) -> Deployment: ...

def _deploy_bento_runners(
    svc: bentoml.legacy.Service,
    runners_deployment_config_map: dict | None = None,
    enable_batching: bool = False,
    batching_config: dict | None = None,
) -> dict[str, Deployment]: ...
Import
from bentoml._internal.ray import deployment
# or via the public API:
import bentoml
classifier = bentoml.ray.deployment("iris_classifier:latest")
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| target | str, Tag, Bento, or Service | Yes | A BentoML service instance, Bento tag string, or Bento object to deploy |
| service_deployment_config | dict[str, Any] or None | No | Ray deployment config for the BentoML API server (e.g., num_replicas, route_prefix) |
| runners_deployment_config_map | dict[str, dict[str, Any]] or None | No | Ray deployment config map keyed by runner name |
| enable_batching | bool | No | Enable Ray Serve batching for batchable runner methods; defaults to False |
| batching_config | dict or None | No | Ray batching config by runner name and method name (e.g., max_batch_size, batch_wait_timeout_s) |
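The batching_config parameter is a two-level nesting: runner name, then method name, then keyword arguments for Ray Serve's @serve.batch decorator. A minimal sketch of how such a structure can be consumed; batch_options is a hypothetical helper for illustration, not part of BentoML's API:

```python
from __future__ import annotations


def batch_options(
    batching_config: dict[str, dict[str, dict[str, float | int]]] | None,
    runner_name: str,
    method_name: str,
) -> dict[str, float | int]:
    # Two-level lookup: runner name, then method name. Missing keys
    # yield an empty dict, meaning no per-method overrides are applied.
    if not batching_config:
        return {}
    return batching_config.get(runner_name, {}).get(method_name, {})


# Mirrors the shape used in the usage examples below.
opts = batch_options(
    {
        "ieee-fraud-detection-sm": {
            "predict_proba": {"max_batch_size": 5, "batch_wait_timeout_s": 0.2}
        }
    },
    "ieee-fraud-detection-sm",
    "predict_proba",
)
```

Only methods declared with batchable=True in the runner are affected; other methods ignore these options.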
Outputs
| Name | Type | Description |
|---|---|---|
| deployment | ray.serve.Deployment | A bound Ray Serve Deployment, ready to be launched with serve.run() or the serve run CLI |
Usage Examples
import bentoml

# Basic deployment
classifier = bentoml.ray.deployment("iris_classifier:latest")

# Configured deployment with scaling
classifier = bentoml.ray.deployment(
    "iris_classifier:latest",
    {"route_prefix": "/hello", "num_replicas": 3, "ray_actor_options": {"num_cpus": 1}},
    {"iris_clf": {"num_replicas": 1, "ray_actor_options": {"num_cpus": 5}}},
)

# With batching enabled
deploy = bentoml.ray.deployment(
    "fraud_detection:latest",
    {"num_replicas": 5, "ray_actor_options": {"num_cpus": 1}},
    {"ieee-fraud-detection-sm": {"num_replicas": 1, "ray_actor_options": {"num_cpus": 5}}},
    enable_batching=True,
    batching_config={
        "ieee-fraud-detection-sm": {
            "predict_proba": {"max_batch_size": 5, "batch_wait_timeout_s": 0.2}
        }
    },
)