Workflow: BentoML BentoCloud Deployment
| Knowledge Sources | |
|---|---|
| Domains | ML_Serving, Cloud_Deployment, ML_Ops |
| Last Updated | 2026-02-13 15:00 GMT |
Overview
End-to-end process for deploying a BentoML Service to BentoCloud, the managed inference platform, including authentication, deployment creation, endpoint invocation, and lifecycle management.
Description
This workflow covers the full deployment lifecycle on BentoCloud, from initial authentication through deployment creation, monitoring, updating, and teardown. BentoCloud provides managed compute infrastructure with automatic scaling, GPU allocation, and observability. The bentoml deploy command handles the full pipeline of building a Bento, pushing it to the cloud registry, and creating a deployment in a single step. Once deployed, services are accessible via HTTPS endpoints and can be managed through both CLI and Python APIs.
Key capabilities covered:
- BentoCloud authentication and API token management
- One-command deployment with bentoml deploy
- Deployment configuration (scaling, instance types, secrets, environment variables)
- Endpoint invocation via HTTP clients
- Deployment lifecycle management (update, scale, terminate, delete)
Usage
Execute this workflow when you have a working BentoML Service and need to deploy it to a managed cloud environment with automatic scaling, GPU support, and production-grade infrastructure. This is the recommended path for teams that want to avoid managing their own Kubernetes or Docker infrastructure.
Execution Steps
Step 1: Authenticate with BentoCloud
Sign up for a BentoCloud account and authenticate the local BentoML CLI. Run bentoml cloud login, which guides you through creating a new API token via the web browser or pasting an existing one. The credentials are stored locally for subsequent CLI commands.
Key considerations:
- API tokens can be created and managed via the BentoCloud web console
- Tokens can be scoped with different permission levels
- The BENTOML_API_TOKEN environment variable can be used as an alternative to interactive login
- Multiple contexts (organizations/clusters) can be configured
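Assuming the BentoML CLI is already installed, the authentication flow above looks roughly like this; the exact prompts and subcommands may vary by CLI version, so check bentoml cloud --help on your installation:

```shell
# Interactive login: guides you through creating or pasting an API token
bentoml cloud login

# Non-interactive alternative (e.g. for CI): export the token instead
# (replace the placeholder with a token created in the BentoCloud console)
export BENTOML_API_TOKEN=<your-token>

# Confirm which context (organization/cluster) the CLI is pointing at
bentoml cloud current-context
```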
Step 2: Configure the Deployment
Define deployment settings including resource allocation, scaling policies, environment variables, and secrets. Configuration can be specified inline via CLI flags, through a YAML configuration file, or via the Python API. For distributed services with multiple components, a configuration file is recommended to specify per-service settings.
Key considerations:
- Instance types determine CPU/GPU/memory allocation (e.g., gpu.l4.1, cpu.4)
- Scaling policies set minimum and maximum replica counts
- Secrets securely inject sensitive values (API keys, credentials)
- Environment variables configure runtime behavior
- Use bentoml deployment list-instance-types to see available resources
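A configuration file passed to bentoml deploy -f might look like the sketch below. The deployment name, service name, secret, and environment variable are hypothetical, and the exact key names can vary between BentoCloud versions, so validate this against the schema your CLI accepts:

```yaml
# deployment.yaml -- illustrative sketch, not an authoritative schema
name: summarizer-demo        # deployment name (hypothetical)
bento: .                     # build the Bento from the current project directory
services:
  Summarization:             # per-service settings for a distributed Service
    instance_type: gpu.l4.1  # pick from bentoml deployment list-instance-types
    scaling:
      min_replicas: 1        # set to 0 for scale-to-zero
      max_replicas: 3
envs:
  - name: LOG_LEVEL          # plain environment variable
    value: debug
secrets:
  - huggingface-token        # secret created beforehand in BentoCloud
```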
Step 3: Deploy the Service
Run bentoml deploy from the project directory to execute the full deployment pipeline. This command automatically builds the Bento, pushes it to the BentoCloud registry, and creates a deployment. Optionally specify a name with the -n flag and a configuration file with the -f flag. The Python API equivalent is bentoml.deployment.create().
Key considerations:
- bentoml deploy combines build, push, and deploy into one command
- The first deployment may take longer as the container image is built
- Use -n to set a custom deployment name
- Use -f config.yaml to apply a full deployment configuration
- The command waits until the deployment is ready by default
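In practice the deploy step is a single command; the deployment name and config file below are the hypothetical ones from earlier, not required arguments:

```shell
# One command: build the Bento, push it, and create the deployment
bentoml deploy

# With an explicit name and a full deployment configuration file
bentoml deploy -n summarizer-demo -f deployment.yaml
```

The Python equivalent, bentoml.deployment.create(), accepts comparable arguments for scripted deployments.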
Step 4: Invoke Deployment Endpoints
Once the deployment is ready, access the service via its HTTPS endpoint URL. Retrieve the URL with bentoml deployment get <name> or from the BentoCloud console. Use SyncHTTPClient or AsyncHTTPClient with the deployment URL to make inference calls programmatically, or use standard HTTP tools.
Key considerations:
- Deployment URLs follow the pattern https://<name>-<hash>.<region>.bentoml.ai
- The BentoML client handles authentication automatically when logged in
- Access authorization can be enabled to require API tokens for endpoint access
- The BentoCloud console provides a Playground for interactive testing
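Once the deployment is ready, any HTTP client can call it. In the sketch below the route name and JSON payload are hypothetical (they depend on what your Service exposes), and the Authorization header is only needed when access authorization is enabled:

```shell
# Retrieve the endpoint URL (also shown in the BentoCloud console)
bentoml deployment get summarizer-demo

# Call a Service route directly; route and payload are placeholders
curl -s -X POST "https://<name>-<hash>.<region>.bentoml.ai/summarize" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $BENTOML_API_TOKEN" \
  -d '{"text": "BentoCloud is a managed inference platform."}'
```

For programmatic access, the SyncHTTPClient and AsyncHTTPClient mentioned above wrap the same HTTP calls and handle authentication automatically when logged in.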
Step 5: Manage the Deployment
Use CLI commands or the Python API to manage the deployment lifecycle. Update the deployment with new code or configuration changes using bentoml deployment update. Adjust scaling with --scaling-min and --scaling-max flags. Monitor deployment status, logs, and metrics through the BentoCloud console.
Key considerations:
- bentoml deployment update applies code or configuration changes
- bentoml deployment list shows all active deployments
- bentoml deployment get <name> retrieves detailed deployment information
- Scaling can be set to min=0 for scale-to-zero capability
- Canary deployments are supported for gradual rollouts
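The lifecycle commands above, using the hypothetical deployment name from earlier (flag spellings follow the text; confirm with bentoml deployment update --help):

```shell
# Inspect deployments
bentoml deployment list
bentoml deployment get summarizer-demo

# Roll out new code or configuration from the project directory
bentoml deployment update summarizer-demo

# Adjust replica bounds; a minimum of 0 enables scale-to-zero
bentoml deployment update summarizer-demo --scaling-min 0 --scaling-max 5
```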
Step 6: Terminate and Clean Up
Stop and remove deployments that are no longer needed. Use bentoml deployment terminate <name> to stop a deployment (preserving its configuration) or bentoml deployment delete <name> to permanently remove it. Clean up associated Bentos and models from the cloud registry as needed.
Key considerations:
- Terminate stops the deployment but retains its configuration
- Delete permanently removes the deployment record
- Associated Bentos remain in the cloud registry after deployment deletion
- Use the BentoCloud console or CLI for Bento and model cleanup
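Teardown mirrors the two options described above, again using the hypothetical deployment name:

```shell
# Stop the deployment but retain its configuration for later redeployment
bentoml deployment terminate summarizer-demo

# Permanently remove the deployment record
bentoml deployment delete summarizer-demo
```

Bentos and models pushed to the cloud registry are not removed by either command; clean those up separately via the BentoCloud console or CLI.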