Workflow: BentoML BentoCloud Deployment
| Knowledge Sources | |
|---|---|
| Domains | ML_Serving, Cloud_Deployment, ML_Ops |
| Last Updated | 2026-02-13 15:00 GMT |
Overview
End-to-end process for deploying a BentoML Service to BentoCloud, the managed inference platform, including authentication, deployment creation, endpoint invocation, and lifecycle management.
Description
This workflow covers the full deployment lifecycle on BentoCloud, from initial authentication through deployment creation, monitoring, updating, and teardown. BentoCloud provides managed compute infrastructure with automatic scaling, GPU allocation, and observability. The bentoml deploy command handles the full pipeline of building a Bento, pushing it to the cloud registry, and creating a deployment in a single step. Once deployed, services are accessible via HTTPS endpoints and can be managed through both CLI and Python APIs.
Key capabilities covered:
- BentoCloud authentication and API token management
- One-command deployment with bentoml deploy
- Deployment configuration (scaling, instance types, secrets, environment variables)
- Endpoint invocation via HTTP clients
- Deployment lifecycle management (update, scale, terminate, delete)
Usage
Execute this workflow when you have a working BentoML Service and need to deploy it to a managed cloud environment with automatic scaling, GPU support, and production-grade infrastructure. This is the recommended path for teams that want to avoid managing their own Kubernetes or Docker infrastructure.
Execution Steps
Step 1: Authenticate with BentoCloud
Sign up for a BentoCloud account and authenticate the local BentoML CLI. Run bentoml cloud login, which guides you through creating a new API token via the web browser or pasting an existing one. The credentials are stored locally for subsequent CLI commands.
Key considerations:
- API tokens can be created and managed via the BentoCloud web console
- Tokens can be scoped with different permission levels
- The BENTOML_API_TOKEN environment variable can be used as an alternative to interactive login
- Multiple contexts (organizations/clusters) can be configured
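Assuming the BentoML CLI is already installed, the authentication flow above looks roughly like this; the exact prompts and subcommands may vary by CLI version, so check bentoml cloud --help on your installation:

```shell
# Interactive login: guides you through creating or pasting an API token
bentoml cloud login

# Non-interactive alternative (e.g. for CI): export the token instead
# (replace the placeholder with a token created in the BentoCloud console)
export BENTOML_API_TOKEN=<your-token>

# Confirm which context (organization/cluster) the CLI is pointing at
bentoml cloud current-context
```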
Step 2: Configure the Deployment
Define deployment settings including resource allocation, scaling policies, environment variables, and secrets. Configuration can be specified inline via CLI flags, through a YAML configuration file, or via the Python API. For distributed services with multiple components, a configuration file is recommended to specify per-service settings.
Key considerations:
- Instance types determine CPU/GPU/memory allocation (e.g., gpu.l4.1, cpu.4)
- Scaling policies set minimum and maximum replica counts
- Secrets securely inject sensitive values (API keys, credentials)
- Environment variables configure runtime behavior
- Use bentoml deployment list-instance-types to see available resources
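A configuration file passed to bentoml deploy -f might look like the sketch below. The deployment name, service name, secret, and environment variable are hypothetical, and the exact key names can vary between BentoCloud versions, so validate this against the schema your CLI accepts:

```yaml
# deployment.yaml -- illustrative sketch, not an authoritative schema
name: summarizer-demo        # deployment name (hypothetical)
bento: .                     # build the Bento from the current project directory
services:
  Summarization:             # per-service settings for a distributed Service
    instance_type: gpu.l4.1  # pick from bentoml deployment list-instance-types
    scaling:
      min_replicas: 1        # set to 0 for scale-to-zero
      max_replicas: 3
envs:
  - name: LOG_LEVEL          # plain environment variable
    value: debug
secrets:
  - huggingface-token        # secret created beforehand in BentoCloud
```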
Step 3: Deploy the Service
Run bentoml deploy from the project directory to execute the full deployment pipeline. This command automatically builds the Bento, pushes it to the BentoCloud registry, and creates a deployment. Optionally specify a name with the -n flag and a configuration file with the -f flag. The Python API equivalent is bentoml.deployment.create().
Key considerations:
- bentoml deploy combines build, push, and deploy into one command
- The first deployment may take longer as the container image is built
- Use -n to set a custom deployment name
- Use -f config.yaml to apply a full deployment configuration
- The command waits until the deployment is ready by default
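In practice the deploy step is a single command; the deployment name and config file below are the hypothetical ones from earlier, not required arguments:

```shell
# One command: build the Bento, push it, and create the deployment
bentoml deploy

# With an explicit name and a full deployment configuration file
bentoml deploy -n summarizer-demo -f deployment.yaml
```

The Python equivalent, bentoml.deployment.create(), accepts comparable arguments for scripted deployments.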
Step 4: Invoke Deployment Endpoints
Once the deployment is ready, access the service via its HTTPS endpoint URL. Retrieve the URL with bentoml deployment get <name> or from the BentoCloud console. Use SyncHTTPClient or AsyncHTTPClient with the deployment URL to make inference calls programmatically, or use standard HTTP tools.
Key considerations:
- Deployment URLs follow the pattern https://<name>-<hash>.<region>.bentoml.ai
- The BentoML client handles authentication automatically when logged in
- Access authorization can be enabled to require API tokens for endpoint access
- The BentoCloud console provides a Playground for interactive testing
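Once the deployment is ready, any HTTP client can call it. In the sketch below the route name and JSON payload are hypothetical (they depend on what your Service exposes), and the Authorization header is only needed when access authorization is enabled:

```shell
# Retrieve the endpoint URL (also shown in the BentoCloud console)
bentoml deployment get summarizer-demo

# Call a Service route directly; route and payload are placeholders
curl -s -X POST "https://<name>-<hash>.<region>.bentoml.ai/summarize" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $BENTOML_API_TOKEN" \
  -d '{"text": "BentoCloud is a managed inference platform."}'
```

For programmatic access, the SyncHTTPClient and AsyncHTTPClient mentioned above wrap the same HTTP calls and handle authentication automatically when logged in.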
Step 5: Manage the Deployment
Use CLI commands or the Python API to manage the deployment lifecycle. Update the deployment with new code or configuration changes using bentoml deployment update. Adjust scaling with --scaling-min and --scaling-max flags. Monitor deployment status, logs, and metrics through the BentoCloud console.
Key considerations:
- bentoml deployment update applies code or configuration changes
- bentoml deployment list shows all active deployments
- bentoml deployment get <name> retrieves detailed deployment information
- Scaling can be set to min=0 for scale-to-zero capability
- Canary deployments are supported for gradual rollouts
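The lifecycle commands above, using the hypothetical deployment name from earlier (flag spellings follow the text; confirm with bentoml deployment update --help):

```shell
# Inspect deployments
bentoml deployment list
bentoml deployment get summarizer-demo

# Roll out new code or configuration from the project directory
bentoml deployment update summarizer-demo

# Adjust replica bounds; a minimum of 0 enables scale-to-zero
bentoml deployment update summarizer-demo --scaling-min 0 --scaling-max 5
```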
Step 6: Terminate and Clean Up
Stop and remove deployments that are no longer needed. Use bentoml deployment terminate <name> to stop a deployment (preserving its configuration) or bentoml deployment delete <name> to permanently remove it. Clean up associated Bentos and models from the cloud registry as needed.
Key considerations:
- Terminate stops the deployment but retains its configuration
- Delete permanently removes the deployment record
- Associated Bentos remain in the cloud registry after deployment deletion
- Use the BentoCloud console or CLI for Bento and model cleanup
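Teardown mirrors the two options described above, again using the hypothetical deployment name:

```shell
# Stop the deployment but retain its configuration for later redeployment
bentoml deployment terminate summarizer-demo

# Permanently remove the deployment record
bentoml deployment delete summarizer-demo
```

Bentos and models pushed to the cloud registry are not removed by either command; clean those up separately via the BentoCloud console or CLI.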