Workflow:Ray project Ray Serve Deployment
| Knowledge Sources | |
|---|---|
| Domains | Model_Serving, ML_Ops, Distributed_Computing |
| Last Updated | 2026-02-13 16:00 GMT |
Overview
End-to-end process for deploying, scaling, and serving machine learning models or business logic as HTTP endpoints using Ray Serve's deployment framework.
Description
This workflow covers the complete process of deploying applications with Ray Serve. It starts with initializing the Serve control plane (controller actor, HTTP proxies), proceeds through defining deployment classes with scaling and resource configurations, deploying them as applications via the DAG-based composition system, invoking deployments through handles or HTTP, and managing the deployment lifecycle. Ray Serve builds on top of Ray actors, providing autoscaling, health checking, rolling updates, and request routing across deployment replicas.
Usage
Execute this workflow when you need to serve machine learning models or computation-heavy business logic behind an HTTP API with production-grade features such as autoscaling, replica management, and request routing. It applies to real-time inference, online prediction services, multi-model composition pipelines, and any other scenario that requires scalable request-response serving.
Execution Steps
Step 1: Initialize Serve Control Plane
Start the Ray Serve system by calling the Serve start API. This initializes or connects to the Serve controller, a detached actor that manages all deployment state. The call first checks for an existing controller. If none exists, it initializes Ray (if needed), creates a controller avatar actor for health checking, and waits for the HTTP proxy actors to become ready. It then returns a ServeControllerClient for deployment management.
Key considerations:
- Controller is a detached actor with infinite restarts for fault tolerance
- All Serve actors run in a dedicated namespace for isolation
- HTTP proxies must become healthy before start returns (with configurable timeout)
- If a controller already exists, the call reconnects rather than creating a new one
Step 2: Define Deployment Class
Create the class that will handle requests. The deployment class contains the business logic (model inference, computation, etc.) and can maintain internal state. It receives requests through a designated call method. The class is packaged with its configuration (replica count, resource requirements, autoscaling parameters) via the DeploymentCreator builder.
Key considerations:
- Deployment classes are instantiated as Ray actors (one per replica)
- Each replica maintains independent state
- The class name is used as the deployment identifier by default
- Constructor arguments configure model loading or initialization
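As a rough illustration of the shape of a deployment class, the sketch below uses a plain Python class with no Ray dependency. The class name `SentimentModel`, its word-list "model", and the use of `__call__` as the request entry point are assumptions made for the example; wiring the class into Ray Serve is deliberately omitted.

```python
from typing import Dict, Iterable


class SentimentModel:
    """Hypothetical deployment class: the constructor loads state, __call__ serves a request."""

    def __init__(self, positive_words: Iterable[str]) -> None:
        # Stands in for model loading driven by constructor arguments.
        self.positive_words = set(positive_words)
        # Per-replica state: each instance keeps its own counter.
        self.requests_served = 0

    def __call__(self, text: str) -> Dict[str, object]:
        self.requests_served += 1
        hits = sum(1 for word in text.lower().split() if word in self.positive_words)
        return {"positive": hits > 0, "hits": hits}


# Two instances model two replicas; in Serve each would live in its own actor.
replicas = [SentimentModel(["good", "great"]) for _ in range(2)]
first_response = replicas[0]("A great day")
```

Only the first instance has handled a request here, so the second instance's counter stays at zero, which mirrors the point that replicas do not share state.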
Step 3: Configure and Bind Deployment
Use the deployment builder to set the deployment name, number of replicas, resource allocation per replica, autoscaling configuration, and any constructor arguments. Calling bind() on the builder creates an Application object wrapping a DAG node. Multiple deployments can be composed into a DAG for model composition pipelines.
Key considerations:
- Replica count controls horizontal scaling
- Autoscaling config enables dynamic scaling based on request load
- Route prefix must start with "/" if specified
- Version strings enable rolling updates
- Multiple deployments can be bound into a graph using DAG composition
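The configuration rules and the DAG composition above can be modeled with a small, dependency-free sketch. `ToyDeploymentConfig`, `ToyNode`, and `collect_deployments` are hypothetical names, and the validation rules are simplified assumptions (only the route-prefix rule comes from the text above); this is not the DeploymentCreator API.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Tuple


@dataclass(frozen=True)
class ToyDeploymentConfig:
    """Hypothetical per-deployment settings with basic validation."""
    name: str
    num_replicas: int = 1
    route_prefix: Optional[str] = None
    version: Optional[str] = None
    init_args: Tuple[Any, ...] = ()

    def __post_init__(self) -> None:
        if not self.name:
            raise ValueError("deployment name must not be empty")
        if self.num_replicas < 0:  # toy rule, an assumption
            raise ValueError("num_replicas must not be negative")
        if self.route_prefix is not None and not self.route_prefix.startswith("/"):
            raise ValueError("route_prefix must start with '/'")


@dataclass(frozen=True)
class ToyNode:
    """Hypothetical DAG node: a deployment plus the deployments it depends on."""
    config: ToyDeploymentConfig
    children: Tuple["ToyNode", ...] = ()


def collect_deployments(root: ToyNode) -> List[str]:
    """Return deployment names, dependencies first, without duplicates."""
    names: List[str] = []

    def visit(node: ToyNode) -> None:
        for child in node.children:
            visit(child)
        if node.config.name not in names:
            names.append(node.config.name)

    visit(root)
    return names


preprocess = ToyNode(ToyDeploymentConfig("preprocess", num_replicas=2))
model = ToyNode(ToyDeploymentConfig("model", version="v1"))
ingress = ToyNode(
    ToyDeploymentConfig("ingress", route_prefix="/predict"),
    children=(preprocess, model),
)
deployment_names = collect_deployments(ingress)
```

Walking the graph from the ingress node yields the dependencies before the ingress itself, which is one plausible order for creating them.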
Step 4: Deploy Application
Submit the application to the Serve controller by calling the run API. This extracts deployments from the DAG structure, assigns version identifiers, and instructs the controller to create or update deployment replicas. The controller manages replica lifecycle, including creating new replicas, draining old ones during updates, and maintaining the desired replica count.
Key considerations:
- Deployment is managed by the controller as a long-running process
- Returns a DeploymentHandle pointing to the ingress deployment
- Blocking mode waits until all replicas are healthy
- Non-blocking mode returns immediately while deployment proceeds
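To make the update behavior concrete, here is a minimal sketch of replacing replicas one at a time while keeping the replica count constant. It models replicas as version strings and is an illustrative assumption, not the controller's actual algorithm; `rolling_update` is a hypothetical helper.

```python
from typing import List


def rolling_update(replica_versions: List[str], new_version: str) -> List[List[str]]:
    """Return the list of replica versions after each replacement step."""
    current = list(replica_versions)
    history = [list(current)]  # state before the update starts
    for index, version in enumerate(current):
        if version == new_version:
            continue  # already up to date, nothing to drain
        # Toy step: a new replica becomes healthy, then the old one is drained.
        current[index] = new_version
        history.append(list(current))
    return history


steps = rolling_update(["v1", "v1", "v1"], "v2")
for step in steps:
    print(step)
```

Every intermediate state still contains three replicas, which is the property that lets a rolling update keep serving traffic.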
Step 5: Send Requests to Deployment
Invoke the deployed application through the DeploymentHandle (programmatic) or HTTP endpoint (external clients). The handle's remote method submits requests to the router, which selects a healthy replica from the ReplicaSet and forwards the request. Responses are returned as DeploymentResponse objects (futures). For HTTP, the proxy actor handles request parsing and routing.
Key considerations:
- Programmatic handles can route a request to a specific method of the deployment class instead of the default call method
- Parallel request submission (collect responses, then aggregate) is more efficient than sequential blocking
- The router load-balances across available replicas
- Long-poll synchronization keeps replica sets current with controller state
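The submit-then-collect pattern can be illustrated with the standard library alone. In the sketch below, `ToyRouter` and `make_replica` are hypothetical, futures from `concurrent.futures` stand in for DeploymentResponse objects, and round-robin selection is a stand-in that may differ from Serve's actual routing policy.

```python
import itertools
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, List


class ToyRouter:
    """Hypothetical router: picks replicas round-robin and returns futures."""

    def __init__(self, replicas: List[Callable[[int], Dict[str, int]]],
                 max_workers: int = 4) -> None:
        self._cycle = itertools.cycle(replicas)
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def remote(self, request: int) -> Future:
        replica = next(self._cycle)  # selection happens at submission time
        return self._pool.submit(replica, request)

    def shutdown(self) -> None:
        self._pool.shutdown(wait=True)


def make_replica(replica_id: int) -> Callable[[int], Dict[str, int]]:
    """Build a toy replica that squares its input and reports its own id."""
    def handle(x: int) -> Dict[str, int]:
        return {"replica": replica_id, "result": x * x}
    return handle


router = ToyRouter([make_replica(0), make_replica(1)])

# Submit everything first, then collect: no request waits on another's result.
futures = [router.remote(x) for x in range(4)]
responses = [f.result() for f in futures]
router.shutdown()

total = sum(r["result"] for r in responses)
replicas_used = [r["replica"] for r in responses]
```

Calling `result()` inside the submission loop would instead serialize the requests, which is the sequential blocking pattern the list above advises against.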
Step 6: Monitor and Scale
The Serve controller continuously monitors deployment health, adjusts replica counts based on autoscaling policies, and handles replica failures. Metrics are exposed for monitoring systems. Deployments can be updated in-place by re-running with new configurations, triggering rolling updates.
Key considerations:
- Autoscaling responds to queue depth and request rate
- Failed replicas are automatically replaced up to configured limits
- Deployment updates use a rolling strategy to minimize downtime
- Metrics include request latency, queue size, and replica count
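One simple way to think about load-based scaling is to divide the observed load by a per-replica target and clamp the result to configured bounds. The function below is a simplified sketch under that assumption; `desired_replicas` and its parameter names are hypothetical, and a real policy would also smooth the signal and delay scale-down decisions.

```python
import math


def desired_replicas(total_ongoing_requests: int, target_per_replica: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Return the replica count needed to keep per-replica load near the target."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    if min_replicas > max_replicas:
        raise ValueError("min_replicas must not exceed max_replicas")
    raw = math.ceil(total_ongoing_requests / target_per_replica)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, raw))


moderate_load = desired_replicas(25, 10, min_replicas=1, max_replicas=5)
idle = desired_replicas(0, 10, min_replicas=1, max_replicas=5)
overload = desired_replicas(500, 10, min_replicas=1, max_replicas=5)
```

With these inputs, 25 ongoing requests at a target of 10 per replica call for 3 replicas, an idle deployment stays at the minimum of 1, and a burst of 500 requests is capped at the maximum of 5.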
Step 7: Shutdown Serve
Tear down what is no longer needed. Individual applications can be deleted by name, or the entire Serve system, including all managed deployments, can be shut down. Deletion can be blocking (waits for cleanup) or non-blocking.
Key considerations:
- Application deletion removes all associated deployments and replicas
- Full shutdown terminates the controller and all proxy actors
- After shutdown, Serve must be re-initialized for new deployments