Workflow:Ray project Ray Serve Deployment

Knowledge Sources	Ray, Ray Serve
Domains	Model_Serving, ML_Ops, Distributed_Computing
Last Updated	2026-02-13 16:00 GMT

Overview

End-to-end process for deploying, scaling, and serving machine learning models or business logic as HTTP endpoints using Ray Serve's deployment framework.

Description

This workflow covers the complete process of deploying applications with Ray Serve. It starts with initializing the Serve control plane (controller actor, HTTP proxies), proceeds through defining deployment classes with scaling and resource configurations, deploying them as applications via the DAG-based composition system, invoking deployments through handles or HTTP, and managing the deployment lifecycle. Ray Serve builds on top of Ray actors, providing autoscaling, health checking, rolling updates, and request routing across deployment replicas.

Usage

Use this workflow when you need to serve machine learning models or computation-heavy business logic behind an HTTP API with production-grade features such as autoscaling, replica management, and request routing. It applies to real-time inference, online prediction services, multi-model composition pipelines, and any other scenario that requires scalable request-response serving.

Execution Steps

Step 1: Initialize Serve Control Plane

Start the Ray Serve system by calling the Serve start API, which initializes or connects to the Serve controller, a detached actor that manages all deployment state. The call first checks for an existing controller. If none exists, it initializes Ray (if needed), creates a controller avatar actor for health checking, and waits for the HTTP proxy actors to become ready before returning a ServeControllerClient for deployment management.

Key considerations:

The controller is a detached actor with infinite restarts for fault tolerance
All Serve actors run in a dedicated namespace for isolation
HTTP proxies must become healthy before the start call returns (with a configurable timeout)
If a controller already exists, the call reconnects rather than creating a new one
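
As a rough illustration of the start-or-reconnect behavior described above, the following Python sketch calls `serve.start` from Ray Serve's Python API. The helper names `build_http_options` and `start_serve`, and the host and port values, are assumptions made for this example. Ray is imported inside the function so that the option helper can be exercised without a cluster.

```python
def build_http_options(host="127.0.0.1", port=8000):
    """Build the options dict handed to the HTTP proxies (hypothetical helper)."""
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return {"host": host, "port": port}


def start_serve(host="127.0.0.1", port=8000):
    """Start Serve on the current cluster, or connect to a running instance."""
    # Imported lazily so the helper above works without Ray installed.
    from ray import serve

    serve.start(http_options=build_http_options(host, port))
```

Calling `start_serve()` a second time against the same cluster should attach to the existing controller rather than create another one, which matches the last consideration in the list above.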

Step 2: Define Deployment Class

Create the class that will handle requests. The deployment class contains the business logic (model inference, computation, etc.) and can maintain internal state. It receives requests through a designated call method. The class is packaged with its configuration (replica count, resource requirements, autoscaling parameters) via the DeploymentCreator builder.

Key considerations:

Deployment classes are instantiated as Ray actors (one per replica)
Each replica maintains independent state
The class name is used as the deployment identifier by default
Constructor arguments configure model loading or initialization
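
A minimal sketch of such a class, written in Python, might look like the following. `SentimentModel`, its word list, and the `as_deployment` helper are invented for illustration. The sketch uses the Python API's `serve.deployment` wrapper, which is an assumption about the target language; the DeploymentCreator builder mentioned above is not shown.

```python
class SentimentModel:
    """Toy deployment class: state lives on the instance, one instance per replica."""

    def __init__(self, positive_words):
        # Stand-in for loading model weights in the constructor.
        self.positive_words = set(positive_words)
        self.requests_seen = 0

    def __call__(self, text):
        # The designated call method: handles one request.
        self.requests_seen += 1
        hits = sum(word in self.positive_words for word in text.lower().split())
        return {"positive": hits > 0, "hits": hits}


def as_deployment():
    """Wrap the class with the Python API's deployment wrapper."""
    from ray import serve  # lazy import: the class itself needs no Ray

    # With no explicit name, the class name serves as the deployment name.
    return serve.deployment(SentimentModel)
```

Two plain instances of `SentimentModel` keep separate `requests_seen` counters, which mirrors how each replica holds its own state.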

Step 3: Configure and Bind Deployment

Use the deployment builder to set the deployment name, number of replicas, resource allocation per replica, autoscaling configuration, and any constructor arguments. Calling bind() on the builder creates an Application object wrapping a DAG node. Multiple deployments can be composed into a DAG for model composition pipelines.

Key considerations:

Replica count controls horizontal scaling
Autoscaling config enables dynamic scaling based on request load
Route prefix must start with "/" if specified
Version strings enable rolling updates
Multiple deployments can be bound into a graph using DAG composition
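
The configuration and bind step could be sketched in Python roughly as follows. The option values, the `Greeter` class, and the `check_route_prefix` and `build_app` helpers are assumptions for this example; only `serve.deployment` and `bind` come from Ray Serve's Python API, and the prefix check is a stand-in for whatever validation Serve itself performs.

```python
DEPLOYMENT_OPTIONS = {
    "name": "greeter",
    # Bounds for load-based scaling; a fixed count would use num_replicas instead.
    "autoscaling_config": {"min_replicas": 1, "max_replicas": 4},
    # Resources reserved for each replica actor.
    "ray_actor_options": {"num_cpus": 1},
}


def check_route_prefix(route_prefix):
    """Reject prefixes that do not start with '/' (hypothetical helper)."""
    if route_prefix is not None and not route_prefix.startswith("/"):
        raise ValueError(f"route prefix must start with '/': {route_prefix!r}")
    return route_prefix


class Greeter:
    def __init__(self, greeting):
        self.greeting = greeting

    def __call__(self, name):
        return f"{self.greeting}, {name}!"


def build_app(greeting="Hello"):
    """Apply the options and bind constructor arguments into an application."""
    from ray import serve  # lazy import so the rest runs without Ray

    return serve.deployment(**DEPLOYMENT_OPTIONS)(Greeter).bind(greeting)
```

Keeping the options in one dictionary makes it easy to see which settings control scaling and which control per-replica resources.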

Step 4: Deploy Application

Submit the application to the Serve controller by calling the run API. This extracts deployments from the DAG structure, assigns version identifiers, and instructs the controller to create or update deployment replicas. The controller manages replica lifecycle, including creating new replicas, draining old ones during updates, and maintaining the desired replica count.

Key considerations:

Deployment is managed by the controller as a long-running process
Returns a DeploymentHandle pointing to the ingress deployment
Blocking mode waits until all replicas are healthy
Non-blocking mode returns immediately while deployment proceeds
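
To make the extraction step more concrete, here is a toy Python model of collecting deployments from a bound graph. It is not Serve's internal code: `Node`, `extract_deployments`, and the deployment names are invented for illustration, and version assignment and controller calls are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One bound deployment and the deployments it depends on."""
    name: str
    deps: list = field(default_factory=list)


def extract_deployments(root):
    """Return deployment names in dependency order; the root (ingress) comes last."""
    ordered, seen = [], set()

    def visit(node):
        if node.name in seen:
            return  # shared dependencies are listed only once
        seen.add(node.name)
        for dep in node.deps:
            visit(dep)
        ordered.append(node.name)

    visit(root)
    return ordered


preprocess = Node("Preprocess")
model = Node("Model", [preprocess])
ingress = Node("Ingress", [preprocess, model])
deployment_order = extract_deployments(ingress)
```

In this model the ingress deployment is always the last entry, which corresponds to the deployment that the returned handle points to.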

Step 5: Send Requests to Deployment

Invoke the deployed application through the DeploymentHandle (programmatic) or HTTP endpoint (external clients). The handle's remote method submits requests to the router, which selects a healthy replica from the ReplicaSet and forwards the request. Responses are returned as DeploymentResponse objects (futures). For HTTP, the proxy actor handles request parsing and routing.

Key considerations:

Programmatic handles use method routing for specific endpoint selection
Parallel request submission (collect responses, then aggregate) is more efficient than sequential blocking
The router load-balances across available replicas
Long-poll synchronization keeps replica sets current with controller state
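
The parallel submission pattern mentioned above can be sketched in Python as follows. `fan_out` only assumes a handle-like object whose `remote()` returns something with a blocking `result()` method. `LocalHandle` is a hypothetical stand-in built on a thread pool so the pattern can be tried without a cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def fan_out(handle, inputs):
    """Submit every request first, then collect results in input order."""
    responses = [handle.remote(item) for item in inputs]  # non-blocking submits
    return [response.result() for response in responses]  # block once per response


class LocalHandle:
    """Hypothetical stand-in for a DeploymentHandle, backed by a thread pool."""

    def __init__(self, func, workers=4):
        self._func = func
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def remote(self, *args):
        # Returns a future whose result() blocks, like a response object.
        return self._pool.submit(self._func, *args)


doubled = fan_out(LocalHandle(lambda x: x * 2), [1, 2, 3])
```

Calling `result()` immediately after each `remote()` would instead serialize the requests, since each call would wait for the previous response before the next one is submitted.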

Step 6: Monitor and Scale

The Serve controller continuously monitors deployment health, adjusts replica counts based on autoscaling policies, and handles replica failures. Metrics are exposed for monitoring systems. Deployments can be updated in-place by re-running with new configurations, triggering rolling updates.

Key considerations:

Autoscaling responds to queue depth and request rate
Failed replicas are automatically replaced up to configured limits
Deployment updates use a rolling strategy to minimize downtime
Metrics include request latency, queue size, and replica count
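
As a simplified illustration of load-based scaling, the Python sketch below computes a replica count from the number of in-flight requests. It is an assumption-laden toy, not Serve's actual autoscaling policy: the function name, its parameters, and the lack of any smoothing or delay are all choices made for this example.

```python
import math


def desired_replicas(ongoing_requests, target_per_replica, min_replicas, max_replicas):
    """Replicas needed to keep per-replica load near the target, within bounds."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    needed = math.ceil(ongoing_requests / target_per_replica)
    # Clamp to the configured bounds so load spikes and idle periods stay bounded.
    return max(min_replicas, min(max_replicas, needed))
```

With a target of 5 requests per replica and bounds of 1 to 4, a load of 7 requests gives 2 replicas, a load of 25 is capped at 4, and an idle deployment stays at 1.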

Step 7: Shutdown Serve

Tear down what was deployed. Individual applications can be deleted by name, or the entire Serve system can be shut down along with all managed deployments. Deletion can be blocking (waits for cleanup) or non-blocking.

Key considerations:

Application deletion removes all associated deployments and replicas
Full shutdown terminates the controller and all proxy actors
After shutdown, Serve must be re-initialized for new deployments
