Principle:Ray project Ray Autoscaling And Monitoring

Knowledge Sources	Ray Serve Autoscaling Ray
Domains	Model_Serving, Auto_Scaling
Last Updated	2026-02-13 17:00 GMT

Overview

A reactive scaling mechanism that automatically adjusts a deployment's replica count based on observed request load.

Description

Autoscaling and Monitoring enables deployments to dynamically adjust their replica count based on real-time metrics. Each replica reports metrics (request count, latency, ongoing requests) to the Serve controller. The controller averages the load signal, which in Ray Serve is the number of ongoing requests per replica, over a lookback window and uses that average to decide when to scale up or down. Configurable parameters control the target load, the minimum and maximum replica counts, the length of the lookback window, and the delays applied before scaling up or down.
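
The averaging step can be illustrated with a minimal sketch. It assumes a simple unweighted mean over timestamped samples; the class `LookbackAverage` and its methods are hypothetical names used for illustration and are not Ray APIs.

```python
from collections import deque


class LookbackAverage:
    """Mean of the samples reported within a lookback window.

    Hypothetical helper for illustration; not part of Ray.
    """

    def __init__(self, look_back_period_s: float):
        self.look_back_period_s = look_back_period_s
        self._samples = deque()  # (timestamp_s, value) pairs, oldest first

    def report(self, timestamp_s: float, value: float) -> None:
        # Samples are assumed to arrive in timestamp order.
        self._samples.append((timestamp_s, value))

    def average(self, now_s: float) -> float:
        # Drop samples that have fallen out of the lookback window.
        cutoff = now_s - self.look_back_period_s
        while self._samples and self._samples[0][0] < cutoff:
            self._samples.popleft()
        if not self._samples:
            return 0.0
        return sum(value for _, value in self._samples) / len(self._samples)
```

A short spike contributes only one sample among many in the window, so it moves the average far less than it moves the instantaneous reading.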

Usage

Configure autoscaling when deployment load is variable and you want to optimize resource utilization while maintaining latency targets.
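
As a rough illustration, an autoscaling configuration might look like the dictionary below. The keys follow the parameter names used in recent Ray Serve releases (older releases used different names for some of them, so check the documentation for your version). The values are illustrative only, and `check_autoscaling_config` is a hypothetical helper, not part of Ray.

```python
# Illustrative values only; tune them for your own workload.
autoscaling_config = {
    "min_replicas": 1,             # lower scaling bound
    "max_replicas": 10,            # upper scaling bound
    "target_ongoing_requests": 2,  # target load per replica
    "look_back_period_s": 30,      # window used to average the load metric
    "upscale_delay_s": 30,         # wait this long before scaling up
    "downscale_delay_s": 600,      # wait this long before scaling down
}


def check_autoscaling_config(config: dict) -> None:
    """Basic sanity checks (hypothetical helper, not part of Ray)."""
    if not 0 <= config["min_replicas"] <= config["max_replicas"]:
        raise ValueError("need 0 <= min_replicas <= max_replicas")
    if config["target_ongoing_requests"] <= 0:
        raise ValueError("target_ongoing_requests must be positive")
    for key in ("look_back_period_s", "upscale_delay_s", "downscale_delay_s"):
        if config[key] < 0:
            raise ValueError(f"{key} must be non-negative")


check_autoscaling_config(autoscaling_config)

# With Ray installed, the dict would typically be passed to the deployment:
#
#   from ray import serve
#
#   @serve.deployment(autoscaling_config=autoscaling_config)
#   class MyModel:
#       ...
```

A downscale delay much longer than the upscale delay is a common choice when latency matters more than cost: capacity is added quickly and released slowly.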

Theoretical Basis

Autoscaling implements a reactive control loop:

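$\text{desiredReplicas} = \left\lceil \text{currentReplicas} \times \left(1 + \text{smoothingFactor} \times \left(\frac{\text{observedLoad}}{\text{targetLoad}} - 1\right)\right) \right\rceil$

Here observedLoad is the average load per replica. The smoothing factor scales only the deviation of the load ratio from 1, so a factor of 1 gives the plain proportional rule, and a load equal to the target leaves the replica count unchanged. The result is clamped to the configured minimum and maximum replica counts.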

The system uses hysteresis (upscale/downscale delays) to prevent oscillation, and a lookback window to smooth out transient load spikes.
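
One step of this loop, together with the delay logic, can be sketched as follows. The sketch assumes that the smoothing factor scales the deviation of the load ratio from 1, that the result is rounded up and clamped to the replica bounds, and that a change is applied only after it has been proposed continuously for the configured delay. `desired_replicas` and `DelayedScaler` are hypothetical names, not Ray APIs, and this is not Ray's actual code.

```python
import math


def desired_replicas(current, total_ongoing, target_per_replica,
                     smoothing_factor=1.0, min_replicas=1, max_replicas=10):
    """One control-loop step. Assumes current >= 1."""
    # Ratio of observed per-replica load to the target; 1.0 means on target.
    error_ratio = total_ongoing / (target_per_replica * current)
    # The smoothing factor scales only the deviation from 1.0 (assumption).
    smoothed = 1 + (error_ratio - 1) * smoothing_factor
    proposed = math.ceil(current * smoothed)
    return max(min_replicas, min(max_replicas, proposed))


class DelayedScaler:
    """Applies a change only after it has been proposed for the full delay."""

    def __init__(self, upscale_delay_s, downscale_delay_s):
        self.upscale_delay_s = upscale_delay_s
        self.downscale_delay_s = downscale_delay_s
        self._direction = 0   # +1 scaling up, -1 scaling down, 0 no change
        self._since_s = None  # when the current direction was first proposed

    def decide(self, now_s, current, proposed):
        direction = (proposed > current) - (proposed < current)
        if direction != self._direction:
            # The proposal changed direction, so restart the timer.
            self._direction = direction
            self._since_s = now_s
        if direction == 0:
            return current
        delay = self.upscale_delay_s if direction > 0 else self.downscale_delay_s
        if now_s - self._since_s >= delay:
            self._direction, self._since_s = 0, None
            return proposed
        return current
```

For example, with 4 replicas, 12 ongoing requests in total, and a target of 2 per replica, the proposal is 6 replicas; with a smoothing factor of 0.5 it is 5. A brief dip in load resets the downscale timer, which is what keeps the replica count from oscillating.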

Related Pages

Implemented By

Implementation:Ray_project_Ray_AutoscalingConfig_Setup

Uses Heuristic
