Principle:Ray project Ray Autoscaling And Monitoring
| Knowledge Sources | |
|---|---|
| Domains | Model_Serving, Auto_Scaling |
| Last Updated | 2026-02-13 17:00 GMT |
Overview
A reactive scaling mechanism that automatically adjusts deployment replica count based on observed request load metrics.
Description
Autoscaling and Monitoring enables deployments to dynamically adjust their replica count based on real-time metrics. Each replica reports metrics (request count, latency, ongoing requests) to the Serve controller, which uses a smoothed average to decide when to scale up or down. Configurable parameters control the target load, scaling bounds, observation windows, and cooldown delays.
Usage
Configure autoscaling when deployment load is variable and you want to optimize resource utilization while maintaining latency targets.
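As an illustration, the parameters named in the Description could be collected in a configuration dictionary like the one below. This is a minimal sketch: the key names are assumed to match the `autoscaling_config` options of recent Ray Serve releases and may differ between versions, and every value is a hypothetical placeholder rather than a recommendation.

```python
# Hypothetical values for illustration only; tune them for your workload.
# Key names are assumed to follow Ray Serve's autoscaling_config options.
autoscaling_config = {
    "min_replicas": 1,             # lower scaling bound
    "max_replicas": 10,            # upper scaling bound
    "target_ongoing_requests": 2,  # target load per replica
    "look_back_period_s": 30,      # observation window for averaging metrics
    "upscale_delay_s": 30,         # cooldown before adding replicas
    "downscale_delay_s": 600,      # cooldown before removing replicas
}

# With Ray Serve installed, the dictionary would typically be passed to a
# deployment (MyModel is a hypothetical class name):
#
#   from ray import serve
#
#   @serve.deployment(autoscaling_config=autoscaling_config)
#   class MyModel:
#       ...
```

A longer downscale delay than upscale delay is a common choice: it lets the deployment react quickly to rising load while avoiding premature removal of replicas.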
Theoretical Basis
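Autoscaling implements a reactive control loop: the Serve controller periodically compares the smoothed load per replica with the configured target, derives a desired replica count, and clamps it to the configured scaling bounds.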
The system uses hysteresis (upscale/downscale delays) to prevent oscillation, and a lookback window to smooth out transient load spikes.
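The control loop, lookback window, and delays can be sketched in plain Python. This is a simplified toy model under stated assumptions, not Ray Serve's actual implementation: the class `ReactiveAutoscaler`, its field names, and the rule "desired replicas = ceil(smoothed ongoing requests / target per replica)" are illustrative choices.

```python
import math
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ReactiveAutoscaler:
    """Toy autoscaler (hypothetical, not Ray's code): smoothed load -> replicas."""

    min_replicas: int = 1
    max_replicas: int = 10
    target_per_replica: float = 2.0
    look_back_period_s: float = 30.0
    upscale_delay_s: float = 30.0
    downscale_delay_s: float = 600.0
    replicas: int = 1
    # (timestamp, total ongoing requests) samples inside the lookback window
    _samples: deque = field(default_factory=deque)
    _pending_direction: int = 0  # +1 scale up, -1 scale down, 0 none
    _pending_since: Optional[float] = None

    def step(self, now: float, total_ongoing: float) -> int:
        # 1. Record the new sample and drop those older than the window.
        self._samples.append((now, total_ongoing))
        while self._samples[0][0] < now - self.look_back_period_s:
            self._samples.popleft()

        # 2. Smooth the load by averaging over the lookback window.
        smoothed = sum(v for _, v in self._samples) / len(self._samples)

        # 3. Derive the desired replica count and clamp it to the bounds.
        desired = math.ceil(smoothed / self.target_per_replica)
        desired = max(self.min_replicas, min(self.max_replicas, desired))

        # 4. Hysteresis: act only after the same scaling direction has
        #    persisted for the corresponding delay.
        direction = (desired > self.replicas) - (desired < self.replicas)
        if direction == 0:
            self._pending_direction, self._pending_since = 0, None
            return self.replicas
        if direction != self._pending_direction:
            self._pending_direction, self._pending_since = direction, now
        delay = self.upscale_delay_s if direction > 0 else self.downscale_delay_s
        if now - self._pending_since >= delay:
            self.replicas = desired
            self._pending_direction, self._pending_since = 0, None
        return self.replicas
```

With the defaults above, calling `step` with a sustained load of 10 ongoing requests leaves the replica count at 1 until the upscale delay of 30 seconds has elapsed, then raises it to 5. A subsequent drop in load does not shrink the deployment until the much longer downscale delay has passed.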