

Principle:Ray project Ray Autoscaling And Monitoring

From Leeroopedia
Knowledge Sources
Domains: Model_Serving, Auto_Scaling
Last Updated: 2026-02-13 17:00 GMT

Overview

A reactive scaling mechanism that automatically adjusts a deployment's replica count based on observed request-load metrics.

Description

Autoscaling and Monitoring lets a deployment adjust its replica count dynamically based on real-time metrics. Each replica reports metrics (request count, latency, ongoing requests) to the Serve controller, which averages them over a recent time window and uses that smoothed value to decide when to scale up or down. Configurable parameters control the target load per replica, the minimum and maximum replica counts, the length of the observation window, and the delays applied before scaling up or down.
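The windowed averaging of replica reports can be sketched as follows. This is a minimal illustration, assuming the controller receives periodic snapshots of ongoing-request counts keyed by replica and averages their totals over a lookback period. The class `MetricsWindow`, its methods, and the 30-second default are hypothetical and are not Ray Serve's internal API.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class MetricsWindow:
    """Hypothetical controller-side store for replica load reports.

    Keeps (timestamp, total_ongoing_requests) samples and averages the
    ones that fall inside the last `look_back_period_s` seconds.
    """

    look_back_period_s: float = 30.0  # illustrative default, not Ray's
    samples: deque = field(default_factory=deque)

    def record(self, timestamp: float, ongoing_by_replica: dict[str, int]) -> None:
        # Sum the ongoing-request counts reported by all replicas at this tick.
        self.samples.append((timestamp, sum(ongoing_by_replica.values())))

    def average(self, now: float) -> float:
        # Evict samples older than the lookback window, then average the rest.
        cutoff = now - self.look_back_period_s
        while self.samples and self.samples[0][0] < cutoff:
            self.samples.popleft()
        if not self.samples:
            return 0.0
        return sum(total for _, total in self.samples) / len(self.samples)


window = MetricsWindow(look_back_period_s=30.0)
window.record(0.0, {"r1": 4, "r2": 2})    # total 6
window.record(10.0, {"r1": 10, "r2": 8})  # total 18: a transient spike
window.record(20.0, {"r1": 3, "r2": 3})   # total 6
print(window.average(now=20.0))           # 10.0
```

Averaging over the window turns the single spike of 18 into a smoothed value of 10, which is the damping effect the lookback window is meant to provide.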

Usage

Configure autoscaling when a deployment's load varies over time and you want to use resources efficiently while still meeting latency targets.

Theoretical Basis

Autoscaling implements a reactive control loop:

$$\text{desiredReplicas} = \text{currentReplicas} \times \frac{\text{observedLoad}}{\text{targetLoad}} \times \text{smoothingFactor}$$
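A sketch of this proportional rule in Python, assuming both loads are per-replica quantities (for example, average ongoing requests per replica), the result is rounded up, and it is clamped to the minimum and maximum replica bounds. Taken literally, multiplying the whole product by smoothingFactor would shrink the deployment even when observed load equals the target (for any factor below 1). This sketch therefore assumes the factor damps the deviation from the target instead, which agrees with the formula when smoothingFactor = 1. The function name and parameters are hypothetical, not Ray Serve's code.

```python
import math


def desired_replicas(
    current_replicas: int,
    observed_load: float,
    target_load: float,
    smoothing_factor: float = 1.0,
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """Hypothetical proportional scaling decision (illustrative sketch).

    observed_load and target_load are per-replica values, e.g. average
    ongoing requests per replica.
    """
    if current_replicas < 1:
        # Per-replica load is undefined with no replicas; scale-from-zero
        # is out of scope for this sketch.
        raise ValueError("current_replicas must be at least 1")
    ratio = observed_load / target_load
    # Damp the deviation from the target (assumption, see lead-in), so a
    # ratio of exactly 1 always keeps the current replica count.
    smoothed_ratio = 1 + (ratio - 1) * smoothing_factor
    raw = math.ceil(current_replicas * smoothed_ratio)
    # Enforce the configured scaling bounds.
    return max(min_replicas, min(max_replicas, raw))


print(desired_replicas(4, observed_load=3.0, target_load=2.0))                        # 6
print(desired_replicas(4, observed_load=3.0, target_load=2.0, smoothing_factor=0.5))  # 5
```

With 4 replicas each carrying 3 requests against a target of 2, the undamped rule asks for 6 replicas; a smoothing factor of 0.5 moves only halfway, to 5.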

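The system adds hysteresis through separate upscale and downscale delays: a scaling decision takes effect only after it has persisted for the configured delay, which keeps the replica count from oscillating. A lookback window averages the load metric over recent history so that transient spikes do not trigger scaling on their own.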
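The delay-based hysteresis can be sketched as a small gate in front of the scaling decision. This assumes a proposed change must be requested in the same direction continuously for the relevant delay before it is applied, and that any return to the current count cancels the pending change. `DelayedScaler` and its defaults are hypothetical illustrations, not Ray Serve's implementation.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DelayedScaler:
    """Hypothetical hysteresis gate for scaling decisions (illustrative).

    A change is applied only after the same direction has been proposed
    continuously for upscale_delay_s (up) or downscale_delay_s (down).
    """

    current: int
    upscale_delay_s: float = 30.0     # illustrative values
    downscale_delay_s: float = 600.0
    _direction: int = field(default=0, init=False)  # +1 up, -1 down, 0 none
    _since: Optional[float] = field(default=None, init=False)

    def step(self, now: float, proposed: int) -> int:
        direction = (proposed > self.current) - (proposed < self.current)
        if direction == 0:
            # Load is back at target: cancel any pending change.
            self._direction, self._since = 0, None
            return self.current
        if direction != self._direction:
            # New or reversed trend: restart the timer.
            self._direction, self._since = direction, now
        delay = self.upscale_delay_s if direction > 0 else self.downscale_delay_s
        if now - self._since >= delay:
            # The trend has persisted long enough: apply the latest proposal.
            self.current = proposed
            self._direction, self._since = 0, None
        return self.current


scaler = DelayedScaler(current=2)
print(scaler.step(0, 4))   # 2: up-trend starts, timer begins
print(scaler.step(10, 4))  # 2: only 10 s elapsed
print(scaler.step(20, 2))  # 2: load back at target, pending change cancelled
print(scaler.step(30, 4))  # 2: timer restarts
print(scaler.step(60, 4))  # 4: 30 s of sustained demand, scale up
```

Making the downscale delay much longer than the upscale delay is a common asymmetry: it reacts quickly to rising load but avoids releasing capacity during a brief lull.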

Related Pages

Implemented By

Uses Heuristic
