Heuristic:Ray project Ray Autoscaling Delay Tuning

Knowledge Sources	Ray Autoscaling Guide
Domains	Optimization, Serve, Autoscaling
Last Updated	2026-02-13 16:35 GMT

Overview

Asymmetric delay strategy for Ray Serve autoscaling: a 30-second upscale delay for fast response and a 600-second (10-minute) downscale delay to prevent thrashing.

Description

Ray Serve autoscaling uses an asymmetric delay pattern where scaling up is fast (30s delay) and scaling down is slow (600s delay). This design reflects the different costs of each action: under-provisioning causes user-facing latency spikes, while over-provisioning only wastes compute resources temporarily. A smoothing factor (default 1.0) acts as a multiplicative gain to limit how aggressively the autoscaler reacts to metric fluctuations. The metrics lookback window (30s) averages recent traffic to avoid reacting to transient spikes.
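
The effect of the lookback window can be illustrated with a small sliding-window average. This is a minimal sketch and not Ray Serve's metrics code: the class `LookbackWindow`, its methods, and the choice to evict samples at exactly the window boundary are all assumptions made for illustration.

```java
// Hypothetical illustration of lookback averaging; not Ray Serve's implementation.
final class LookbackWindow {
    private final double lookBackPeriodS;
    // Each entry is {timestampSeconds, ongoingRequests}.
    private final java.util.ArrayDeque<double[]> samples = new java.util.ArrayDeque<>();

    LookbackWindow(double lookBackPeriodS) {
        this.lookBackPeriodS = lookBackPeriodS;
    }

    /** Records one scraped sample and evicts samples that fell out of the window. */
    void record(double timestampS, double ongoingRequests) {
        samples.addLast(new double[] {timestampS, ongoingRequests});
        // Assumed half-open window (now - lookBackPeriodS, now].
        while (!samples.isEmpty() && samples.peekFirst()[0] <= timestampS - lookBackPeriodS) {
            samples.removeFirst();
        }
    }

    /** Mean of the samples still inside the window, or 0 if there are none. */
    double average() {
        if (samples.isEmpty()) {
            return 0.0;
        }
        double sum = 0.0;
        for (double[] s : samples) {
            sum += s[1];
        }
        return sum / samples.size();
    }

    public static void main(String[] args) {
        LookbackWindow window = new LookbackWindow(30.0);
        double[] readings = {10.0, 10.0, 10.0, 100.0}; // one spike at t=30s
        for (int i = 0; i < readings.length; i++) {
            window.record(i * 10.0, readings[i]);
        }
        System.out.println("Windowed average after spike: " + window.average());
    }
}
```

Under these assumptions, a 30s window scraped every 10s holds three samples, so a single reading of 100 among readings of 10 averages to 40 rather than driving the decision on its own.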

Usage

Apply this heuristic when configuring Ray Serve autoscaling for production deployments. It is especially relevant when:

You observe replicas scaling up and down repeatedly (thrashing)
You need to balance cost efficiency with latency SLAs
You are tuning the `AutoscalingConfig` for a deployment with variable traffic patterns

The Insight (Rule of Thumb)

Action: Set `upscaleDelayS` to a low value (default 30s) and `downscaleDelayS` to a high value (default 600s = 10 minutes).
Value: The defaults give a 20:1 ratio of downscale delay to upscale delay (600s / 30s), which is a reasonable starting point for most workloads.
Trade-off: Lower upscale delay = faster response to traffic spikes but more risk of over-provisioning. Lower downscale delay = better cost efficiency but risk of thrashing.
Additional tuning:
- `smoothingFactor` = 1.0 (default). Values < 1.0 dampen scaling; values > 1.0 amplify it.
- `metricsIntervalS` = 10.0s (scrape interval). Lower values give faster reaction but more noise.
- `lookBackPeriodS` = 30.0s (averaging window). Shorter windows react faster; longer windows smooth more.
- `targetOngoingRequests` should be smaller for longer-running requests and for lower latency objectives.
- `maxReplicas` should be set ~20% higher than expected peak traffic needs.
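
The arithmetic behind the 20:1 ratio and the ~20% `maxReplicas` headroom can be written out as follows. This is only a sketch: the class `DelayTuningMath` and its method names are hypothetical helpers, not part of Ray's API, and the peak replica counts are made-up inputs.

```java
// Hypothetical helper for the rule-of-thumb arithmetic; not part of Ray.
final class DelayTuningMath {
    // Defaults quoted in this page.
    static final double UPSCALE_DELAY_S = 30.0;
    static final double DOWNSCALE_DELAY_S = 600.0;

    /** Ratio of downscale delay to upscale delay. */
    static double delayRatio() {
        return DOWNSCALE_DELAY_S / UPSCALE_DELAY_S;
    }

    /**
     * Replica ceiling with headroom above the estimated peak, rounded up.
     * Integer arithmetic avoids floating-point rounding surprises.
     */
    static int maxReplicasWithHeadroom(int expectedPeakReplicas, int headroomPercent) {
        return (expectedPeakReplicas * (100 + headroomPercent) + 99) / 100;
    }

    public static void main(String[] args) {
        System.out.println("Delay ratio: " + delayRatio());
        System.out.println("maxReplicas for a peak of 10: " + maxReplicasWithHeadroom(10, 20));
        System.out.println("maxReplicas for a peak of 8: " + maxReplicasWithHeadroom(8, 20));
    }
}
```

For an estimated peak of 10 replicas, 20% headroom gives a ceiling of 12; for a peak of 8, the 9.6 result rounds up to 10.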

Reasoning

The asymmetric delay is grounded in the observation that:

Slow upscaling costs latency: When demand exceeds capacity, requests queue and users experience latency. Fast upscaling limits how long that lasts.
Slow downscaling costs only money: Over-provisioned replicas waste compute but do not harm users. Conservative downscaling avoids the thrashing scenario where replicas are created and destroyed repeatedly.
Hysteresis prevents oscillation: The long downscale delay, paired with the short upscale delay, acts as time-based hysteresis. Traffic must remain low for a sustained period (10 minutes by default) before replicas are removed, which prevents oscillation around the scaling threshold.

The smoothing factor scales the magnitude of each scaling decision. With `smoothingFactor=1.0`, the autoscaler applies the full calculated adjustment. Values below 1.0 apply only part of it, which is useful when metrics are noisy.
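
One way to read "multiplicative gain" is sketched below. The formula is an assumption for illustration: the load-to-target ratio is scaled around 1.0 by the smoothing factor, and the result is clamped to the replica bounds. The class `SmoothedScaling`, its method, and all parameter names other than `smoothingFactor` are hypothetical and not taken from the Ray source.

```java
// Hypothetical sketch of a smoothed replica calculation; the formula is an assumption.
final class SmoothedScaling {
    static int desiredReplicas(int currentReplicas,
                               double avgOngoingPerReplica,
                               double targetOngoingRequests,
                               double smoothingFactor,
                               int minReplicas,
                               int maxReplicas) {
        // > 1.0 means replicas are overloaded, < 1.0 means they are underused.
        double errorRatio = avgOngoingPerReplica / targetOngoingRequests;
        // The gain shrinks (factor < 1) or stretches (factor > 1) the distance from 1.0.
        double smoothedRatio = 1.0 + (errorRatio - 1.0) * smoothingFactor;
        int desired = (int) Math.ceil(currentReplicas * smoothedRatio);
        // Clamp to the configured replica bounds.
        return Math.max(minReplicas, Math.min(maxReplicas, desired));
    }

    public static void main(String[] args) {
        // 10 replicas, each averaging 4 ongoing requests against a target of 2.
        for (double factor : new double[] {0.5, 1.0, 2.0}) {
            int desired = desiredReplicas(10, 4.0, 2.0, factor, 1, 25);
            System.out.println("smoothingFactor=" + factor + " -> " + desired + " replicas");
        }
    }
}
```

In this sketch, 10 replicas running at twice their target load move to 20 replicas with a factor of 1.0, to 15 with a factor of 0.5, and to the cap of 25 with a factor of 2.0.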

The controller tracks separate timestamps for scale-up and scale-down events to implement this hysteresis correctly (see `autoscaling_state.py`).
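
The idea of separate timestamps per direction can be sketched as a small delay gate. This is a simplified model written for this page, not a copy of `autoscaling_state.py`: the class `DelayGate`, its fields, and the rule that a decision in the opposite direction resets the other timer are assumptions.

```java
// Hypothetical delay gate; a simplified model, not Ray Serve's controller logic.
final class DelayGate {
    private final double upscaleDelayS;
    private final double downscaleDelayS;
    // Start time of the current uninterrupted run of decisions in each direction.
    // NaN means no run is in progress.
    private double upscaleSince = Double.NaN;
    private double downscaleSince = Double.NaN;

    DelayGate(double upscaleDelayS, double downscaleDelayS) {
        this.upscaleDelayS = upscaleDelayS;
        this.downscaleDelayS = downscaleDelayS;
    }

    /** Returns the replica count to run after this evaluation. */
    int decide(double nowS, int current, int desired) {
        if (desired > current) {
            downscaleSince = Double.NaN; // an upscale signal interrupts any downscale run
            if (Double.isNaN(upscaleSince)) {
                upscaleSince = nowS;
            }
            if (nowS - upscaleSince >= upscaleDelayS) {
                upscaleSince = Double.NaN;
                return desired;
            }
        } else if (desired < current) {
            upscaleSince = Double.NaN; // a downscale signal interrupts any upscale run
            if (Double.isNaN(downscaleSince)) {
                downscaleSince = nowS;
            }
            if (nowS - downscaleSince >= downscaleDelayS) {
                downscaleSince = Double.NaN;
                return desired;
            }
        } else {
            upscaleSince = Double.NaN;
            downscaleSince = Double.NaN;
        }
        return current; // delay not yet satisfied, or nothing to change
    }

    public static void main(String[] args) {
        DelayGate gate = new DelayGate(30.0, 600.0);
        System.out.println("t=0s:  " + gate.decide(0.0, 2, 4));
        System.out.println("t=30s: " + gate.decide(30.0, 2, 4));
    }
}
```

In this model an upscale from 2 to 4 replicas requested at t=0s takes effect at t=30s, while a downscale request must persist without interruption for 600s. A brief traffic bump in the middle restarts the downscale timer, which is the behavior that suppresses thrashing.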

Code Evidence

Default autoscaling delays from `AutoscalingConfig.java:17-19`:

/** How long to wait before scaling down replicas */
private double downscaleDelayS = 600.0;
/** How long to wait before scaling up replicas */
private double upscaleDelayS = 30.0;

Smoothing factor and metrics configuration from `AutoscalingConfig.java:11-15`:

/** How often to scrape for metrics */
private double metricsIntervalS = 10.0;
/** Time window to average over for metrics. */
private double lookBackPeriodS = 30.0;
/** Multiplicative "gain" factor to limit scaling decisions */
private double smoothingFactor = 1.0;
