Heuristic:Ray project Ray Serve Concurrency And Backpressure

Knowledge Sources	Ray Serve Configuration
Domains	Serve, Optimization, Backpressure
Last Updated	2026-02-13 16:35 GMT

Overview

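Concurrency tuning strategy for Ray Serve: start from the default `maxOngoingRequests` of 100 per replica and adjust it to your request duration profile, and keep the number of deployment handles within the controller concurrency of 15,000, which exists to serve O(num_handles) long-poll connections.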

Description

Ray Serve uses two key concurrency parameters that directly affect system stability and throughput. The per-replica `maxOngoingRequests` (default 100) controls backpressure: when a replica reaches this limit, new requests are queued at the router level. The controller's `CONTROLLER_MAX_CONCURRENCY` (15,000) must scale with the number of deployment handles because each handle maintains one long-poll connection to the controller. Setting these values too low starves requests; setting them too high risks out-of-memory (OOM) errors and resource exhaustion.
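The replica-side backpressure described above can be modeled in a few lines. The sketch below is an illustrative model only, not Ray Serve's router or replica code: the class `BackpressureSketch` and its methods are hypothetical names, and the router queue is assumed to be a simple FIFO.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/**
 * Illustrative model only (hypothetical class, not Ray Serve source).
 * One replica accepts at most maxOngoingRequests requests at a time;
 * anything beyond that waits in a FIFO queue that stands in for the router.
 */
final class BackpressureSketch {
    private final int maxOngoingRequests;
    private final Queue<String> routerQueue = new ArrayDeque<>();
    private int ongoing = 0;

    BackpressureSketch(int maxOngoingRequests) {
        if (maxOngoingRequests < 1) {
            throw new IllegalArgumentException("maxOngoingRequests must be >= 1");
        }
        this.maxOngoingRequests = maxOngoingRequests;
    }

    /** Returns true if the request went straight to the replica, false if it was queued. */
    synchronized boolean submit(String requestId) {
        if (ongoing < maxOngoingRequests) {
            ongoing++;
            return true;
        }
        routerQueue.add(requestId);
        return false;
    }

    /**
     * Marks one ongoing request as finished. If a request is waiting, it takes
     * over the freed slot and its id is returned; otherwise returns null.
     */
    synchronized String complete() {
        if (ongoing == 0) {
            throw new IllegalStateException("no ongoing request to complete");
        }
        String next = routerQueue.poll();
        if (next == null) {
            ongoing--; // nothing waiting, so the slot becomes free
        }
        return next;
    }

    synchronized int ongoing() { return ongoing; }

    synchronized int queued() { return routerQueue.size(); }

    public static void main(String[] args) {
        BackpressureSketch replica = new BackpressureSketch(2);
        System.out.println("a dispatched: " + replica.submit("a"));
        System.out.println("b dispatched: " + replica.submit("b"));
        System.out.println("c dispatched: " + replica.submit("c"));
        System.out.println("promoted after completion: " + replica.complete());
    }
}
```

With a limit of 2, the third request is held back until one of the first two completes, which is the behavior that protects a replica running long or memory-heavy requests.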

Usage

Apply this heuristic when:

Deploying Ray Serve to production with significant concurrent traffic
Experiencing request timeouts or slow response times
Seeing controller connection errors or long-poll failures
Tuning `maxOngoingRequests` for deployments with varying request durations (short API calls vs long inference)

The Insight (Rule of Thumb)

Action: Set `maxOngoingRequests` based on your request duration profile. Use the default (100) for fast requests; lower it (e.g., 5-10) for long-running inference.
Value: Default `maxOngoingRequests=100`, `CONTROLLER_MAX_CONCURRENCY=15000`, `numReplicas=1`.
Trade-off: Higher `maxOngoingRequests` improves throughput for fast requests but risks OOM for memory-intensive inference. Lower values provide better backpressure but may underutilize replicas.
Controller sizing: `CONTROLLER_MAX_CONCURRENCY=15000` supports up to ~15,000 simultaneous deployment handles. If you exceed this, the controller becomes a bottleneck.
Constructor retries: `maxConstructorRetryCount=20` allows up to 20 retries for failed deployment initialization. Increase for deployments with flaky model loading.
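The controller sizing rule above reduces to simple arithmetic. The following is a minimal sketch, assuming exactly one long-poll connection per deployment handle as the source comment states; the class `ControllerHeadroomSketch` and its methods are hypothetical helpers, not part of Ray Serve.

```java
/**
 * Illustrative helper only (hypothetical, not part of Ray Serve).
 * Assumes one long-poll connection per deployment handle.
 */
final class ControllerHeadroomSketch {
    // Value quoted from Constants.java on this page.
    static final int CONTROLLER_MAX_CONCURRENCY = 15000;

    /** Long-poll slots left on the controller; negative means the limit is exceeded. */
    static int remainingSlots(int numHandles) {
        if (numHandles < 0) {
            throw new IllegalArgumentException("numHandles must be >= 0");
        }
        return CONTROLLER_MAX_CONCURRENCY - numHandles;
    }

    /** True if every handle can hold its long poll at the same time. */
    static boolean fits(int numHandles) {
        return remainingSlots(numHandles) >= 0;
    }

    public static void main(String[] args) {
        int planned = 12000; // example handle count, chosen for illustration
        System.out.println("fits: " + fits(planned)
                + ", remaining slots: " + remainingSlots(planned));
    }
}
```

A check like this is most useful as a capacity-planning guard before a rollout that creates many new handles.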

Reasoning

The concurrency parameters form a two-level backpressure system (replica and controller), backed by a bounded wait loop in the router:

Replica level: `maxOngoingRequests` limits how many requests a single replica processes concurrently. This prevents individual replicas from being overwhelmed. For CPU-bound inference, a value close to 1 may be optimal. For I/O-bound services, higher values (50-100) utilize concurrency better.

Controller level: The controller accepts one long-poll connection per deployment handle. With `O(num_handles)` connections, the concurrency must scale linearly. The 15,000 default supports large-scale deployments but can be a hard limit.

Router level: The `ReplicaSet` uses a wait loop (50 iterations × 20 microseconds = 1 ms) before giving up on finding an available replica. Bounding the loop prevents hot-spinning while keeping latency low.
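The bounded wait loop at the router level can be sketched as follows. This is not the `ReplicaSet` source: the class `BoundedWaitSketch`, the `pollForReplica` method, and the supplier-based replica picker are assumptions made for illustration, and only the 50 × 20 microsecond budget comes from the text above.

```java
import java.util.Optional;
import java.util.concurrent.locks.LockSupport;
import java.util.function.Supplier;

/**
 * Illustrative sketch only (hypothetical names, not the ReplicaSet source).
 * Polls for an available replica a fixed number of times, pausing briefly
 * between attempts instead of spinning on the CPU.
 */
final class BoundedWaitSketch {
    static final int MAX_ATTEMPTS = 50;       // iterations, from the text above
    static final long PAUSE_NANOS = 20_000L;  // 20 microseconds per pause

    /**
     * Calls tryPick until it yields a replica or the attempt budget runs out.
     * Returns Optional.empty() when no replica became available in time.
     */
    static <T> Optional<T> pollForReplica(Supplier<Optional<T>> tryPick) {
        for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
            Optional<T> picked = tryPick.get();
            if (picked.isPresent()) {
                return picked;
            }
            // parkNanos may return early, so the pause is a lower-effort hint,
            // not a precise timer; the attempt count still bounds the loop.
            LockSupport.parkNanos(PAUSE_NANOS);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<String> result =
                BoundedWaitSketch.<String>pollForReplica(Optional::empty);
        System.out.println("replica found: " + result.isPresent());
    }
}
```

The fixed attempt count is what makes the wait bounded: the caller always gets an answer, either a replica or an empty result it can handle by queueing or retrying.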

Code Evidence

Per-replica concurrency limit from `DeploymentConfig.java:25-29`:

/**
 * The maximum number of requests that can be sent to a replica of this
 * deployment without receiving a response. Defaults to 100.
 */
private Integer maxOngoingRequests = 100;

Controller concurrency scaling from `Constants.java:36-40`:

/**
 * Because ServeController will accept one long poll request per handle,
 * its concurrency needs to scale as O(num_handles)
 */
public static final int CONTROLLER_MAX_CONCURRENCY = 15000;

Constructor retry limit from `DeploymentConfig.java:69`:

private Integer maxConstructorRetryCount = 20;
