Heuristic:Ray project Ray Serve Concurrency And Backpressure
| Knowledge Sources | |
|---|---|
| Domains | Serve, Optimization, Backpressure |
| Last Updated | 2026-02-13 16:35 GMT |
Overview
Concurrency tuning strategy for Ray Serve: keep `maxOngoingRequests` at its default of 100 per replica for fast requests, lower it for long-running inference, and treat the controller concurrency of 15,000 as the budget for the O(num_handles) long-poll connections.
Description
Ray Serve uses two key concurrency parameters that directly affect system stability and throughput. The per-replica `maxOngoingRequests` (default 100) controls backpressure: when a replica reaches this limit, new requests are queued at the router level. The controller's `CONTROLLER_MAX_CONCURRENCY` (15,000) is sized to the number of deployment handles, because each handle maintains one long-poll connection to the controller. Understanding these parameters helps avoid both request starvation (limits set too low) and OOM or resource exhaustion (limits set too high).
Usage
Apply this heuristic when:
- Deploying Ray Serve to production with significant concurrent traffic
- Experiencing request timeouts or slow response times
- Seeing controller connection errors or long-poll failures
- Tuning `maxOngoingRequests` for deployments with varying request durations (short API calls vs long inference)
The Insight (Rule of Thumb)
- Action: Set `maxOngoingRequests` based on your request duration profile. Use the default (100) for fast requests; lower it (e.g., 5-10) for long-running inference.
- Value: Default `maxOngoingRequests=100`, `CONTROLLER_MAX_CONCURRENCY=15000`, `numReplicas=1`.
- Trade-off: Higher `maxOngoingRequests` improves throughput for fast requests but risks OOM for memory-intensive inference. Lower values provide better backpressure but may underutilize replicas.
- Controller sizing: `CONTROLLER_MAX_CONCURRENCY=15000` supports up to ~15,000 simultaneous deployment handles. If you exceed this, the controller becomes a bottleneck.
- Constructor retries: `maxConstructorRetryCount=20` allows up to 20 retries for failed deployment initialization. Increase for deployments with flaky model loading.
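The rule of thumb above can be sketched as a small helper. This is an illustration only: the class `OngoingRequestsHeuristic`, the 1-second threshold that separates "fast" from "long-running" requests, and the memory-based cap are assumptions introduced here and are not part of Ray Serve. Only the values 100 and the 5-10 range come from the text.

```java
/** Hypothetical helper that encodes the rule of thumb; not part of the Ray Serve API. */
final class OngoingRequestsHeuristic {
  // Default quoted in this page.
  static final int DEFAULT_MAX_ONGOING_REQUESTS = 100;
  // Upper end of the 5-10 range suggested for long-running inference.
  static final int LONG_RUNNING_MAX_ONGOING_REQUESTS = 10;
  // Assumption: requests of one second or more count as long-running.
  static final long LONG_RUNNING_THRESHOLD_MS = 1_000;

  /**
   * Suggests a per-replica limit from the typical request duration, then caps it
   * so that the in-flight requests fit an assumed per-replica memory budget.
   */
  static int suggest(long typicalRequestMs, long perRequestMemoryMb, long replicaMemoryBudgetMb) {
    int byDuration = typicalRequestMs >= LONG_RUNNING_THRESHOLD_MS
        ? LONG_RUNNING_MAX_ONGOING_REQUESTS
        : DEFAULT_MAX_ONGOING_REQUESTS;
    long byMemory = perRequestMemoryMb <= 0
        ? Integer.MAX_VALUE
        : replicaMemoryBudgetMb / perRequestMemoryMb;
    // Never suggest fewer than one in-flight request.
    return (int) Math.max(1, Math.min(byDuration, byMemory));
  }

  public static void main(String[] args) {
    System.out.println(suggest(20, 1, 4096));     // fast API call: prints 100
    System.out.println(suggest(5000, 512, 4096)); // long inference: prints 8
  }
}
```

The memory cap reflects the trade-off bullet: a long-running request profile lowers the limit first, and the memory budget lowers it further when each request is expensive.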
Reasoning
The concurrency parameters form a layered backpressure system with three levels:
- Replica level: `maxOngoingRequests` limits how many requests a single replica processes concurrently. This prevents individual replicas from being overwhelmed. For CPU-bound inference, a value close to 1 may be optimal. For I/O-bound services, higher values (50-100) utilize concurrency better.
- Controller level: The controller accepts one long-poll connection per deployment handle, so the number of concurrent connections grows as `O(num_handles)`. The value of 15,000 accommodates large-scale deployments, but it can act as a hard limit once the handle count approaches it.
- Router level: The `ReplicaSet` uses a bounded wait loop (50 iterations of 20 microseconds each, about 1 ms in total) before giving up on finding an available replica. This prevents hot-spinning while maintaining low latency.
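The interaction between the replica limit and the router's bounded wait loop can be sketched as follows. This is a simplified model under stated assumptions, not Ray Serve's actual `ReplicaSet` code: the `Replica` class, the `tryAssign` method, and the use of a `Semaphore` to count in-flight requests are all hypothetical. Only the 50 iterations and the 20-microsecond pause come from the text.

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.Semaphore;
import java.util.concurrent.locks.LockSupport;

/** Illustrative model of replica-level backpressure; not Ray Serve's implementation. */
final class BackpressureSketch {
  /** Hypothetical stand-in for a replica with a fixed number of request slots. */
  static final class Replica {
    final String name;
    private final Semaphore slots;

    Replica(String name, int maxOngoingRequests) {
      this.name = name;
      this.slots = new Semaphore(maxOngoingRequests);
    }

    boolean tryAcquireSlot() { return slots.tryAcquire(); }

    void releaseSlot() { slots.release(); }
  }

  static final int MAX_ATTEMPTS = 50;      // iterations, as quoted in the text
  static final long PAUSE_NANOS = 20_000L; // 20 microseconds per iteration

  /** Scans the replicas up to MAX_ATTEMPTS times, pausing briefly between scans. */
  static Optional<Replica> tryAssign(List<Replica> replicas) {
    for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
      for (Replica replica : replicas) {
        if (replica.tryAcquireSlot()) {
          return Optional.of(replica);
        }
      }
      // Pause instead of hot-spinning; the actual pause length is approximate.
      LockSupport.parkNanos(PAUSE_NANOS);
    }
    return Optional.empty(); // every replica stayed at its limit
  }

  public static void main(String[] args) {
    List<Replica> replicas = List.of(new Replica("replica-0", 2));
    System.out.println(tryAssign(replicas).isPresent()); // prints true
    System.out.println(tryAssign(replicas).isPresent()); // prints true
    System.out.println(tryAssign(replicas).isPresent()); // prints false: both slots are taken
    replicas.get(0).releaseSlot();                       // one request completes
    System.out.println(tryAssign(replicas).isPresent()); // prints true
  }
}
```

What the caller does with an empty result (for example, keeping the request queued and trying again later) is deliberately left out of this sketch.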
Code Evidence
Per-replica concurrency limit from `DeploymentConfig.java:25-29`:
```java
/**
 * The maximum number of requests that can be sent to a replica of this
 * deployment without receiving a response. Defaults to 100.
 */
private Integer maxOngoingRequests = 100;
```
Controller concurrency scaling from `Constants.java:36-40`:
```java
/**
 * Because ServeController will accept one long poll request per handle,
 * its concurrency needs to scale as O(num_handles)
 */
public static final int CONTROLLER_MAX_CONCURRENCY = 15000;
```
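Because the limit scales with the number of handles, a quick headroom check can show how close a deployment is to the ceiling. The sketch below is hypothetical: the class `ControllerHeadroom`, its methods, and the warning fraction are illustrative names and values, and it assumes that each handle consumes exactly one unit of controller concurrency and that nothing else does.

```java
/** Hypothetical headroom check; not part of the Ray Serve API. */
final class ControllerHeadroom {
  // Value quoted from Constants.java above.
  static final int CONTROLLER_MAX_CONCURRENCY = 15_000;

  /** Fraction of controller concurrency used, assuming one long-poll per handle. */
  static double utilization(int numHandles) {
    if (numHandles < 0) {
      throw new IllegalArgumentException("numHandles must be >= 0");
    }
    return (double) numHandles / CONTROLLER_MAX_CONCURRENCY;
  }

  /** True when the handle count reaches the given fraction of the limit. */
  static boolean nearLimit(int numHandles, double warnFraction) {
    return utilization(numHandles) >= warnFraction;
  }

  public static void main(String[] args) {
    double warnFraction = 0.8; // assumption: warn at 80% of the limit
    System.out.println(nearLimit(3_000, warnFraction));  // prints false
    System.out.println(nearLimit(12_500, warnFraction)); // prints true
  }
}
```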
Constructor retry limit from `DeploymentConfig.java:69`:
```java
private Integer maxConstructorRetryCount = 20;
```
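To illustrate what a bounded constructor retry looks like, here is a minimal sketch. It is not Ray Serve's retry logic: the class `ConstructorRetrySketch` and its `construct` method are hypothetical, and the sketch assumes the count means one initial attempt followed by up to `maxRetries` retries, which the source does not specify.

```java
import java.util.concurrent.Callable;

/** Illustrative bounded retry around a flaky initializer; not Ray Serve's implementation. */
final class ConstructorRetrySketch {
  // Value quoted from DeploymentConfig.java above.
  static final int MAX_CONSTRUCTOR_RETRY_COUNT = 20;

  /**
   * Runs the constructor once and retries up to maxRetries more times.
   * Rethrows the last failure if every attempt fails.
   */
  static <T> T construct(Callable<T> constructor, int maxRetries) throws Exception {
    if (maxRetries < 0) {
      throw new IllegalArgumentException("maxRetries must be >= 0");
    }
    Exception lastFailure = null;
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        return constructor.call();
      } catch (Exception e) {
        lastFailure = e; // remember the failure and try again
      }
    }
    throw lastFailure;
  }

  public static void main(String[] args) throws Exception {
    int[] calls = {0};
    // Simulated model loader that fails on its first three calls.
    String model = construct(() -> {
      if (++calls[0] <= 3) {
        throw new IllegalStateException("model not ready");
      }
      return "model-loaded";
    }, MAX_CONSTRUCTOR_RETRY_COUNT);
    System.out.println(model + " after " + calls[0] + " attempts"); // prints model-loaded after 4 attempts
  }
}
```

A real deployment with flaky model loading would likely also want a delay between attempts; the sketch omits one to stay short.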