Heuristic: KServe Autoscaler Concurrency Target
| Knowledge Sources | |
|---|---|
| Domains | Autoscaling, Performance |
| Last Updated | 2026-02-13 14:00 GMT |
Overview
The Knative concurrency target is a soft limit with a 60-second stable window; panic mode triggers when load exceeds 2x the target within a 6-second window.
Description
KServe supports four autoscaler backends: KPA (Knative Pod Autoscaler), HPA (Horizontal Pod Autoscaler), KEDA, and External. The KPA uses concurrency-based metrics with a two-window system: a 60-second stable window for normal scaling and a 6-second panic window that triggers rapid scale-up when load exceeds 2x the target.
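A minimal sketch of an InferenceService using the KPA with a concurrency target. The resource name and `storageUri` are placeholders; the annotation keys are the ones named in this heuristic:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris                        # placeholder name
  annotations:
    serving.kserve.io/autoscalerClass: kpa
    autoscaling.knative.dev/target: "1"     # soft concurrency limit per pod
spec:
  predictor:
    minReplicas: 1                          # keep one warm pod to avoid cold starts
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://example-bucket/model # placeholder URI
```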
Usage
Use this heuristic when configuring autoscaling annotations on InferenceService deployments. Choose the right autoscaler class and concurrency target based on your latency and cost requirements.
The Insight (Rule of Thumb)
- Action: Set autoscaling annotations based on workload pattern:
- `autoscaling.knative.dev/target`: Soft concurrency limit per pod (KPA)
- `serving.kserve.io/targetUtilizationPercentage`: CPU utilization target (HPA)
- `serving.kserve.io/autoscalerClass`: Choose `kpa`, `hpa`, `keda`, or `external`
- Value:
- Start with `target=1` for latency-sensitive inference
- Use `target=8` or higher for throughput-oriented batch inference
- Set `minReplicas >= 1` to avoid cold start delays
- Trade-off: lower targets mean more pods (higher cost, lower latency); higher targets mean fewer pods (lower cost, higher p99 latency).
- ContainerConcurrency: Higher CC (e.g., 8) reduces p99 at moderate load but increases tail latency variability at very high load (1000+ RPS).
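A throughput-oriented variant of the predictor spec above, pairing a higher concurrency target with a matching hard `containerConcurrency` limit (values illustrative, per the rule of thumb):

```yaml
spec:
  predictor:
    minReplicas: 1            # avoid cold start delays
    containerConcurrency: 8   # hard per-pod limit; pairs with a soft target of 8
```

With `autoscaling.knative.dev/target: "8"` on the metadata, each pod is expected to serve up to 8 concurrent requests before the autoscaler adds replicas.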
Reasoning
The concurrency target is a soft limit - the autoscaler observes actual concurrency and adjusts replicas to stay near the target, but bursts can temporarily exceed it. The two-window system provides:
- Stable window (60s): Prevents flapping by requiring sustained demand before scaling
- Panic window (6s): Rapid scale-up when load spikes exceed 2x target
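The two-window logic above can be sketched as follows. This is an illustrative approximation of KPA-style scaling math, not the actual Knative implementation; `stable_avg` and `panic_avg` stand in for the average observed concurrency over the 60s and 6s windows:

```python
import math

def desired_replicas(stable_avg, panic_avg, target, current_replicas, min_replicas=1):
    """Approximate KPA-style two-window replica calculation (illustrative only)."""
    stable_desired = math.ceil(stable_avg / target)
    panic_desired = math.ceil(panic_avg / target)
    # Panic mode: short-window load exceeds 2x the target capacity of the
    # current fleet, so scale up from the panic window and never scale down.
    if panic_avg >= 2 * target * current_replicas:
        return max(panic_desired, current_replicas, min_replicas)
    # Stable mode: track the 60s average, respecting the replica floor.
    return max(stable_desired, min_replicas)

# Steady load near target: one pod suffices.
print(desired_replicas(stable_avg=8.0, panic_avg=8.0, target=8, current_replicas=1))   # 1
# Burst: 40 concurrent requests against target=8 trips panic mode.
print(desired_replicas(stable_avg=10.0, panic_avg=40.0, target=8, current_replicas=1)) # 5
```

Note how panic mode reacts to the 6-second window immediately, while the stable path only follows sustained demand, which is what prevents flapping.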
Benchmark evidence from `test/benchmark/README.md`:
| ContainerConcurrency | Load | p50 | p99 | Success | Notes |
|---|---|---|---|---|---|
| 8 | 5/s | 6.2ms | 7.0ms | 100% | |
| 8 | 500/s | 4.1ms | 4.9ms | 100% | |
| 8 | 1000/s | 398ms | 2.9s | 100% | shows saturation |
| 1 | 1/s | 104ms | 112ms | 100% | |
| 1 | 10/s | 702ms | 3.5s | 100% | scaling lag visible |
Default autoscaler class from `pkg/constants/constants.go`:

```go
DefaultAutoscalerClass = AutoscalerClassHPA
```