Heuristic: KServe Autoscaler Concurrency Target
| Knowledge Sources | |
|---|---|
| Domains | Autoscaling, Performance |
| Last Updated | 2026-02-13 14:00 GMT |
Overview
The Knative concurrency target is a soft limit with a 60-second stable window; panic mode triggers when load exceeds 2x the target within a 6-second window.
Description
KServe supports four autoscaler backends: KPA (Knative Pod Autoscaler), HPA (Horizontal Pod Autoscaler), KEDA, and External. The KPA uses concurrency-based metrics with a two-window system: a 60-second stable window for normal scaling and a 6-second panic window that triggers rapid scale-up when load exceeds 2x the target.
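A minimal sketch of an InferenceService using the KPA with a concurrency target. The resource name and `storageUri` are placeholders; the annotation keys are the ones named in this heuristic:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris                        # placeholder name
  annotations:
    serving.kserve.io/autoscalerClass: kpa
    autoscaling.knative.dev/target: "1"     # soft concurrency limit per pod
spec:
  predictor:
    minReplicas: 1                          # keep one warm pod to avoid cold starts
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://example-bucket/model # placeholder URI
```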
Usage
Use this heuristic when configuring autoscaling annotations on InferenceService deployments. Choose the right autoscaler class and concurrency target based on your latency and cost requirements.
The Insight (Rule of Thumb)
- Action: Set autoscaling annotations based on workload pattern:
- `autoscaling.knative.dev/target`: Soft concurrency limit per pod (KPA)
- `serving.kserve.io/targetUtilizationPercentage`: CPU utilization target (HPA)
- `serving.kserve.io/autoscalerClass`: Choose `kpa`, `hpa`, `keda`, or `external`
- Value:
- Start with `target=1` for latency-sensitive inference
- Use `target=8` or higher for throughput-oriented batch inference
- Set `minReplicas >= 1` to avoid cold start delays
- Trade-off: lower targets mean more pods (higher cost, lower latency); higher targets mean fewer pods (lower cost, higher p99 latency).
- ContainerConcurrency: Higher CC (e.g., 8) reduces p99 at moderate load but increases tail latency variability at very high load (1000+ RPS).
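A throughput-oriented variant of the predictor spec above, pairing a higher concurrency target with a matching hard `containerConcurrency` limit (values illustrative, per the rule of thumb):

```yaml
spec:
  predictor:
    minReplicas: 1            # avoid cold start delays
    containerConcurrency: 8   # hard per-pod limit; pairs with a soft target of 8
```

With `autoscaling.knative.dev/target: "8"` on the metadata, each pod is expected to serve up to 8 concurrent requests before the autoscaler adds replicas.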
Reasoning
The concurrency target is a soft limit - the autoscaler observes actual concurrency and adjusts replicas to stay near the target, but bursts can temporarily exceed it. The two-window system provides:
- Stable window (60s): Prevents flapping by requiring sustained demand before scaling
- Panic window (6s): Rapid scale-up when load spikes exceed 2x target
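The two-window logic above can be sketched as follows. This is an illustrative approximation of KPA-style scaling math, not the actual Knative implementation; `stable_avg` and `panic_avg` stand in for the average observed concurrency over the 60s and 6s windows:

```python
import math

def desired_replicas(stable_avg, panic_avg, target, current_replicas, min_replicas=1):
    """Approximate KPA-style two-window replica calculation (illustrative only)."""
    stable_desired = math.ceil(stable_avg / target)
    panic_desired = math.ceil(panic_avg / target)
    # Panic mode: short-window load exceeds 2x the target capacity of the
    # current fleet, so scale up from the panic window and never scale down.
    if panic_avg >= 2 * target * current_replicas:
        return max(panic_desired, current_replicas, min_replicas)
    # Stable mode: track the 60s average, respecting the replica floor.
    return max(stable_desired, min_replicas)

# Steady load near target: one pod suffices.
print(desired_replicas(stable_avg=8.0, panic_avg=8.0, target=8, current_replicas=1))   # 1
# Burst: 40 concurrent requests against target=8 trips panic mode.
print(desired_replicas(stable_avg=10.0, panic_avg=40.0, target=8, current_replicas=1)) # 5
```

Note how panic mode reacts to the 6-second window immediately, while the stable path only follows sustained demand, which is what prevents flapping.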
Benchmark evidence from `test/benchmark/README.md`:
| ContainerConcurrency | Load | p50 | p99 | Success | Notes |
|---|---|---|---|---|---|
| 8 | 5/s | 6.2ms | 7.0ms | 100% | |
| 8 | 500/s | 4.1ms | 4.9ms | 100% | |
| 8 | 1000/s | 398ms | 2.9s | 100% | shows saturation |
| 1 | 1/s | 104ms | 112ms | 100% | |
| 1 | 10/s | 702ms | 3.5s | 100% | scaling lag visible |
Default autoscaler class from `pkg/constants/constants.go`:

```go
DefaultAutoscalerClass = AutoscalerClassHPA
```