
Heuristic:KServe Autoscaler Concurrency Target

From Leeroopedia
Knowledge Sources
Domains: Autoscaling, Performance
Last Updated: 2026-02-13 14:00 GMT

Overview

The Knative concurrency target is a soft limit evaluated over a 60-second stable window; a 6-second panic window triggers rapid scale-up when load exceeds 2x the target.

Description

KServe supports four autoscaler backends: KPA (Knative Pod Autoscaler), HPA (Horizontal Pod Autoscaler), KEDA, and External. The KPA uses concurrency-based metrics with a two-window system: a 60-second stable window for normal scaling and a 6-second panic window that triggers rapid scale-up when load exceeds 2x the target.
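The autoscaler backend is selected per InferenceService via the `serving.kserve.io/autoscalerClass` annotation. A minimal sketch follows; the service name, model format, and storage URI are placeholders, not values from the source:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: example-model                          # placeholder name
  annotations:
    serving.kserve.io/autoscalerClass: "hpa"   # one of: kpa | hpa | keda | external
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn                          # placeholder model format
      storageUri: "gs://example-bucket/model"  # placeholder URI
```

Note that the concurrency-based two-window behavior described above applies only when the `kpa` class is in effect; `hpa` scales on resource metrics instead.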

Usage

Use this heuristic when configuring autoscaling annotations on InferenceService deployments. Choose the right autoscaler class and concurrency target based on your latency and cost requirements.

The Insight (Rule of Thumb)

  • Action: Set autoscaling annotations based on workload pattern:
    • `autoscaling.knative.dev/target`: Soft concurrency limit per pod (KPA)
    • `serving.kserve.io/targetUtilizationPercentage`: CPU utilization target (HPA)
    • `serving.kserve.io/autoscalerClass`: Choose `kpa`, `hpa`, `keda`, or `external`
  • Value:
    • Start with `target=1` for latency-sensitive inference
    • Use `target=8` or higher for throughput-oriented batch inference
    • Set `minReplicas >= 1` to avoid cold start delays
  • Trade-off: Lower targets = more pods = higher cost but lower latency. Higher targets = fewer pods = lower cost but higher p99 latency.
  • ContainerConcurrency: Higher CC (e.g., 8) reduces p99 at moderate load but increases tail latency variability at very high load (1000+ RPS).
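Putting the rule of thumb together, a latency-sensitive KPA configuration might look like the following sketch (service name, model format, and storage URI are illustrative placeholders):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: latency-sensitive-model            # placeholder name
  annotations:
    autoscaling.knative.dev/target: "1"    # soft concurrency limit per pod (KPA)
spec:
  predictor:
    minReplicas: 1                         # keep >= 1 to avoid cold start delays
    model:
      modelFormat:
        name: sklearn                      # placeholder model format
      storageUri: "gs://example-bucket/model"  # placeholder URI
```

For throughput-oriented batch inference, the same manifest with `autoscaling.knative.dev/target: "8"` (or higher) trades p99 latency for fewer pods.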

Reasoning

The concurrency target is a soft limit: the autoscaler observes actual concurrency and adjusts the replica count to stay near the target, but bursts can temporarily exceed it. The two-window system provides:

  • Stable window (60s): Prevents flapping by requiring sustained demand before scaling
  • Panic window (6s): Rapid scale-up when load spikes exceed 2x target
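The window and threshold values above correspond to cluster-wide settings in Knative's `config-autoscaler` ConfigMap. A sketch showing the defaults the heuristic describes (key names are Knative's; values shown are the defaults):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
data:
  stable-window: "60s"                 # requires sustained demand before scaling
  panic-window-percentage: "10.0"      # panic window = 10% of stable window = 6s
  panic-threshold-percentage: "200.0"  # enter panic mode when load exceeds 2x target
```

These defaults can be tuned cluster-wide here, or per revision via `autoscaling.knative.dev/*` annotations.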

Benchmark evidence from `test/benchmark/README.md`:

ContainerConcurrency=8:
  5/s:    p50=6.2ms   p99=7.0ms   (100% success)
  500/s:  p50=4.1ms   p99=4.9ms   (100% success)
  1000/s: p50=398ms   p99=2.9s    (100% success, shows saturation)

ContainerConcurrency=1:
  1/s:    p50=104ms   p99=112ms   (100% success)
  10/s:   p50=702ms   p99=3.5s    (100% success, scaling lag visible)

Default autoscaler class from `pkg/constants/constants.go`:

DefaultAutoscalerClass = AutoscalerClassHPA
