Heuristic: SeldonIO Seldon Core Autoscaling Dual-Config Tip
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Scaling |
| Last Updated | 2026-02-13 14:00 GMT |
Overview
Critical autoscaling requirement: always configure both Model autoscaling AND Server autoscaling together, or model replicas will not find capacity.
Description
Seldon Core 2 has a two-level autoscaling system. Models define `minReplicas` and `maxReplicas` to control how many model replicas should exist. Servers define `MinReplicas` and `MaxReplicas` to control how many server pods should run. Model autoscaling uses inference lag (difference between incoming and outgoing requests) to trigger scale-up, and inactivity duration to trigger scale-down. However, if models scale up but server pods are not also configured to scale up, the new model replicas will have no server capacity to schedule onto.
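A minimal sketch of the dual configuration, assuming the standard Seldon Core 2 `Model` and `Server` custom resources (the resource names, the storage URI, and the exact replica field spellings on `Server` are illustrative; check the CRDs installed in your cluster):

```yaml
# Model: Seldon's model-level autoscaler works between these bounds.
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: my-model                                # hypothetical name
spec:
  storageUri: "gs://example-bucket/my-model"    # hypothetical URI
  requirements:
    - sklearn
  minReplicas: 1       # keep one replica warm (0 would enable scale-to-zero)
  maxReplicas: 5       # ceiling for model replicas
---
# Server: must also autoscale, or new model replicas have nowhere to schedule.
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver
spec:
  serverConfig: mlserver
  minReplicas: 1       # at least one server pod always available
  maxReplicas: 5       # sized to absorb peak model replica demand
```

The key design point is that the server's `maxReplicas` ceiling is chosen to accommodate the model's `maxReplicas`, so a model scale-up can always be satisfied by a server scale-up.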
Usage
Use this heuristic whenever configuring autoscaling for models in Seldon Core 2. This is a mandatory configuration pattern, not an optional one: if only the model level is configured, models get stuck in the `ScheduleFailed` state when they try to scale up.
The Insight (Rule of Thumb)
- Action: When configuring Model autoscaling (`minReplicas`/`maxReplicas`), always also configure Server autoscaling (`MinReplicas`/`MaxReplicas`).
- Value:
- Model: Set `minReplicas: 0` (scale-to-zero) or `minReplicas: 1` (always warm) with `maxReplicas` as ceiling.
- Server: Set `MinReplicas` >= 1 and `MaxReplicas` to accommodate peak model replica demand.
- Trade-off: Dual autoscaling adds configuration complexity but prevents stuck models.
- Monitoring parameters:
- `SELDON_MODEL_INFERENCE_LAG_THRESHOLD`: Lag threshold to trigger scale-up (default: 30 seconds)
- `SELDON_MODEL_INACTIVE_SECONDS_THRESHOLD`: Inactivity before scale-down (default: 30 seconds)
- `SELDON_SCALING_STATS_PERIOD_SECONDS`: Metrics check interval (default: 5 seconds)
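The monitoring parameters above are environment variables read by the Seldon agent. A hedged sketch of tuning them via an env override on the agent container follows; the placement on a `ServerConfig` pod spec is an assumption about your deployment, so consult your installation for where agent environment variables are actually set:

```yaml
# Assumption: agent env vars are overridable on a ServerConfig podSpec.
apiVersion: mlops.seldon.io/v1alpha1
kind: ServerConfig
metadata:
  name: mlserver-config                  # hypothetical name
spec:
  podSpec:
    containers:
      - name: agent
        env:
          - name: SELDON_MODEL_INFERENCE_LAG_THRESHOLD
            value: "30"                  # seconds of lag before scale-up
          - name: SELDON_MODEL_INACTIVE_SECONDS_THRESHOLD
            value: "30"                  # seconds idle before scale-down
          - name: SELDON_SCALING_STATS_PERIOD_SECONDS
            value: "5"                   # how often stats are sampled
```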
Reasoning
From `docs-gb/scaling/core-autoscaling-models.md`:
"WARNING: If autoscaling models, MUST autoscale servers too. Otherwise required server replicas won't spin up even if desired model replicas can't be fulfilled."
The autoscaling system works as follows:
1. Model inference lag exceeds the threshold, so the model requests more replicas.
2. The scheduler tries to place the new model replicas on server pods.
3. If no server pod has capacity, the model enters `ScheduleFailed`.
4. The server autoscaler sees the unmet demand and spins up new server pods.
Without server autoscaling configured, step 4 never happens and the model remains in `ScheduleFailed`.
Additional constraints:
- A model must have been in a stable state for 5 minutes before it is scaled again
- Statistics are checked every `SELDON_SCALING_STATS_PERIOD_SECONDS` (default 5)
- Scale-to-zero models rely on the overcommit budget to manage eviction
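For the scale-to-zero case mentioned above, a hedged sketch of a `Model` that can be evicted entirely while idle (the name and storage URI are illustrative):

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: bursty-model                              # hypothetical name
spec:
  storageUri: "gs://example-bucket/bursty-model"  # hypothetical URI
  minReplicas: 0    # scale-to-zero: no replicas while inactive
  maxReplicas: 3    # traffic after an idle period triggers scale-up
```

Pairing `minReplicas: 0` with a server whose `MinReplicas` stays at 1 or more keeps capacity available for the cold-start scale-up while still releasing model memory under the overcommit budget.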