Heuristic: SeldonIO Seldon Core Autoscaling Dual-Config Tip
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Scaling |
| Last Updated | 2026-02-13 14:00 GMT |
Overview
Critical autoscaling requirement: always configure both Model autoscaling AND Server autoscaling together, or model replicas will not find capacity.
Description
Seldon Core 2 has a two-level autoscaling system. Models define `minReplicas` and `maxReplicas` to control how many model replicas should exist. Servers define `MinReplicas` and `MaxReplicas` to control how many server pods should run. Model autoscaling uses inference lag (difference between incoming and outgoing requests) to trigger scale-up, and inactivity duration to trigger scale-down. However, if models scale up but server pods are not also configured to scale up, the new model replicas will have no server capacity to schedule onto.
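A minimal sketch of the dual configuration, assuming the standard Seldon Core 2 `Model` and `Server` custom resources (the resource names, the storage URI, and the exact replica field spellings on `Server` are illustrative; check the CRDs installed in your cluster):

```yaml
# Model: Seldon's model-level autoscaler works between these bounds.
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: my-model                                # hypothetical name
spec:
  storageUri: "gs://example-bucket/my-model"    # hypothetical URI
  requirements:
    - sklearn
  minReplicas: 1       # keep one replica warm (0 would enable scale-to-zero)
  maxReplicas: 5       # ceiling for model replicas
---
# Server: must also autoscale, or new model replicas have nowhere to schedule.
apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver
spec:
  serverConfig: mlserver
  minReplicas: 1       # at least one server pod always available
  maxReplicas: 5       # sized to absorb peak model replica demand
```

The key design point is that the server's `maxReplicas` ceiling is chosen to accommodate the model's `maxReplicas`, so a model scale-up can always be satisfied by a server scale-up.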
Usage
Use this heuristic whenever configuring autoscaling for models in Seldon Core 2. This is a mandatory configuration pattern, not an optional one: if only the model level is configured, models get stuck in the `ScheduleFailed` state when they try to scale up.
The Insight (Rule of Thumb)
- Action: When configuring Model autoscaling (`minReplicas`/`maxReplicas`), always also configure Server autoscaling (`MinReplicas`/`MaxReplicas`).
- Value:
- Model: Set `minReplicas: 0` (scale-to-zero) or `minReplicas: 1` (always warm) with `maxReplicas` as ceiling.
- Server: Set `MinReplicas` >= 1 and `MaxReplicas` to accommodate peak model replica demand.
- Trade-off: Dual autoscaling adds configuration complexity but prevents stuck models.
- Monitoring parameters:
- `SELDON_MODEL_INFERENCE_LAG_THRESHOLD`: Lag threshold to trigger scale-up (default: 30 seconds)
- `SELDON_MODEL_INACTIVE_SECONDS_THRESHOLD`: Inactivity before scale-down (default: 30 seconds)
- `SELDON_SCALING_STATS_PERIOD_SECONDS`: Metrics check interval (default: 5 seconds)
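The monitoring parameters above are environment variables read by the Seldon agent. A hedged sketch of tuning them via an env override on the agent container follows; the placement on a `ServerConfig` pod spec is an assumption about your deployment, so consult your installation for where agent environment variables are actually set:

```yaml
# Assumption: agent env vars are overridable on a ServerConfig podSpec.
apiVersion: mlops.seldon.io/v1alpha1
kind: ServerConfig
metadata:
  name: mlserver-config                  # hypothetical name
spec:
  podSpec:
    containers:
      - name: agent
        env:
          - name: SELDON_MODEL_INFERENCE_LAG_THRESHOLD
            value: "30"                  # seconds of lag before scale-up
          - name: SELDON_MODEL_INACTIVE_SECONDS_THRESHOLD
            value: "30"                  # seconds idle before scale-down
          - name: SELDON_SCALING_STATS_PERIOD_SECONDS
            value: "5"                   # how often stats are sampled
```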
Reasoning
From `docs-gb/scaling/core-autoscaling-models.md`:
"WARNING: If autoscaling models, MUST autoscale servers too. Otherwise required server replicas won't spin up even if desired model replicas can't be fulfilled."
The autoscaling system works as follows:
1. Model inference lag exceeds the threshold, so the model requests more replicas.
2. The scheduler tries to place the new model replicas on server pods.
3. If no server pod has capacity, the model enters `ScheduleFailed`.
4. The server autoscaler sees the unmet demand and spins up new server pods.
Without server autoscaling configured, step 4 never happens and the model remains in `ScheduleFailed`.
Additional constraints:
- A model must have been in a stable state for 5 minutes before it is scaled again
- Statistics are checked every `SELDON_SCALING_STATS_PERIOD_SECONDS` (default 5)
- Scale-to-zero models rely on the overcommit budget to manage eviction
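For the scale-to-zero case mentioned above, a hedged sketch of a `Model` that can be evicted entirely while idle (the name and storage URI are illustrative):

```yaml
apiVersion: mlops.seldon.io/v1alpha1
kind: Model
metadata:
  name: bursty-model                              # hypothetical name
spec:
  storageUri: "gs://example-bucket/bursty-model"  # hypothetical URI
  minReplicas: 0    # scale-to-zero: no replicas while inactive
  maxReplicas: 3    # traffic after an idle period triggers scale-up
```

Pairing `minReplicas: 0` with a server whose `MinReplicas` stays at 1 or more keeps capacity available for the cold-start scale-up while still releasing model memory under the overcommit budget.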