
Heuristic:SeldonIO Seldon core Model Scheduling Preference Tip

From Leeroopedia
Knowledge Sources
Domains Optimization, Scheduling
Last Updated 2026-02-13 14:00 GMT

Overview

Model scheduling heuristic that prefers keeping models on their current server to avoid costly reloads and flip-flops between candidate servers.

Description

Seldon Core 2's scheduler assigns models to inference servers based on four criteria evaluated in order: capability matching, replica availability, memory capacity, and server affinity (preference for the current server). The fourth criterion is a deliberate anti-flip-flop measure: when rescheduling, the system prefers to keep a model on its existing server rather than migrating it. This prevents costly model reload cycles when multiple servers are equally suitable.
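The four criteria described above can be read as three hard filters followed by one preference. The sketch below illustrates that reading only; it is not the scheduler's actual code. The `Server` and `Model` types, their fields, and the `eligible` and `pickServer` helpers are hypothetical names introduced for illustration, with fields loosely mirroring the Model's `spec.requirements`, `spec.replicas`, and `spec.memory`. Falling back to the first eligible server when the current one is not eligible is also an assumption.

```go
package main

import "fmt"

// Server is a hypothetical, simplified view of an inference server.
type Server struct {
	Name         string
	Capabilities map[string]bool
	FreeMemory   []uint64 // available memory per replica, in bytes
}

// Model is a hypothetical, simplified view of a Model spec.
type Model struct {
	Requirements  []string // mirrors spec.requirements
	Replicas      int      // mirrors spec.replicas
	Memory        uint64   // mirrors spec.memory, in bytes
	CurrentServer string   // empty if the model is not placed yet
}

// eligible applies criteria 1 to 3 as hard filters.
func eligible(s Server, m Model) bool {
	// Criterion 1: every requirement must be among the server capabilities.
	for _, req := range m.Requirements {
		if !s.Capabilities[req] {
			return false
		}
	}
	// Criterion 2: the server needs at least as many replicas as desired.
	if len(s.FreeMemory) < m.Replicas {
		return false
	}
	// Criterion 3: enough replicas must have room for the model.
	fit := 0
	for _, free := range s.FreeMemory {
		if free >= m.Memory {
			fit++
		}
	}
	return fit >= m.Replicas
}

// pickServer returns the chosen server name and whether one was found.
// Criterion 4: an eligible server that already hosts the model wins.
func pickServer(servers []Server, m Model) (string, bool) {
	chosen := ""
	for _, s := range servers {
		if !eligible(s, m) {
			continue
		}
		if s.Name == m.CurrentServer {
			return s.Name, true
		}
		if chosen == "" {
			chosen = s.Name // assumed fallback: first eligible server
		}
	}
	return chosen, chosen != ""
}

func main() {
	servers := []Server{
		{Name: "a", Capabilities: map[string]bool{"sklearn": true}, FreeMemory: []uint64{100, 100}},
		{Name: "b", Capabilities: map[string]bool{"sklearn": true}, FreeMemory: []uint64{500, 500}},
	}
	m := Model{Requirements: []string{"sklearn"}, Replicas: 2, Memory: 50, CurrentServer: "b"}
	fmt.Println(pickServer(servers, m)) // prints: b true

	m.CurrentServer = ""
	fmt.Println(pickServer(servers, m)) // prints: a true
}
```

Both servers pass the filters in this example, so the outcome depends only on where the model already runs. This is the situation in which the affinity preference prevents a reload.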

Usage

Be aware of this heuristic when troubleshooting model placement or planning server capacity. If a model is on a seemingly suboptimal server, it may be staying there intentionally to avoid a reload. The heuristic is also relevant when scaling down servers: models on removed replicas are rescheduled, but the scheduler tries to minimize disruption.

The Insight (Rule of Thumb)

  • Action: Understand the four scheduling criteria in priority order:
    1. Server has matching capabilities with Model `spec.requirements`
    2. Server has enough replicas for desired Model `spec.replicas`
    3. Each replica has enough available memory for Model `spec.memory`
    4. Server already hosting the model is preferred (anti-flip-flop)
  • Value: N/A (built-in behavior, not configurable).
  • Trade-off: Models may remain on less-optimal servers to avoid reload costs. A Model can be assigned to at most one Server.
  • Partial scheduling (v2.9+): If `minReplicas` is defined, the system can partially schedule a model (fewer replicas than desired) with `ModelAvailable` status.
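The partial scheduling rule in the last bullet can be sketched as a small decision function. This is an illustration under stated assumptions, not the scheduler's implementation. The function name `replicasToSchedule` and its parameters are hypothetical. Treating `minReplicas == 0` as "not defined", and placing as many replicas as are available during partial scheduling, are both assumptions.

```go
package main

import "fmt"

// replicasToSchedule returns how many replicas to place and whether
// scheduling succeeds at all.
//
//	desired     mirrors the Model's spec.replicas
//	minReplicas mirrors the Model's minReplicas (0 is assumed to mean "not defined")
//	available   is the number of server replicas able to host the model
func replicasToSchedule(desired, minReplicas, available int) (int, bool) {
	// Full scheduling: every desired replica can be placed.
	if available >= desired {
		return desired, true
	}
	// Partial scheduling: fewer replicas than desired, but at least minReplicas.
	if minReplicas > 0 && available >= minReplicas {
		return available, true
	}
	// Neither full nor partial scheduling is possible.
	return 0, false
}

func main() {
	fmt.Println(replicasToSchedule(3, 0, 2)) // prints: 0 false
	fmt.Println(replicasToSchedule(3, 2, 2)) // prints: 2 true
	fmt.Println(replicasToSchedule(3, 1, 5)) // prints: 3 true
}
```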

Reasoning

Model loading is an expensive operation (downloading artifacts, deserializing weights, allocating GPU memory). Frequent server reassignment (flip-flopping) repeats this work and introduces downtime during transitions. The anti-flip-flop preference keeps placement stable once a model is running.

From `docs-gb/models/scheduling.md`: "Server that already hosts the Model is preferred to reduce flip-flops between different candidate servers."

Known limitation from `scheduler/pkg/store/memory.go:725`:

// TODO we should not reschedule models on servers with dedicated models,
// e.g. non shareable servers

When a server replica is removed, its models are rescheduled even if the server is dedicated (non-shareable). The TODO above flags this as a known issue.

Key constraint: A given Model can be assigned to at most one Server, and that Server must have enough replicas to host all of the Model's desired replicas.
