Heuristic: SeldonIO Seldon Core Over-Commit Memory Tip

From Leeroopedia
Knowledge Sources
Domains Optimization, Multi_Model_Serving
Last Updated 2026-02-13 14:00 GMT

Overview

A memory optimization technique that uses LRU-based model eviction with a configurable over-commit percentage to host 10x or more models beyond active memory capacity.

Description

Seldon Core 2 supports Multi-Model Serving (MMS) where a single inference server hosts multiple models simultaneously. The over-commit feature allows loading more models than can fit in active memory by using an LRU (Least Recently Used) cache eviction strategy. When memory is full, the least recently used model is evicted to disk. When an evicted model receives an inference request, it is automatically reloaded (with ~100ms latency). This trades a small latency penalty for dramatically higher model density.

Usage

Use this heuristic when you need to maximize model density on inference servers, especially when many models have low or sporadic traffic. It is standard practice when deploying dozens or hundreds of models on shared infrastructure. The default over-commit percentage of 10% is conservative; increase it for higher density with models that tolerate reload latency.

The Insight (Rule of Thumb)

  • Action: Set `SELDON_OVERCOMMIT_PERCENTAGE` environment variable on the agent to control memory over-commit budget.
  • Value: Default is `10` (10% over-commit). For higher density, increase to `20-50`. Set to `0` to disable.
  • Trade-off: Higher over-commit = more models hosted, but evicted models incur ~100ms reload latency on next request.
  • Capacity formula: Total capacity = `MEMORY_REQUEST * (1 + OVERCOMMIT_PERCENTAGE/100)`.
  • Example: 10MB memory + 20% over-commit = 12MB total capacity, enough for 12 x 1MB models, with 10 resident in memory and up to 2 evicted to disk.
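The capacity formula above can be checked with a few lines of arithmetic. This is an illustrative sketch; the function name is ours, not part of Seldon Core:

```python
def overcommit_capacity(memory_request_mb: float, overcommit_pct: float) -> float:
    """Total model-hosting capacity = MEMORY_REQUEST * (1 + OVERCOMMIT_PERCENTAGE/100)."""
    return memory_request_mb * (1 + overcommit_pct / 100)

# Example from above: 10MB memory with the default and a raised over-commit.
print(overcommit_capacity(10, 10))  # 11.0 -> 11MB capacity at the default 10%
print(overcommit_capacity(10, 20))  # 12.0 -> 12MB capacity, room for 12 x 1MB models
print(overcommit_capacity(10, 0))   # 10.0 -> over-commit disabled
```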

Reasoning

The over-commit mechanism leverages the observation that in multi-model deployments, most models follow a long-tail traffic pattern - a small number of models receive the majority of requests while many models are rarely invoked. Keeping all models loaded in memory is wasteful. The LRU eviction strategy ensures that frequently-used models stay in memory while idle models are evicted, with automatic reload on demand.
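To make the mechanism concrete, here is a toy Python sketch of LRU eviction with reload-on-demand. The class and method names are illustrative assumptions, not Seldon Core's actual agent code:

```python
class ToyModelCache:
    """Toy LRU: at most `max_resident` models stay in memory; the rest spill to disk."""

    def __init__(self, max_resident: int):
        self.max_resident = max_resident
        self.resident = []  # model names, least recently used first

    def infer(self, name: str) -> str:
        if name in self.resident:
            # Hot path: model already in memory, just mark it most recently used.
            self.resident.remove(name)
            self.resident.append(name)
            return "hit: served from memory"
        # Cold path: evict the LRU model if full, then reload the requested one
        # (this reload is where the ~100ms penalty would occur in practice).
        if len(self.resident) >= self.max_resident:
            self.resident.pop(0)
        self.resident.append(name)
        return "miss: reloaded from disk"
```

With `max_resident=10` and 11 models, loading the 11th evicts the least recently used model; a later request for the evicted model reloads it and evicts another, so all 11 remain servable.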

Empirical evidence from `samples/local-over-commit-test.md`:

  • 1 server replica with 10MB memory slots + 20% over-commit = 12MB capacity
  • 11 iris models loaded @ 1MB each (11MB < 12MB capacity) - all succeed
  • 10 models active in memory, 1 evicted to disk
  • All 11 models serve inference requests successfully
  • Evicted model reloads in ~100ms on first request

Key limitations:

  • The LRU cache uses O(n) linear scans (TODO comments in `scheduler/pkg/agent/cache/lru_cache_manager.go` note this should be optimized)
  • Not recommended for latency-sensitive models that cannot tolerate ~100ms reloads
  • Model load retries are limited to 5 attempts with a 120-minute total timeout
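On the first limitation: the linear scan is a property of the current implementation, not of LRU itself. As a sketch (purely illustrative, not a patch for the Go implementation), a hash-map-backed ordering such as Python's `OrderedDict` does the same bookkeeping in O(1):

```python
from collections import OrderedDict

# Illustrative O(1) LRU bookkeeping: the dict gives O(1) membership checks,
# while its insertion order tracks recency of use.
lru = OrderedDict()

def touch(name: str) -> None:
    """Mark a model as most recently used, in O(1)."""
    if name in lru:
        lru.move_to_end(name)
    else:
        lru[name] = True

def evict() -> str:
    """Remove and return the least recently used model, in O(1)."""
    name, _ = lru.popitem(last=False)
    return name
```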
