Heuristic:Ollama Ollama VRAM Recovery And Scheduling
| Knowledge Sources | |
|---|---|
| Domains | GPU, Optimization, Scheduling |
| Last Updated | 2026-02-14 22:00 GMT |
Overview
After unloading a model, the Ollama scheduler polls GPU free memory every 250ms and treats VRAM as recovered once 75% of the expected freed memory is visible, giving up after a 5-second timeout.
Description
When the Ollama scheduler unloads a model to make room for a new one, the GPU VRAM does not become available instantly. The operating system and GPU driver need time to reclaim the memory. This heuristic implements a polling-based convergence check that monitors GPU free memory at 250ms intervals and considers the VRAM "recovered" once 75% of the expected freed memory is detected. If convergence does not occur within 5 seconds, the scheduler proceeds with its own memory estimates.
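The polling check described above can be sketched as a ticker loop with a deadline. This is a minimal illustration, not Ollama's implementation: `waitForVRAMRecovery`, `freeFn`, and the `demo*` constants are hypothetical names, and the demo shortens the 250ms interval and 5-second timeout so it finishes quickly.

```go
package main

import (
	"fmt"
	"time"
)

// Shortened stand-ins for the 250ms poll interval and 5s timeout (demo only).
const (
	demoInterval = 10 * time.Millisecond
	demoTimeout  = 200 * time.Millisecond
)

// waitForVRAMRecovery polls freeFn every interval until free memory has grown
// by more than threshold*expected bytes relative to before, or until timeout
// elapses. It reports whether convergence was observed.
// Hypothetical helper: the names and signature are assumptions of this sketch.
func waitForVRAMRecovery(freeFn func() uint64, before, expected uint64,
	threshold float64, interval, timeout time.Duration) bool {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	deadline := time.After(timeout)

	for {
		select {
		case <-deadline:
			// Timed out: the caller would fall back to its own estimates.
			return false
		case <-ticker.C:
			now := freeFn()
			// Check now > before first so the unsigned subtraction cannot wrap.
			if now > before && float64(now-before) > float64(expected)*threshold {
				return true
			}
		}
	}
}

func main() {
	const gib = uint64(1) << 30
	before := 2 * gib
	expected := 8 * gib

	// Simulated driver: reported free memory rises by 2 GiB on each poll.
	reported := before
	fakeFree := func() uint64 {
		reported += 2 * gib
		return reported
	}

	ok := waitForVRAMRecovery(fakeFree, before, expected, 0.75, demoInterval, demoTimeout)
	fmt.Println("converged:", ok)
}
```

With the document's real values, the call would pass `250*time.Millisecond` and `5*time.Second` instead of the demo constants.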
Additionally, the scheduler limits the number of concurrently loaded models to 3 per GPU by default, even if more models would fit, because loading many small models on a large GPU can cause stalling behavior.
Usage
This heuristic applies during model eviction and reload scenarios. When a user requests a model that is not currently loaded and all GPU slots are occupied, the scheduler must unload an existing model and wait for VRAM recovery before loading the new one. Understanding these timing parameters is critical for tuning multi-model serving configurations.
The Insight (Rule of Thumb)
- Action: After unloading a model, poll GPU VRAM every 250ms until 75% of expected freed memory is recovered.
- Value: Convergence typically occurs in 0.5-1.5 seconds. Timeout at 5 seconds.
- Trade-off: If VRAM does not converge within 5 seconds, the scheduler trusts its own memory estimates rather than GPU-reported values. This avoids indefinite blocking but may lead to slightly inaccurate memory predictions.
- Tuning: Set `OLLAMA_MAX_LOADED_MODELS` to limit concurrent models. Set `OLLAMA_GPU_OVERHEAD` to reserve additional VRAM per GPU for system overhead.
- Default models per GPU: 3, chosen to prevent stalling even when more models would fit in VRAM.
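The tuning knobs above can be illustrated with a small sketch that reads both variables and applies the overhead reservation. It rests on assumptions not confirmed by the text above: that `OLLAMA_GPU_OVERHEAD` is a byte count, and that an unset or zero `OLLAMA_MAX_LOADED_MODELS` means "use the automatic default". `envUint` and `usableVRAM` are hypothetical helpers, not part of Ollama.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// envUint reads an unsigned integer from the environment, returning def when
// the variable is unset or cannot be parsed. Hypothetical helper.
func envUint(key string, def uint64) uint64 {
	v, ok := os.LookupEnv(key)
	if !ok {
		return def
	}
	n, err := strconv.ParseUint(v, 10, 64)
	if err != nil {
		return def
	}
	return n
}

// usableVRAM subtracts a reserved overhead from the reported free memory,
// clamping at zero so the unsigned subtraction cannot wrap around.
func usableVRAM(free, overhead uint64) uint64 {
	if overhead >= free {
		return 0
	}
	return free - overhead
}

func main() {
	// Assumption: 0 stands for "no explicit limit configured".
	maxLoaded := envUint("OLLAMA_MAX_LOADED_MODELS", 0)
	// Assumption: the overhead is expressed in bytes.
	overhead := envUint("OLLAMA_GPU_OVERHEAD", 0)

	const reportedFree = uint64(24) << 30 // pretend the GPU reports 24 GiB free
	fmt.Println("max loaded models (0 = automatic):", maxLoaded)
	fmt.Println("usable VRAM bytes:", usableVRAM(reportedFree, overhead))
}
```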
Reasoning
GPU drivers (especially CUDA) can lag in reporting free memory. After a model's allocations are released (for example via `cudaFree()`), the reported free memory does not update instantly. The 75% threshold trades waiting for full convergence against proceeding quickly. A comment in the scheduler code puts typical convergence at 0.5-1.5 seconds.
The 3-model-per-GPU default exists because, according to the code comment, loading many small models on a large GPU can cause stalling even when they all fit in VRAM. A plausible cause is contention among many concurrent inference contexts (KV caches, compute graphs), although the comment itself does not name one.
From `server/sched.go:657`:
// TODO maybe we should just always trust our numbers, since
// cuda's free memory reporting is laggy
VRAM convergence polling from `server/sched.go:776-798`:
// typical convergence is 0.5-1.5s
ticker := time.NewTicker(250 * time.Millisecond)
// ...
// If we're within ~75% of the estimated memory usage recovered, bail out
if float32(freeMemoryNow-freeMemoryBefore) > float32(runner.vramSize)*0.75 {
	slog.Debug(fmt.Sprintf("gpu VRAM free memory converged after %0.2f seconds",
		time.Since(start).Seconds()))
	finished <- struct{}{}
	return
}
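To make the threshold concrete, the comparison in the excerpt can be worked through with numbers: for a runner that held 8 GiB, more than 6 GiB must reappear as free memory. The sketch below mirrors the excerpt's arithmetic in a standalone function; `recovered` is a hypothetical name, and the guard against free memory dropping below the starting value is an addition of this sketch, not something shown in the excerpt.

```go
package main

import "fmt"

// recovered reports whether free memory has grown by more than 75% of the
// unloaded runner's VRAM size, using the same float32 comparison as the
// excerpt. The early return for freeNow <= freeBefore is this sketch's own
// guard: without it, the uint64 subtraction would wrap around.
func recovered(freeBefore, freeNow, vramSize uint64) bool {
	if freeNow <= freeBefore {
		return false
	}
	return float32(freeNow-freeBefore) > float32(vramSize)*0.75
}

func main() {
	const gib = uint64(1) << 30
	before := 4 * gib
	vram := 8 * gib // threshold: more than 6 GiB must be recovered

	for _, now := range []uint64{9 * gib, 10 * gib, 11 * gib} {
		fmt.Printf("free=%d GiB recovered=%v\n", now/gib, recovered(before, now, vram))
	}
}
```

Because the comparison is strict, recovering exactly 6 GiB (free memory at 10 GiB) does not yet count as converged; 7 GiB does.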
Default models per GPU from `server/sched.go:62-65`:
// Default automatic value for number of models we allow per GPU
// Model will still need to fit in VRAM, but loading many small models
// on a large GPU can cause stalling
var defaultModelsPerGPU = 3
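One way such a per-GPU default could turn into an overall limit is sketched below. The excerpt does not show how the value is consumed, so this is an assumption: the sketch multiplies the default by the GPU count when no explicit limit is configured. `maxLoadedModels` and its handling of a host with no GPUs are hypothetical.

```go
package main

import "fmt"

const defaultModelsPerGPU = 3

// maxLoadedModels returns the explicit limit when one is set (> 0); otherwise
// it derives a limit from the GPU count. Scaling by gpuCount is an assumption
// of this sketch, not behavior confirmed by the excerpt.
func maxLoadedModels(explicit, gpuCount int) int {
	if explicit > 0 {
		return explicit
	}
	if gpuCount < 1 {
		gpuCount = 1 // assumption: treat a host without GPUs as one device
	}
	return defaultModelsPerGPU * gpuCount
}

func main() {
	fmt.Println(maxLoadedModels(0, 2)) // no explicit limit, two GPUs
	fmt.Println(maxLoadedModels(5, 2)) // explicit limit wins
}
```

Each model still has to fit in VRAM; a count limit like this only bounds how many runners are kept resident at once.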
Runner ping timeout, which is extended while the model is still loading, from `server/sched.go:716-719`:
timeout := 10 * time.Second
if runner.loading {
	timeout = 2 * time.Minute // Initial load can take a long time...
}
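The excerpt only selects a duration. A minimal sketch of how such a timeout might be applied to a health check follows, using `context.WithTimeout` from the standard library. `pingTimeout`, `pingWithTimeout`, and the `ping` callback are hypothetical names; how Ollama actually issues the ping is not shown in the excerpt.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// pingTimeout mirrors the excerpt: 10 seconds normally, 2 minutes while the
// runner is still performing its initial load.
func pingTimeout(loading bool) time.Duration {
	if loading {
		return 2 * time.Minute
	}
	return 10 * time.Second
}

// pingWithTimeout runs ping under a deadline derived from timeout.
// Hypothetical helper; the ping callback stands in for a real health check.
func pingWithTimeout(parent context.Context, timeout time.Duration,
	ping func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()
	return ping(ctx)
}

func main() {
	// A ping that takes one second unless its context is cancelled first.
	slow := func(ctx context.Context) error {
		select {
		case <-time.After(time.Second):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}

	// Use a tiny timeout so the demo finishes quickly; real callers would
	// pass pingTimeout(runnerIsLoading).
	err := pingWithTimeout(context.Background(), 20*time.Millisecond, slow)
	fmt.Println("timed out:", errors.Is(err, context.DeadlineExceeded))
	fmt.Println("idle:", pingTimeout(false), "loading:", pingTimeout(true))
}
```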