Heuristic:Ollama Ollama Multimodal Parallel Restriction
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Inference, Debugging |
| Last Updated | 2026-02-14 22:00 GMT |
Overview
Safety restriction that forces certain model architectures (mllama, qwen3vl, qwen3vlmoe, qwen3next, lfm2, lfm2moe) to run with num_parallel=1 to prevent inference corruption.
Description
Several model architectures in Ollama are not safe to run with more than one parallel request (num_parallel > 1). When the scheduler detects one of these architectures, it overrides the user's `OLLAMA_NUM_PARALLEL` setting and forces single-request execution. This prevents corruption of shared inference state (for example cross-attention caches and vision token buffers) when multiple requests interleave.
Additionally, embedding models (models that lack completion capability) are always forced to num_parallel=1, and multimodal models (those with vision capabilities) require a minimum context length of 2048 tokens regardless of user settings.
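The override rules described above can be sketched as a small standalone function. This is a minimal illustration, not the scheduler's code: `modelInfo`, `effectiveParallel`, and the `CanComplete` field are hypothetical names introduced here, and the family names used in `main` other than the restricted ones are placeholders.

```go
package main

import (
	"fmt"
	"slices"
)

// restrictedFamilies mirrors the architecture list named in this article.
var restrictedFamilies = []string{
	"mllama", "qwen3vl", "qwen3vlmoe", "qwen3next", "lfm2", "lfm2moe",
}

// modelInfo is a hypothetical stand-in for the scheduler's model metadata.
type modelInfo struct {
	Family      string
	CanComplete bool // false for embedding-only models
}

// effectiveParallel returns the parallelism that would apply after the
// overrides, plus a flag reporting whether the requested value was changed.
func effectiveParallel(m modelInfo, requested int) (int, bool) {
	// Embedding-only models (no completion capability) always get 1.
	if !m.CanComplete {
		return 1, requested != 1
	}
	// Restricted architectures are forced down to 1.
	if slices.Contains(restrictedFamilies, m.Family) && requested != 1 {
		return 1, true
	}
	return requested, false
}

func main() {
	models := []modelInfo{
		{Family: "mllama", CanComplete: true},
		{Family: "example-text", CanComplete: true},   // placeholder family
		{Family: "example-embed", CanComplete: false}, // placeholder family
	}
	for _, m := range models {
		n, overridden := effectiveParallel(m, 4)
		fmt.Printf("%s: parallel=%d overridden=%v\n", m.Family, n, overridden)
	}
}
```

A helper like this can be useful in capacity planning, since it shows which models will ignore a higher `OLLAMA_NUM_PARALLEL` value.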
Usage
This heuristic is critical when serving multimodal or hybrid-architecture models in production. If you observe degraded output quality or crashes with certain models under concurrent load, check whether the model architecture is on the restricted list. When the scheduler overrides the parallel setting for a restricted architecture, it logs the warning `model architecture does not currently support parallel requests` together with the architecture name.
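One way to confirm that the override fired is to search the server log for the warning text. The sketch below only matches the message string quoted in the scheduler excerpt in this article; the helper name `countOverrideWarnings` and the sample log lines in `main` are hypothetical, and the surrounding log line format is an assumption.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// overrideMsg is the warning text from the scheduler excerpt in this article.
const overrideMsg = "model architecture does not currently support parallel requests"

// countOverrideWarnings counts log lines that contain the override warning.
// It matches on the message substring only, so it does not depend on the
// exact log line layout.
func countOverrideWarnings(log string) int {
	n := 0
	sc := bufio.NewScanner(strings.NewReader(log))
	for sc.Scan() {
		if strings.Contains(sc.Text(), overrideMsg) {
			n++
		}
	}
	return n
}

func main() {
	// Hypothetical sample lines; the real log layout may differ.
	sample := `level=INFO msg="loading model"
level=WARN msg="model architecture does not currently support parallel requests" architecture=mllama
level=INFO msg="model loaded"`
	fmt.Println("override warnings:", countOverrideWarnings(sample))
}
```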
The Insight (Rule of Thumb)
- Action: Force `num_parallel=1` for architectures: `mllama`, `qwen3vl`, `qwen3vlmoe`, `qwen3next`, `lfm2`, `lfm2moe`.
- Action: Force `num_parallel=1` for all embedding-only models.
- Action: Force minimum `NumCtx=2048` for all multimodal (vision-capable) models.
- Action: Force minimum `NumCtx=4` for all models (absolute floor).
- Value: Prevents inference corruption and crashes.
- Trade-off: Reduced throughput for affected models. Users cannot parallelize requests to these architectures even on high-end hardware.
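The two context-length floors in the list above can be expressed as a single clamp. This is a minimal sketch under stated assumptions: `clampNumCtx` and the constant names are hypothetical, and the order in which the two floors are applied is chosen for illustration (the result is the same either way, because both are lower bounds).

```go
package main

import "fmt"

const (
	minCtx           = 4    // absolute floor for all models
	minMultimodalCtx = 2048 // floor for vision-capable models
)

// clampNumCtx raises the requested context length to the applicable floor.
// Values already above the floor pass through unchanged.
func clampNumCtx(requested int, hasVision bool) int {
	n := max(requested, minCtx)
	if hasVision {
		n = max(n, minMultimodalCtx)
	}
	return n
}

func main() {
	fmt.Println(clampNumCtx(1, false))   // text model, tiny request
	fmt.Println(clampNumCtx(512, true))  // vision model, below the floor
	fmt.Println(clampNumCtx(8192, true)) // vision model, above the floor
}
```

The built-in `max` requires Go 1.21 or later.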
Reasoning
These architectures have shared mutable state that is not safe for concurrent access:
- mllama (Multimodal LLaMA): Uses cross-attention layers for vision tokens that maintain state across the forward pass. Concurrent requests would corrupt the cross-attention cache.
- qwen3vl/qwen3vlmoe: Vision-language models with shared vision encoder state.
- qwen3next: Hybrid architecture with Gated Delta Net linear attention layers that maintain recurrent state.
- lfm2/lfm2moe: Models with short convolution (shortconv) layers that maintain temporal state.
The restriction was discovered through production issues and is tracked in GitHub issue #4165.
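The hazard behind these restrictions can be illustrated with a toy model. The sketch below is not Ollama code: `recurrentState` is a hypothetical stand-in for any per-sequence state (a recurrent state, a short-convolution buffer, a cross-attention cache), and the update rule is arbitrary. It shows that when two requests take turns stepping one shared state, each sees results that differ from what it would get with its own state.

```go
package main

import "fmt"

// recurrentState is a toy stand-in for state carried between tokens.
type recurrentState struct{ h uint64 }

// step folds one token into the state and returns the new state value.
func (s *recurrentState) step(token uint64) uint64 {
	s.h = s.h*31 + token
	return s.h
}

// runIsolated processes one request on its own private state.
func runIsolated(tokens []uint64) uint64 {
	var s recurrentState
	var out uint64
	for _, t := range tokens {
		out = s.step(t)
	}
	return out
}

// runInterleaved alternates two requests on ONE shared state and returns
// the last output each request observed.
func runInterleaved(a, b []uint64) (uint64, uint64) {
	var shared recurrentState
	var outA, outB uint64
	for i := 0; i < len(a) || i < len(b); i++ {
		if i < len(a) {
			outA = shared.step(a[i])
		}
		if i < len(b) {
			outB = shared.step(b[i])
		}
	}
	return outA, outB
}

func main() {
	a := []uint64{1, 2, 3}
	b := []uint64{4, 5, 6}
	gotA, gotB := runInterleaved(a, b)
	fmt.Println("A isolated:", runIsolated(a), "A interleaved:", gotA)
	fmt.Println("B isolated:", runIsolated(b), "B interleaved:", gotB)
}
```

Forcing `num_parallel=1` corresponds to always taking the isolated path: one request owns the state until it finishes.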
Architecture restriction from `server/sched.go:423-428`:
```go
// Some architectures are not safe with num_parallel > 1.
// ref: https://github.com/ollama/ollama/issues/4165
if slices.Contains([]string{"mllama", "qwen3vl", "qwen3vlmoe",
	"qwen3next", "lfm2", "lfm2moe"},
	req.model.Config.ModelFamily) && numParallel != 1 {
	numParallel = 1
	slog.Warn("model architecture does not currently support parallel requests",
		"architecture", req.model.Config.ModelFamily)
}
```
Embedding model restriction from `server/sched.go:418-421`:
```go
// Embedding models should always be loaded with parallel=1
if req.model.CheckCapabilities(model.CapabilityCompletion) != nil {
	numParallel = 1
}
```
Multimodal minimum context from `server/sched.go:92-95`:
```go
if m.CheckCapabilities(model.CapabilityVision) == nil {
	// multimodal models require at least 2048 context
	opts.NumCtx = max(opts.NumCtx, 2048)
}
```