

Heuristic:Ollama Ollama Multimodal Parallel Restriction

From Leeroopedia
Knowledge Sources
Domains LLMs, Inference, Debugging
Last Updated 2026-02-14 22:00 GMT

Overview

Safety restriction that forces certain model architectures (mllama, qwen3vl, qwen3vlmoe, qwen3next, lfm2, lfm2moe) to run with num_parallel=1 to prevent inference corruption.

Description

Several model architectures in Ollama are not safe to serve with more than one request in flight per loaded model (`num_parallel > 1`). When the scheduler sees that a model's family is one of these architectures, it overrides the user's `OLLAMA_NUM_PARALLEL` setting and forces single-request execution. This prevents corruption of shared inference state (cross-attention caches, vision token buffers) that occurs when multiple requests interleave.

Additionally, embedding models (models that lack completion capability) are always forced to num_parallel=1, and multimodal models (those with vision capabilities) require a minimum context length of 2048 tokens regardless of user settings.

Usage

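This heuristic is critical when serving multimodal or hybrid-architecture models in production. If you observe degraded output quality or crashes with certain models under concurrent load, check whether the model architecture is on the restricted list. When the scheduler overrides the parallel setting for a restricted architecture, it logs the warning "model architecture does not currently support parallel requests" together with the architecture name. In the code quoted on this page, the embedding-model override sets `num_parallel=1` without logging anything.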
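
One way to check a model against the restricted list is to read the family reported by Ollama's `POST /api/show` endpoint. The sketch below assumes that the `details.family` field of that response matches the `ModelFamily` value the scheduler compares against; this page does not confirm that mapping. The helper name `isRestricted` and the `showResponse` type are hypothetical, and the example parses a hand-written sample body instead of calling a server.

```go
package main

import (
	"encoding/json"
	"fmt"
	"slices"
)

// restrictedFamilies mirrors the architecture list quoted on this page.
var restrictedFamilies = []string{
	"mllama", "qwen3vl", "qwen3vlmoe", "qwen3next", "lfm2", "lfm2moe",
}

// showResponse holds only the field this sketch reads from an /api/show reply.
// Assumption: details.family corresponds to the scheduler's ModelFamily.
type showResponse struct {
	Details struct {
		Family string `json:"family"`
	} `json:"details"`
}

// isRestricted decodes a response body and reports the family name and
// whether that family appears in restrictedFamilies.
func isRestricted(body []byte) (string, bool, error) {
	var resp showResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return "", false, err
	}
	family := resp.Details.Family
	return family, slices.Contains(restrictedFamilies, family), nil
}

func main() {
	// Hand-written sample; a real body would come from POST /api/show.
	sample := []byte(`{"details":{"family":"mllama"}}`)
	family, restricted, err := isRestricted(sample)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Printf("family=%s restricted=%t\n", family, restricted)
}
```

If the family is restricted, raising `OLLAMA_NUM_PARALLEL` will not increase concurrency for that model, so throughput planning should treat it as a single-request model.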

The Insight (Rule of Thumb)

  • Action: Force `num_parallel=1` for architectures: `mllama`, `qwen3vl`, `qwen3vlmoe`, `qwen3next`, `lfm2`, `lfm2moe`.
  • Action: Force `num_parallel=1` for all embedding-only models.
  • Action: Force minimum `NumCtx=2048` for all multimodal (vision-capable) models.
  • Action: Force minimum `NumCtx=4` for all models (absolute floor).
  • Value: Prevents inference corruption and crashes.
  • Trade-off: Reduced throughput for affected models. Users cannot parallelize requests to these architectures even on high-end hardware.
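
The rules above can be condensed into two small functions. This is a minimal sketch, not Ollama's scheduler code: `modelInfo`, `effectiveParallel`, and `effectiveNumCtx` are hypothetical names, and the boolean fields stand in for the capability checks the scheduler performs. The order in which the two context floors are applied is an assumption; because both are lower bounds, the result does not depend on it.

```go
package main

import (
	"fmt"
	"slices"
)

// serialOnly mirrors the architecture list quoted on this page.
var serialOnly = []string{
	"mllama", "qwen3vl", "qwen3vlmoe", "qwen3next", "lfm2", "lfm2moe",
}

// modelInfo is a hypothetical stand-in for the scheduler's model metadata.
type modelInfo struct {
	Family     string
	Completion bool // false for embedding-only models
	Vision     bool // true for multimodal (vision-capable) models
}

// effectiveParallel returns the parallelism that would be used after the
// embedding and architecture overrides are applied to the requested value.
func effectiveParallel(m modelInfo, requested int) int {
	if !m.Completion {
		return 1 // embedding-only models always run with parallel=1
	}
	if slices.Contains(serialOnly, m.Family) {
		return 1 // restricted architectures ignore the requested value
	}
	return requested
}

// effectiveNumCtx applies the absolute floor of 4 and, for vision-capable
// models, the multimodal floor of 2048.
func effectiveNumCtx(m modelInfo, requested int) int {
	ctx := max(requested, 4)
	if m.Vision {
		ctx = max(ctx, 2048)
	}
	return ctx
}

func main() {
	vl := modelInfo{Family: "qwen3vl", Completion: true, Vision: true}
	fmt.Println(effectiveParallel(vl, 4), effectiveNumCtx(vl, 512))
}
```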

Reasoning

These architectures have shared mutable state that is not safe for concurrent access:

  • mllama (Multimodal LLaMA): Uses cross-attention layers for vision tokens that maintain state across the forward pass. Concurrent requests would corrupt the cross-attention cache.
  • qwen3vl/qwen3vlmoe: Vision-language models with shared vision encoder state.
  • qwen3next: Hybrid architecture with Gated Delta Net linear attention layers that maintain recurrent state.
  • lfm2/lfm2moe: Models with short convolution (shortconv) layers that maintain temporal state.

This was discovered through production issues and tracked in GitHub issue #4165.
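
To illustrate why carried state and interleaved requests do not mix, the toy below pushes two sequences through a simple recurrence, first with private state and then with one shared state. It is a deliberately simplified, single-threaded model and makes no claim about how Ollama's layers are implemented; `recurrentState`, `run`, and `runInterleaved` are hypothetical names and the update rule `h = 0.5*h + x` is arbitrary.

```go
package main

import "fmt"

// recurrentState is a toy stand-in for layer state carried between tokens.
type recurrentState struct{ h float64 }

// step folds one input into the state using h = 0.5*h + x and returns h.
func (s *recurrentState) step(x float64) float64 {
	s.h = 0.5*s.h + x
	return s.h
}

// run feeds a whole sequence through a fresh, private state and returns
// the final output.
func run(seq []float64) float64 {
	var s recurrentState
	var out float64
	for _, x := range seq {
		out = s.step(x)
	}
	return out
}

// runInterleaved alternates tokens from two sequences through ONE shared
// state and returns the last output each sequence observed.
func runInterleaved(a, b []float64) (float64, float64) {
	var shared recurrentState
	var outA, outB float64
	for i := 0; i < len(a) || i < len(b); i++ {
		if i < len(a) {
			outA = shared.step(a[i])
		}
		if i < len(b) {
			outB = shared.step(b[i])
		}
	}
	return outA, outB
}

func main() {
	a := []float64{1, 2}
	b := []float64{10, 20}
	fmt.Println("isolated:   ", run(a), run(b))
	gotA, gotB := runInterleaved(a, b)
	fmt.Println("interleaved:", gotA, gotB)
}
```

With private state each sequence sees only its own history. With shared state each output also depends on the other sequence's tokens, which is the kind of cross-request contamination that forcing `num_parallel=1` avoids.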

Architecture restriction from `server/sched.go:423-428`:

// Some architectures are not safe with num_parallel > 1.
// ref: https://github.com/ollama/ollama/issues/4165
if slices.Contains([]string{"mllama", "qwen3vl", "qwen3vlmoe",
    "qwen3next", "lfm2", "lfm2moe"},
    req.model.Config.ModelFamily) && numParallel != 1 {
    numParallel = 1
    slog.Warn("model architecture does not currently support parallel requests",
        "architecture", req.model.Config.ModelFamily)
}

Embedding model restriction from `server/sched.go:418-421`:

// Embedding models should always be loaded with parallel=1
if req.model.CheckCapabilities(model.CapabilityCompletion) != nil {
    numParallel = 1
}

Multimodal minimum context from `server/sched.go:92-95`:

if m.CheckCapabilities(model.CapabilityVision) == nil {
    // multimodal models require at least 2048 context
    opts.NumCtx = max(opts.NumCtx, 2048)
}
