Principle:Ollama Ollama Model Architecture Support

Knowledge Sources	Ollama
Domains	Model Architecture, Multi-Architecture
Last Updated	2025-02-15 00:00 GMT

Overview

Model Architecture Support covers Ollama's implementations of more than a dozen distinct transformer and non-transformer architectures, each with its own attention mechanism, normalization strategy, position encoding, and feed-forward network configuration.

Core Concepts

Architecture Diversity

Modern LLMs vary significantly in architecture even when they share the transformer backbone. Key differences include attention type (multi-head, grouped-query, multi-query, sliding window), normalization (pre-norm vs. post-norm placement, RMSNorm vs. LayerNorm), position encoding (rotary, absolute, ALiBi), and feed-forward structure (standard MLP, gated MLP, mixture-of-experts). Supporting this diversity requires each architecture to define its own forward pass while sharing common infrastructure.
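
To make the attention-type distinction concrete, the sketch below classifies a configuration as multi-head, grouped-query, or multi-query from its head counts, and shows which key/value head a given query head reads from. The names ClassifyAttention and KVHeadFor are hypothetical and chosen for illustration; they are not taken from Ollama's code.

```go
package main

import "fmt"

// AttentionKind names how query heads share key/value heads.
type AttentionKind string

const (
	MultiHead    AttentionKind = "multi-head"
	GroupedQuery AttentionKind = "grouped-query"
	MultiQuery   AttentionKind = "multi-query"
)

// ClassifyAttention derives the attention type from the number of query
// heads and key/value heads. Hypothetical helper, for illustration only.
func ClassifyAttention(numHeads, numKVHeads int) (AttentionKind, error) {
	switch {
	case numHeads <= 0 || numKVHeads <= 0:
		return "", fmt.Errorf("head counts must be positive, got %d and %d", numHeads, numKVHeads)
	case numHeads%numKVHeads != 0:
		return "", fmt.Errorf("%d query heads are not divisible by %d kv heads", numHeads, numKVHeads)
	case numKVHeads == numHeads:
		return MultiHead, nil // every query head has its own K/V head
	case numKVHeads == 1:
		return MultiQuery, nil // all query heads share one K/V head
	default:
		return GroupedQuery, nil // query heads share K/V heads in equal groups
	}
}

// KVHeadFor returns the index of the key/value head that a query head uses,
// assuming consecutive query heads are grouped together.
func KVHeadFor(queryHead, numHeads, numKVHeads int) int {
	groupSize := numHeads / numKVHeads
	return queryHead / groupSize
}

func main() {
	kind, err := ClassifyAttention(32, 8)
	if err != nil {
		panic(err)
	}
	fmt.Println(kind, KVHeadFor(5, 32, 8))
}
```

With 32 query heads and 8 key/value heads, each group holds 4 query heads, so query head 5 falls into group 1.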

Supported Architecture Families

Ollama supports architectures including but not limited to:

Llama family (Llama, Llama 4, Code Llama) - RMSNorm, rotary embeddings, grouped-query attention
Gemma family (Gemma 2, Gemma 3, Gemma 3n) - RMSNorm with unique scaling, logit soft-capping
Qwen family (Qwen 2, Qwen 2.5 VL, Qwen 3, Qwen 3 Next, Qwen 3 VL) - varying attention configurations, vision-language variants
Mistral family (Mistral 3) - sliding window attention
DeepSeek family (DeepSeek 2, DeepSeek OCR) - mixture-of-experts with shared experts
BERT family (BERT, NomicBERT) - bidirectional attention, encoder-only
Others (Olmo 3, GLM 4 MoE Lite, GLM OCR, GPT-OSS, LFM 2) - specialized configurations

Shared Layer Primitives

Despite architectural diversity, many components are shared across architectures. The ml/nn package provides reusable layer primitives such as linear projections, embedding lookups, and normalization operations. Architectures compose these primitives into their specific layer configurations, reducing code duplication while preserving architectural fidelity.
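
As a rough illustration of this composition pattern, the sketch below defines two toy primitives (a linear projection and an RMSNorm) over plain float32 slices and combines them into a pre-norm residual block. The types Linear, RMSNorm, and PreNormBlock are hypothetical stand-ins written for this example; they do not reproduce the actual ml/nn API, which operates on backend tensors.

```go
package main

import (
	"fmt"
	"math"
)

// Linear is a toy dense projection: y = W x (+ b). Hypothetical type.
type Linear struct {
	Weight [][]float32 // shape [out][in]
	Bias   []float32   // optional, length out
}

func (l Linear) Forward(x []float32) []float32 {
	y := make([]float32, len(l.Weight))
	for i, row := range l.Weight {
		var sum float32
		for j, w := range row {
			sum += w * x[j]
		}
		if l.Bias != nil {
			sum += l.Bias[i]
		}
		y[i] = sum
	}
	return y
}

// RMSNorm divides x by its root mean square and applies a learned gain.
type RMSNorm struct {
	Weight []float32
	Eps    float32
}

func (n RMSNorm) Forward(x []float32) []float32 {
	var sumSq float64
	for _, v := range x {
		sumSq += float64(v) * float64(v)
	}
	inv := float32(1 / math.Sqrt(sumSq/float64(len(x))+float64(n.Eps)))
	y := make([]float32, len(x))
	for i, v := range x {
		y[i] = v * inv * n.Weight[i]
	}
	return y
}

// PreNormBlock composes the shared primitives as x + Proj(Norm(x)).
// Proj must map back to the input dimension for the residual add.
type PreNormBlock struct {
	Norm RMSNorm
	Proj Linear
}

func (b PreNormBlock) Forward(x []float32) []float32 {
	h := b.Proj.Forward(b.Norm.Forward(x))
	out := make([]float32, len(x))
	for i := range x {
		out[i] = x[i] + h[i]
	}
	return out
}

func main() {
	block := PreNormBlock{
		Norm: RMSNorm{Weight: []float32{1, 1}, Eps: 1e-6},
		Proj: Linear{Weight: [][]float32{{1, 0}, {0, 1}}},
	}
	fmt.Println(block.Forward([]float32{3, 4}))
}
```

A post-norm architecture could reuse the same two primitives and change only the order of operations inside its block, which is the kind of reuse the shared-primitive approach is meant to allow.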

Vision-Language Extensions

Several architectures extend the base text transformer with vision encoders (CLIP, SigLIP) or audio encoders to handle multimodal inputs. These extensions add preprocessing pipelines for non-text modalities and cross-attention or projection layers that inject visual/audio features into the language model's representation space.
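
The projection-style variant of this idea can be sketched as follows: vision features are mapped into the text embedding dimension by a linear projector, then spliced into the token embedding sequence at the position of an image placeholder. This is a minimal sketch under stated assumptions; Project and Splice are hypothetical helpers, real projectors are often multi-layer, and the cross-attention variant is not shown.

```go
package main

import "fmt"

// Project maps each vision feature vector (length len(w[0])) into the text
// embedding space (length len(w)) with a single weight matrix w[out][in].
func Project(w [][]float32, features [][]float32) [][]float32 {
	out := make([][]float32, len(features))
	for i, f := range features {
		row := make([]float32, len(w))
		for o, wRow := range w {
			for j, wv := range wRow {
				row[o] += wv * f[j]
			}
		}
		out[i] = row
	}
	return out
}

// Splice inserts the projected image embeddings into the text embedding
// sequence at index pos, returning a new sequence.
func Splice(text, image [][]float32, pos int) [][]float32 {
	out := make([][]float32, 0, len(text)+len(image))
	out = append(out, text[:pos]...)
	out = append(out, image...)
	out = append(out, text[pos:]...)
	return out
}

func main() {
	// Toy sizes: vision dimension 3, text embedding dimension 2.
	projector := [][]float32{{1, 0, 0}, {0, 1, 0}}
	patches := [][]float32{{1, 2, 3}, {4, 5, 6}}
	tokens := [][]float32{{9, 9}, {8, 8}}

	seq := Splice(tokens, Project(projector, patches), 1)
	fmt.Println(len(seq), seq)
}
```

After splicing, the language model sees one sequence of same-width vectors and does not need to distinguish which positions came from text tokens and which from image patches.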

Implementation Notes

Each architecture is implemented in its own subdirectory under model/models/ (e.g., model/models/llama/, model/models/gemma3/, model/models/deepseek2/). The central registry in model/models/models.go maps GGUF architecture metadata strings to the corresponding constructor functions. Converter implementations in convert/ provide architecture-specific weight mapping for each supported model family.
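
The registry idea can be sketched as a map from architecture string to constructor, with each architecture registering itself at package initialization. This is an illustrative pattern only: the names Model, Constructor, Register, and New, and the use of a plain map for configuration, are assumptions made for this example and are not the signatures used in model/models/models.go.

```go
package main

import "fmt"

// Model is a hypothetical minimal interface for a loaded architecture.
type Model interface {
	Name() string
}

// Constructor builds a Model from configuration metadata (hypothetical shape).
type Constructor func(cfg map[string]any) (Model, error)

// registry maps an architecture string to its constructor.
var registry = map[string]Constructor{}

// Register adds a constructor and panics on duplicates so that conflicting
// registrations are caught at startup rather than at load time.
func Register(arch string, ctor Constructor) {
	if _, dup := registry[arch]; dup {
		panic("duplicate architecture: " + arch)
	}
	registry[arch] = ctor
}

// New looks up the architecture string and invokes its constructor.
func New(arch string, cfg map[string]any) (Model, error) {
	ctor, ok := registry[arch]
	if !ok {
		return nil, fmt.Errorf("unsupported architecture %q", arch)
	}
	return ctor(cfg)
}

// toyModel stands in for a real architecture implementation.
type toyModel struct{ name string }

func (m toyModel) Name() string { return m.name }

func init() {
	// In a real layout, each architecture package would register itself.
	Register("llama", func(map[string]any) (Model, error) { return toyModel{"llama"}, nil })
	Register("gemma3", func(map[string]any) (Model, error) { return toyModel{"gemma3"}, nil })
}

func main() {
	m, err := New("gemma3", nil)
	if err != nil {
		panic(err)
	}
	fmt.Println(m.Name())

	_, err = New("unknown-arch", nil)
	fmt.Println(err)
}
```

Keeping lookup in one place means that adding an architecture only requires a new package plus one registration call, while an unrecognized architecture string fails with a clear error instead of a nil constructor.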
