Principle:Ollama Ollama GGUF Model Conversion Qwen3

Knowledge Sources	Ollama
Domains	Model Conversion, Qwen
Last Updated	2025-02-15 00:00 GMT

Overview

Qwen 3 conversion handles the Alibaba Qwen 3 architecture in both its standard dense and Mixture-of-Experts (MoE) variants. It transforms a model from HuggingFace SafeTensors to GGUF format, mapping the QK normalization tensors, splitting the fused gate-up expert projections, transposing expert tensors, and recording YaRN and M-RoPE scaling metadata.

Core Concepts

Tensor Name Mapping

The converter applies the following HuggingFace-to-GGUF tensor name replacements:

lm_head -> output
model.embed_tokens -> token_embd
model.layers -> blk
model.norm -> output_norm
self_attn.k_proj -> attn_k
self_attn.k_norm -> attn_k_norm
self_attn.v_proj -> attn_v
self_attn.q_proj -> attn_q
self_attn.q_norm -> attn_q_norm
self_attn.o_proj -> attn_output
mlp.{down,gate,up}_proj -> ffn_{down,gate,up}
mlp.gate.weight -> ffn_gate_inp.weight (MoE router)
mlp.experts.down_proj -> ffn_down_exps.weight
mlp.experts.gate_up_proj -> ffn_gate_up_exps.weight
post_attention_layernorm -> ffn_norm
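
The mapping above can be exercised with a plain substring replacer. The following is a minimal sketch, not the converter's actual code: the names qwen3Replacer and ggufName are hypothetical, the brace shorthand for mlp.{down,gate,up}_proj is expanded by hand, and the sample tensor names are illustrative.

```go
package main

import (
	"fmt"
	"strings"
)

// qwen3Replacer is a hypothetical helper holding the old/new pairs from the
// list above. strings.Replacer scans the input left to right and, at each
// position, tries the old strings in argument order.
var qwen3Replacer = strings.NewReplacer(
	"lm_head", "output",
	"model.embed_tokens", "token_embd",
	"model.layers", "blk",
	"model.norm", "output_norm",
	"self_attn.k_proj", "attn_k",
	"self_attn.k_norm", "attn_k_norm",
	"self_attn.v_proj", "attn_v",
	"self_attn.q_proj", "attn_q",
	"self_attn.q_norm", "attn_q_norm",
	"self_attn.o_proj", "attn_output",
	"mlp.down_proj", "ffn_down",
	"mlp.gate_proj", "ffn_gate",
	"mlp.up_proj", "ffn_up",
	"mlp.gate.weight", "ffn_gate_inp.weight", // MoE router
	"mlp.experts.down_proj", "ffn_down_exps.weight",
	"mlp.experts.gate_up_proj", "ffn_gate_up_exps.weight",
	"post_attention_layernorm", "ffn_norm",
)

// ggufName (hypothetical) converts one HuggingFace tensor name to its GGUF name.
func ggufName(hfName string) string {
	return qwen3Replacer.Replace(hfName)
}

func main() {
	for _, name := range []string{
		"model.layers.0.self_attn.q_norm.weight",
		"model.layers.3.mlp.gate.weight",
		"model.layers.3.mlp.experts.gate_up_proj",
	} {
		fmt.Printf("%s -> %s\n", name, ggufName(name))
	}
}
```

Note that the two expert entries append .weight on the GGUF side, which is why the fused expert input name in the sample carries no .weight suffix.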

Architecture-Specific Hyperparameters

The GGUF metadata uses a dynamic architecture prefix (qwen3 for dense, qwen3moe for MoE):

block_count, context_length, embedding_length, feed_forward_length
attention.head_count, attention.head_count_kv
attention.key_length, attention.value_length -- explicit head dimension
attention.layer_norm_rms_epsilon -- RMSNorm epsilon
rope.freq_base -- RoPE theta

MoE parameters (when num_experts > 0):

expert_count, expert_used_count
norm_top_k_prob -- whether to normalize top-K probabilities

RoPE scaling:

rope.scaling.type -- "yarn" for YaRN
rope.scaling.factor -- scaling factor array
rope.mrope_section -- M-RoPE section sizes (for "mrope"/"default" types)
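
The key layout described in this section can be sketched as a function that builds a metadata map. This is an illustration under stated assumptions rather than the converter's code: qwen3Config and buildKV are hypothetical names, the field types are guesses, the scaling factor is modeled as a float32 slice because the list above calls it an array, and the numbers in main are placeholders rather than values from a real checkpoint.

```go
package main

import "fmt"

// qwen3Config is a hypothetical, trimmed-down view of the fields a converter
// would read from the HuggingFace config.json.
type qwen3Config struct {
	NumHiddenLayers       uint32
	MaxPositionEmbeddings uint32
	HiddenSize            uint32
	IntermediateSize      uint32
	NumAttentionHeads     uint32
	NumKeyValueHeads      uint32
	HeadDim               uint32
	RMSNormEps            float32
	RopeTheta             float32
	NumExperts            uint32
	NumExpertsPerToken    uint32
	NormTopKProb          bool
	RopeScalingType       string
	RopeScalingFactor     []float32 // assumption: stored as an array
	MropeSection          []int32
}

// buildKV (hypothetical) assembles the architecture-prefixed keys listed above.
func buildKV(c qwen3Config) map[string]any {
	arch := "qwen3"
	if c.NumExperts > 0 {
		arch = "qwen3moe"
	}
	kv := map[string]any{
		"general.architecture":                    arch,
		arch + ".block_count":                     c.NumHiddenLayers,
		arch + ".context_length":                  c.MaxPositionEmbeddings,
		arch + ".embedding_length":                c.HiddenSize,
		arch + ".feed_forward_length":             c.IntermediateSize,
		arch + ".attention.head_count":            c.NumAttentionHeads,
		arch + ".attention.head_count_kv":         c.NumKeyValueHeads,
		arch + ".attention.key_length":            c.HeadDim,
		arch + ".attention.value_length":          c.HeadDim,
		arch + ".attention.layer_norm_rms_epsilon": c.RMSNormEps,
		arch + ".rope.freq_base":                  c.RopeTheta,
	}
	if c.NumExperts > 0 { // MoE-only keys
		kv[arch+".expert_count"] = c.NumExperts
		kv[arch+".expert_used_count"] = c.NumExpertsPerToken
		kv[arch+".norm_top_k_prob"] = c.NormTopKProb
	}
	switch c.RopeScalingType {
	case "yarn":
		kv[arch+".rope.scaling.type"] = c.RopeScalingType
		kv[arch+".rope.scaling.factor"] = c.RopeScalingFactor
	case "mrope", "default":
		kv[arch+".rope.mrope_section"] = c.MropeSection
	}
	return kv
}

func main() {
	// Placeholder values, chosen only to exercise both branches.
	dense := buildKV(qwen3Config{NumHiddenLayers: 28, HeadDim: 128})
	moe := buildKV(qwen3Config{NumHiddenLayers: 48, NumExperts: 128, NumExpertsPerToken: 8})
	fmt.Println(dense["general.architecture"], moe["general.architecture"])
}
```

Computing the prefix once and concatenating it into every key keeps the dense and MoE paths identical apart from the three expert keys.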

Special Handling

Dynamic Architecture Selection

The GGUF architecture identifier is chosen at conversion time: qwen3moe when the model configuration reports num_experts > 0, and qwen3 for dense models.

Fused Gate-Up Expert Splitting and Transposition

MoE gate_up_exps tensors are split in half along dimension 2 into separate gate and up tensors. Each half is then transposed with the axis permutation (0, 2, 1), which swaps the last two dimensions, and the recorded output shape is updated to match. Overall, one fused tensor of shape [experts, hidden, 2*intermediate] becomes two tensors of shape [experts, intermediate, hidden].

Down Expert Transposition

MoE down_exps tensors are transposed from [experts, intermediate, hidden] to [experts, hidden, intermediate].

QK Normalization

Qwen 3 uses separate Q and K normalization layers, mapped to attn_q_norm and attn_k_norm in GGUF.

M-RoPE Support

When the RoPE scaling type is "mrope" or "default", the mrope_section array (specifying the dimension allocation for temporal, height, and width components) is stored in GGUF metadata.

Implementation Notes

The conversion is implemented in convert/convert_qwen3.go via the qwen3Model struct. The expert splitting uses the splitDim iterator with an afterFunc callback to apply the transposition. This struct also serves as the base type for the Qwen 3 VL multimodal variant. The ropeFactor type from Phi-3 is reused for YaRN scaling factors.

Related Pages

Implementation:Ollama_Ollama_Convert_Qwen3
