Principle:Ollama Ollama GGUF Model Conversion Qwen3
| Knowledge Sources | |
|---|---|
| Domains | Model Conversion, Qwen |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
Qwen 3 conversion handles the Alibaba Qwen 3 architecture in both its standard dense and Mixture-of-Experts (MoE) variants, transforming the model from HuggingFace SafeTensors to GGUF format. Along the way it maps the QK normalization tensors, splits the fused gate-up expert projection, transposes expert tensors, and records YaRN and M-RoPE scaling parameters.
Core Concepts
Tensor Name Mapping
The converter applies the following HuggingFace-to-GGUF tensor name replacements:
- lm_head -> output
- model.embed_tokens -> token_embd
- model.layers -> blk
- model.norm -> output_norm
- self_attn.k_proj -> attn_k
- self_attn.k_norm -> attn_k_norm
- self_attn.v_proj -> attn_v
- self_attn.q_proj -> attn_q
- self_attn.q_norm -> attn_q_norm
- self_attn.o_proj -> attn_output
- mlp.{down,gate,up}_proj -> ffn_{down,gate,up}
- mlp.gate.weight -> ffn_gate_inp.weight (MoE router)
- mlp.experts.down_proj -> ffn_down_exps.weight
- mlp.experts.gate_up_proj -> ffn_gate_up_exps.weight
- post_attention_layernorm -> ffn_norm
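The mapping above can be sketched as a chain of substring replacements. This is a minimal illustration, assuming the pairs are applied as plain substring substitutions via the standard library's strings.NewReplacer; the names replacements and ggufName are hypothetical and are not taken from the converter source.

```go
package main

import (
	"fmt"
	"strings"
)

// replacements lists old/new pairs in the order of the mapping above.
// Assumption: each pair is a plain substring substitution.
var replacements = []string{
	"lm_head", "output",
	"model.embed_tokens", "token_embd",
	"model.layers", "blk",
	"model.norm", "output_norm",
	"self_attn.k_proj", "attn_k",
	"self_attn.k_norm", "attn_k_norm",
	"self_attn.v_proj", "attn_v",
	"self_attn.q_proj", "attn_q",
	"self_attn.q_norm", "attn_q_norm",
	"self_attn.o_proj", "attn_output",
	"mlp.down_proj", "ffn_down",
	"mlp.gate_proj", "ffn_gate",
	"mlp.up_proj", "ffn_up",
	"mlp.gate.weight", "ffn_gate_inp.weight", // MoE router
	"mlp.experts.down_proj", "ffn_down_exps.weight",
	"mlp.experts.gate_up_proj", "ffn_gate_up_exps.weight",
	"post_attention_layernorm", "ffn_norm",
}

// ggufName rewrites a HuggingFace tensor name into its GGUF form.
// (Hypothetical helper, for illustration only.)
func ggufName(hfName string) string {
	return strings.NewReplacer(replacements...).Replace(hfName)
}

func main() {
	for _, name := range []string{
		"model.embed_tokens.weight",
		"model.layers.0.self_attn.q_norm.weight",
		"model.layers.3.mlp.experts.gate_up_proj",
	} {
		fmt.Printf("%s -> %s\n", name, ggufName(name))
	}
}
```

Because strings.NewReplacer scans the input once and never rescans its own output, a replaced attn_k is not matched again by a later pair.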
Architecture-Specific Hyperparameters
The GGUF metadata uses a dynamic architecture prefix (qwen3 for dense, qwen3moe for MoE):
- block_count, context_length, embedding_length, feed_forward_length
- attention.head_count, head_count_kv
- attention.key_length / value_length -- explicit head dimension
- attention.layer_norm_rms_epsilon -- RMSNorm epsilon
- rope.freq_base -- RoPE theta
MoE parameters (when num_experts > 0):
- expert_count, expert_used_count
- norm_top_k_prob -- whether to normalize top-K probabilities
RoPE scaling:
- rope.scaling.type -- "yarn" for YaRN
- rope.scaling.factor -- scaling factor array
- rope.mrope_section -- M-RoPE section sizes (for "mrope"/"default" types)
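As a rough sketch of how the architecture prefix and the MoE-only keys fit together, the following builds a key/value map from a trimmed-down configuration. The config struct, its field names, and the metadata function are hypothetical stand-ins; general.architecture is the standard GGUF key for the architecture identifier, and the numeric values in main are illustrative only.

```go
package main

import "fmt"

// config is a hypothetical, trimmed-down view of the model's config.json.
// Field names are illustrative and are not taken from convert_qwen3.go.
type config struct {
	NumHiddenLayers       uint32
	MaxPositionEmbeddings uint32
	HiddenSize            uint32
	IntermediateSize      uint32
	NumAttentionHeads     uint32
	NumKeyValueHeads      uint32
	HeadDim               uint32
	RMSNormEps            float32
	RopeTheta             float32
	NumExperts            uint32
	NumExpertsPerTok      uint32
	NormTopKProb          bool
}

// metadata builds the architecture-prefixed key/value pairs listed above.
func metadata(c config) map[string]any {
	arch := "qwen3" // dense
	if c.NumExperts > 0 {
		arch = "qwen3moe" // Mixture-of-Experts
	}
	kv := map[string]any{
		"general.architecture":                    arch,
		arch + ".block_count":                     c.NumHiddenLayers,
		arch + ".context_length":                  c.MaxPositionEmbeddings,
		arch + ".embedding_length":                c.HiddenSize,
		arch + ".feed_forward_length":             c.IntermediateSize,
		arch + ".attention.head_count":            c.NumAttentionHeads,
		arch + ".attention.head_count_kv":         c.NumKeyValueHeads,
		arch + ".attention.key_length":            c.HeadDim,
		arch + ".attention.value_length":          c.HeadDim,
		arch + ".attention.layer_norm_rms_epsilon": c.RMSNormEps,
		arch + ".rope.freq_base":                  c.RopeTheta,
	}
	if c.NumExperts > 0 {
		// MoE-only keys are written only when experts are configured.
		kv[arch+".expert_count"] = c.NumExperts
		kv[arch+".expert_used_count"] = c.NumExpertsPerTok
		kv[arch+".norm_top_k_prob"] = c.NormTopKProb
	}
	return kv
}

func main() {
	// Illustrative values only, not a real checkpoint's configuration.
	kv := metadata(config{NumHiddenLayers: 4, HeadDim: 128, NumExperts: 8, NumExpertsPerTok: 2})
	fmt.Println(kv["general.architecture"], kv["qwen3moe.expert_count"])
}
```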
Special Handling
Dynamic Architecture Selection
The GGUF architecture identifier is dynamically set based on whether MoE parameters are present: qwen3 for dense models, qwen3moe for MoE variants.
Fused Gate-Up Expert Splitting and Transposition
MoE gate_up_exps tensors are split along dimension 2 into separate gate and up tensors. Each half is then transposed with the axis permutation (0, 2, 1), which swaps the last two dimensions, and the output shape is adjusted to match. The fused [experts, hidden, 2*intermediate] tensor thus becomes two [experts, intermediate, hidden] tensors, one for gate and one for up.
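The split and transposition can be illustrated on a flat, row-major float32 slice. This is a minimal sketch, not the converter's code: splitLast and transposeLast2 are hypothetical helpers, and the sketch assumes the gate half occupies the first intermediate columns of the fused tensor and the up half the rest.

```go
package main

import "fmt"

// splitLast splits a row-major tensor of shape [e, h, 2*n] along its last
// dimension into two tensors of shape [e, h, n].
// Assumption: gate is stored before up in the fused tensor.
func splitLast(data []float32, e, h, n int) (gate, up []float32) {
	gate = make([]float32, 0, e*h*n)
	up = make([]float32, 0, e*h*n)
	for row := 0; row < e*h; row++ {
		base := row * 2 * n
		gate = append(gate, data[base:base+n]...)
		up = append(up, data[base+n:base+2*n]...)
	}
	return gate, up
}

// transposeLast2 permutes a row-major [e, r, c] tensor to [e, c, r],
// i.e. the axis permutation (0, 2, 1).
func transposeLast2(data []float32, e, r, c int) []float32 {
	out := make([]float32, len(data))
	for x := 0; x < e; x++ {
		for i := 0; i < r; i++ {
			for j := 0; j < c; j++ {
				out[x*r*c+j*r+i] = data[x*r*c+i*c+j]
			}
		}
	}
	return out
}

func main() {
	// One expert, hidden = 2, intermediate = 3: fused shape [1, 2, 6].
	fused := make([]float32, 12)
	for i := range fused {
		fused[i] = float32(i)
	}
	gate, up := splitLast(fused, 1, 2, 3)       // each [1, 2, 3]
	fmt.Println(transposeLast2(gate, 1, 2, 3)) // [1, 3, 2]: [0 6 1 7 2 8]
	fmt.Println(transposeLast2(up, 1, 2, 3))   // [1, 3, 2]: [3 9 4 10 5 11]
}
```

The same transposeLast2 helper, called with r = intermediate and c = hidden, would cover the [experts, intermediate, hidden] to [experts, hidden, intermediate] reordering applied to down_exps.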
Down Expert Transposition
MoE down_exps tensors are transposed from [experts, intermediate, hidden] to [experts, hidden, intermediate].
QK Normalization
Qwen 3 uses separate Q and K normalization layers, mapped to attn_q_norm and attn_k_norm in GGUF.
M-RoPE Support
When the RoPE scaling type is "mrope" or "default", the mrope_section array (specifying the dimension allocation for temporal, height, and width components) is stored in GGUF metadata.
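The branching on the RoPE scaling type might look roughly like the sketch below. The ropeScaling struct and ropeMetadata function are hypothetical, the section sizes in main are illustrative, and writing the factor only in the "yarn" branch is an assumption of this sketch rather than something the source states.

```go
package main

import "fmt"

// ropeScaling is a hypothetical view of the rope_scaling block in config.json.
type ropeScaling struct {
	Type         string
	Factor       []float32 // scaling factor array
	MropeSection []int32   // temporal, height, width allocations
}

// ropeMetadata returns the RoPE-related keys for the given architecture
// prefix ("qwen3" or "qwen3moe").
func ropeMetadata(arch string, rs ropeScaling) map[string]any {
	kv := map[string]any{}
	switch rs.Type {
	case "yarn":
		kv[arch+".rope.scaling.type"] = "yarn"
		// Assumption: the factor is stored together with the yarn type.
		kv[arch+".rope.scaling.factor"] = rs.Factor
	case "mrope", "default":
		kv[arch+".rope.mrope_section"] = rs.MropeSection
	}
	return kv
}

func main() {
	// Illustrative section sizes, not taken from a specific checkpoint.
	kv := ropeMetadata("qwen3", ropeScaling{Type: "mrope", MropeSection: []int32{24, 20, 20}})
	fmt.Println(kv["qwen3.rope.mrope_section"])
}
```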
Implementation Notes
The conversion is implemented in convert/convert_qwen3.go via the qwen3Model struct. The expert splitting uses the splitDim iterator with an afterFunc callback to apply the transposition. This struct also serves as the base type for the Qwen 3 VL multimodal variant. The ropeFactor type from Phi-3 is reused for YaRN scaling factors.