Principle:Ollama Ollama GGUF Model Conversion Qwen3Next

Knowledge Sources	Ollama
Domains	Model Conversion, Qwen
Last Updated	2025-02-15 00:00 GMT

Overview

Qwen3-Next conversion handles a hybrid architecture that combines full attention layers with Gated Delta Net linear attention (recurrent) layers, Mixture-of-Experts with shared experts, partial rotary embeddings, and convolution operations. The converter transforms the complete model from HuggingFace SafeTensors to GGUF format, which requires extensive tensor splitting and transformation.

Core Concepts

Tensor Name Mapping

The converter applies the following HuggingFace-to-GGUF tensor name replacements:

Embeddings and output:

lm_head -> output
model.embed_tokens -> token_embd
model.norm -> output_norm
model.layers -> blk

Full attention:

self_attn.{q,k,v}_proj -> attn_{q,k,v}
self_attn.{q,k}_norm -> attn_{q,k}_norm
self_attn.o_proj -> attn_output

Linear attention (Gated Delta Net):

linear_attn.in_proj_qkvz -> ssm_in (then split)
linear_attn.in_proj_ba -> ssm_ba
linear_attn.conv1d -> ssm_conv1d
linear_attn.dt_bias / dt_proj -> ssm_dt
linear_attn.A_log -> ssm_a
linear_attn.norm -> ssm_norm
linear_attn.out_proj -> ssm_out

MoE:

mlp.gate.weight -> ffn_gate_inp.weight
mlp.shared_expert.{down,gate,up}_proj -> ffn_{down,gate,up}_shexp
mlp.shared_expert_gate -> ffn_gate_inp_shexp
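
The replacements above can be read as an ordered list of substring substitutions applied to each tensor name. The following is a minimal sketch under that assumption. REPLACEMENTS and rename are hypothetical names, only a subset of the pairs is shown, and the actual converter may order or apply them differently.

```python
# Hypothetical subset of the replacement pairs listed above, applied in order.
REPLACEMENTS = [
    ("lm_head", "output"),
    ("model.embed_tokens", "token_embd"),
    ("model.norm", "output_norm"),
    ("model.layers", "blk"),
    ("self_attn.q_proj", "attn_q"),
    ("self_attn.o_proj", "attn_output"),
    ("linear_attn.A_log", "ssm_a"),
    ("mlp.shared_expert_gate", "ffn_gate_inp_shexp"),
    ("mlp.gate.weight", "ffn_gate_inp.weight"),
]


def rename(name: str) -> str:
    """Apply every (old, new) substring replacement to a tensor name."""
    for old, new in REPLACEMENTS:
        name = name.replace(old, new)
    return name


print(rename("model.layers.3.self_attn.q_proj.weight"))
```

Under these assumptions, a name such as model.layers.3.self_attn.q_proj.weight is first rewritten by the model.layers rule and then by the self_attn.q_proj rule.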

Architecture-Specific Hyperparameters

The GGUF metadata is written under the qwen3next architecture:

Core:

block_count, context_length, embedding_length, feed_forward_length
attention.head_count, key_length, value_length
attention.layer_norm_rms_epsilon
attention.head_count_kv -- per-layer array (0 for recurrent, num_kv_heads for attention layers)
full_attention_interval -- how often full attention layers occur
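
The per-layer attention.head_count_kv array can be illustrated with a short sketch. It assumes that layer i (0-based) is a full attention layer when (i + 1) is a multiple of full_attention_interval. That placement rule and the function name are assumptions made for illustration, not something this page states.

```python
def head_count_kv_per_layer(block_count: int, interval: int, num_kv_heads: int) -> list[int]:
    """Return one KV head count per layer: 0 for recurrent layers."""
    # Assumption: every `interval`-th layer (counting from 1) is full attention.
    return [
        num_kv_heads if (i + 1) % interval == 0 else 0
        for i in range(block_count)
    ]


print(head_count_kv_per_layer(block_count=8, interval=4, num_kv_heads=2))
```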

RoPE:

rope.freq_base, rope.dimension_count (partial rotary)
rope.scaling.type, rope.scaling.factor

SSM/Linear attention:

ssm.inner_size -- value_head_dim * num_value_heads
ssm.state_size -- key head dimension
ssm.group_count -- number of key heads
ssm.time_step_rank -- number of value heads
ssm.conv_kernel -- convolution kernel dimension
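
As a rough illustration of how the SSM values above could be derived, the sketch below maps a configuration dictionary to the listed keys. The input field names (linear_value_head_dim and so on) are assumed to follow the HuggingFace configuration for this model family, the numbers are illustrative rather than taken from a specific checkpoint, and the architecture prefix on the keys is omitted.

```python
def ssm_metadata(cfg: dict) -> dict:
    """Derive the SSM / linear attention metadata values from a config dict."""
    return {
        "ssm.inner_size": cfg["linear_value_head_dim"] * cfg["linear_num_value_heads"],
        "ssm.state_size": cfg["linear_key_head_dim"],
        "ssm.group_count": cfg["linear_num_key_heads"],
        "ssm.time_step_rank": cfg["linear_num_value_heads"],
        "ssm.conv_kernel": cfg["linear_conv_kernel_dim"],
    }


# Illustrative values only.
example_cfg = {
    "linear_value_head_dim": 128,
    "linear_num_value_heads": 32,
    "linear_key_head_dim": 128,
    "linear_num_key_heads": 16,
    "linear_conv_kernel_dim": 4,
}
print(ssm_metadata(example_cfg))
```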

MoE:

expert_count, expert_used_count, norm_top_k_prob
expert_feed_forward_length, expert_shared_feed_forward_length

Special Handling

QKVZ Tensor Splitting

The fused ssm_in tensor (containing Q, K, V, and Z gate projections) is split into two output tensors:

attn_qkv -- concatenation of Q, K, V projections
attn_gate -- Z gate projection

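The split logic reshapes the fused tensor to [hidden, num_k_heads, qkvz_dim], where qkvz_dim is the sum of the four per-head slice widths. Within each key head it slices the components (Q: head_k_dim, K: head_k_dim, V: v_per_head, Z: v_per_head), reshapes Q, K, and V into contiguous blocks, and concatenates them in the order Q+K+V. The Z component becomes the gate tensor.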
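
The regrouping described above can be sketched on a single hidden row. This is a minimal illustration, assuming the fused axis is laid out head by head as [Q, K, V, Z]. The function name is hypothetical, and the real converter operates on whole tensors rather than Python lists.

```python
def split_qkvz_row(row, num_k_heads, head_k_dim, v_per_head):
    """Split one fused row into (qkv, gate), regrouping per-head slices."""
    qkvz_dim = 2 * head_k_dim + 2 * v_per_head
    assert len(row) == num_k_heads * qkvz_dim
    q, k, v, z = [], [], [], []
    for h in range(num_k_heads):
        base = h * qkvz_dim
        k_start = base + head_k_dim
        v_start = k_start + head_k_dim
        z_start = v_start + v_per_head
        q += row[base:k_start]
        k += row[k_start:v_start]
        v += row[v_start:z_start]
        z += row[z_start:base + qkvz_dim]
    # Q, K, V become contiguous blocks; Z is returned separately as the gate.
    return q + k + v, z


row = ["q0", "k0", "v0a", "v0b", "z0a", "z0b",
       "q1", "k1", "v1a", "v1b", "z1a", "z1b"]
qkv, gate = split_qkvz_row(row, num_k_heads=2, head_k_dim=1, v_per_head=2)
print(qkv)
print(gate)
```

With two key heads, the interleaved per-head layout is regrouped so that all Q entries come first, then all K entries, then all V entries.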

A_log to SSM_A Transformation

The linear_attn.A_log tensor is stored as -exp(A_log). Computing the negated exponential once at conversion time means the runtime does not have to recompute it.
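
The transformation is elementwise. A minimal sketch on a flat list of values, with a hypothetical function name:

```python
import math


def ssm_a_from_a_log(a_log):
    """Return -exp(x) for every element of the A_log tensor."""
    return [-math.exp(x) for x in a_log]


values = ssm_a_from_a_log([0.0, 1.0, -2.0])
print(values)
```

Because exp is always positive, every resulting value is strictly negative.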

Normalization Weight Offset

All *_norm.weight tensors (except ssm_norm) have 1.0 added to their values, following the Gemma convention where norm weights are zero-centered.
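
The offset can be sketched as a name-based rule. The example assumes the rule matches on the tensor name suffix. The function name and the exact matching logic are illustrative and may differ from the converter.

```python
def adjust_norm_weight(name: str, weights: list[float]) -> list[float]:
    """Add 1.0 to zero-centered norm weights, leaving ssm_norm untouched."""
    # Assumption: norm tensors are identified by their name suffix.
    is_norm = name.endswith("_norm.weight")
    is_ssm_norm = name.endswith("ssm_norm.weight")
    if is_norm and not is_ssm_norm:
        return [w + 1.0 for w in weights]
    return weights


print(adjust_norm_weight("blk.0.attn_q_norm.weight", [0.0, -0.5]))
print(adjust_norm_weight("blk.0.ssm_norm.weight", [0.0, -0.5]))
```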

Conv1d Weight Squeezing

Convolution weights with 3D shapes containing a singleton dimension ([1, D, K] or [D, 1, K]) are squeezed to 2D [D, K].

Shared Expert Gate Squeezing

The ffn_gate_inp_shexp weight with shape [D, 1] or [1, D] is squeezed to 1D [D].
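
Both the conv1d squeeze and the shared expert gate squeeze only drop singleton dimensions, so the element order is unchanged and only the recorded shape differs. A shape-level sketch with hypothetical helper names:

```python
def squeeze_conv1d_shape(shape):
    """[1, D, K] or [D, 1, K] -> [D, K]; other shapes pass through."""
    if len(shape) == 3 and shape[0] == 1:
        return (shape[1], shape[2])
    if len(shape) == 3 and shape[1] == 1:
        return (shape[0], shape[2])
    return tuple(shape)


def squeeze_shared_gate_shape(shape):
    """[D, 1] or [1, D] -> [D]; other shapes pass through."""
    if len(shape) == 2 and 1 in shape:
        d = shape[0] if shape[1] == 1 else shape[1]
        return (d,)
    return tuple(shape)


print(squeeze_conv1d_shape((8, 1, 4)))
print(squeeze_shared_gate_shape((1, 2048)))
```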

Expert Tensor Merging

Individual expert tensors are merged into stacked tensors for gate, up, and down projections per layer.
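
A minimal sketch of the merge, grouping per-expert tensors by layer and projection and stacking them in expert order. The input name pattern (blk.N.mlp.experts.E.{gate,up,down}_proj.weight) and the output names (ffn_{gate,up,down}_exps) are assumptions for illustration. This page does not state the merged tensor names.

```python
import re
from collections import defaultdict

# Assumed per-expert name pattern after the model.layers -> blk rename.
EXPERT_RE = re.compile(r"^blk\.(\d+)\.mlp\.experts\.(\d+)\.(gate|up|down)_proj\.weight$")


def merge_experts(tensors: dict) -> dict:
    """Stack per-expert tensors into one tensor per (layer, projection)."""
    groups = defaultdict(dict)
    merged = {}
    for name, data in tensors.items():
        m = EXPERT_RE.match(name)
        if m is None:
            merged[name] = data  # not an expert tensor, keep as-is
            continue
        layer, expert, proj = int(m.group(1)), int(m.group(2)), m.group(3)
        groups[(layer, proj)][expert] = data
    for (layer, proj), by_expert in groups.items():
        # Sort numerically so that expert 10 follows expert 9, not expert 1.
        stacked = [by_expert[e] for e in sorted(by_expert)]
        merged[f"blk.{layer}.ffn_{proj}_exps.weight"] = stacked
    return merged


example = {
    "blk.0.mlp.experts.1.gate_proj.weight": [[3]],
    "blk.0.mlp.experts.0.gate_proj.weight": [[1]],
    "blk.0.attn_q.weight": [[9]],
}
out = merge_experts(example)
print(sorted(out))
```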

Partial Rotary Factor

Only a fraction of the head dimension uses RoPE, determined by partial_rotary_factor. The rope.dimension_count is computed as head_dim * partial_rotary_factor.
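
As a worked example of the formula, the sketch below computes the rotary dimension count. Truncating to an integer is an assumption, and the input values are illustrative.

```python
def rope_dimension_count(head_dim: int, partial_rotary_factor: float) -> int:
    """Number of head dimensions that receive rotary position embeddings."""
    # Assumption: the product is truncated to an integer.
    return int(head_dim * partial_rotary_factor)


print(rope_dimension_count(256, 0.25))
```

With head_dim = 256 and partial_rotary_factor = 0.25, only 64 of the 256 dimensions per head are rotated.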

Strict Validation

The parseMore method performs extensive validation of config parameters, ensuring required fields are present and that the full_attention_interval produces at least one full attention layer.
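
The kind of check described above can be sketched as follows. This is not the parseMore implementation. The field names, the list of required fields, and the rule that layer i is full attention when (i + 1) is a multiple of the interval are assumptions for illustration.

```python
# Hypothetical list of required fields; the real converter checks more.
REQUIRED_FIELDS = ["num_hidden_layers", "full_attention_interval"]


def config_errors(cfg: dict) -> list[str]:
    """Return a list of human-readable problems; empty means the config passes."""
    errors = [f"missing required field: {key}" for key in REQUIRED_FIELDS if key not in cfg]
    if errors:
        return errors
    layers = cfg["num_hidden_layers"]
    interval = cfg["full_attention_interval"]
    if interval <= 0:
        return ["full_attention_interval must be positive"]
    # Assumed placement rule: every `interval`-th layer is full attention.
    if not any((i + 1) % interval == 0 for i in range(layers)):
        errors.append("full_attention_interval yields no full attention layer")
    return errors


print(config_errors({"num_hidden_layers": 3, "full_attention_interval": 4}))
```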

Implementation Notes

The conversion is implemented in convert/convert_qwen3next.go via the qwen3NextModel struct. The qkvzSplitSpec struct encapsulates the dimension calculations for the QKVZ split. The tokenizer pre-processor is set to qwen2. This is one of the most complex converters due to the hybrid attention-recurrence architecture and extensive tensor transformations.

Related Pages

Implementation:Ollama_Ollama_Convert_Qwen3Next
