Principle:Ollama Ollama GGUF Model Conversion GptOss

Knowledge Sources	Ollama
Domains	Model Conversion, GPT
Last Updated	2025-02-15 00:00 GMT

Overview

GPT-OSS conversion targets an open-source GPT variant whose architecture combines SwiGLU activations, YaRN RoPE scaling, sliding window attention, and Mixture-of-Experts layers with MXFP4-quantized expert weights. The converter accepts both HuggingFace-flavored and native checkpoints and transforms either one into GGUF.

Core Concepts

Tensor Name Mapping

The converter supports two naming schemes depending on the model flavor:

HuggingFace flavor (when max_position_embeddings > 0):

lm_head -> output
model.embed_tokens -> token_embd
model.layers -> blk
model.norm -> output_norm
self_attn.{q,k,v}_proj -> attn_{q,k,v}
self_attn.o_proj -> attn_out
self_attn.sinks -> attn_sinks
mlp.router -> ffn_gate_inp
mlp.experts.gate_up_proj_ -> ffn_gate_up_exps.
mlp.experts.down_proj_ -> ffn_down_exps.

Native flavor:

block -> blk
embedding -> token_embd
unembedding -> output
attn.qkv -> attn_qkv
mlp.gate -> ffn_gate_inp
mlp.mlp1_ -> ffn_gate_up_exps.
mlp.mlp2_ -> ffn_down_exps.
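
The two mapping tables above can be sketched with the standard library's strings.NewReplacer. This is a minimal illustration under stated assumptions, not the converter's actual code: renameTensor and its huggingFace flag are hypothetical names, and the {q,k,v} shorthand is expanded by hand.

```go
package main

import (
	"fmt"
	"strings"
)

// hfReplacer mirrors the HuggingFace-flavor table (old, new pairs).
var hfReplacer = strings.NewReplacer(
	"lm_head", "output",
	"model.embed_tokens", "token_embd",
	"model.layers", "blk",
	"model.norm", "output_norm",
	"self_attn.q_proj", "attn_q",
	"self_attn.k_proj", "attn_k",
	"self_attn.v_proj", "attn_v",
	"self_attn.o_proj", "attn_out",
	"self_attn.sinks", "attn_sinks",
	"mlp.router", "ffn_gate_inp",
	"mlp.experts.gate_up_proj_", "ffn_gate_up_exps.",
	"mlp.experts.down_proj_", "ffn_down_exps.",
)

// nativeReplacer mirrors the native-flavor table.
// "unembedding" is listed before "embedding" so the longer key is tried first.
var nativeReplacer = strings.NewReplacer(
	"block", "blk",
	"unembedding", "output",
	"embedding", "token_embd",
	"attn.qkv", "attn_qkv",
	"mlp.gate", "ffn_gate_inp",
	"mlp.mlp1_", "ffn_gate_up_exps.",
	"mlp.mlp2_", "ffn_down_exps.",
)

// renameTensor (hypothetical helper) picks a table by flavor and rewrites one name.
func renameTensor(name string, huggingFace bool) string {
	if huggingFace {
		return hfReplacer.Replace(name)
	}
	return nativeReplacer.Replace(name)
}

func main() {
	fmt.Println(renameTensor("model.layers.0.self_attn.q_proj.weight", true))
	fmt.Println(renameTensor("block.0.attn.qkv.weight", false))
}
```

Note that the trailing "_" in keys such as mlp.mlp1_ is replaced by ".", so a suffix like "bias" becomes its own dotted component in the GGUF name.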

Architecture-Specific Hyperparameters

The GGUF metadata is written under the gptoss.* namespace:

gptoss.context_length -- derived from max_position_embeddings or rope_scaling_factor * initial_context_length
gptoss.expert_count -- from num_experts or num_local_experts
gptoss.expert_used_count -- experts per token
gptoss.attention.key_length / value_length -- explicit head dimension
gptoss.attention.sliding_window -- sliding window size
gptoss.rope.freq_base -- RoPE theta
gptoss.rope.scaling.factor / original_context_length -- YaRN scaling
general.file_type -- set to 4
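
The two fallback rules in the list above (context length and expert count) can be sketched as follows. The function names, parameter names, and the "first non-zero value wins" precedence are assumptions for illustration; the page only states which config fields are involved.

```go
package main

import "fmt"

// contextLength returns max_position_embeddings when it is set, and otherwise
// falls back to rope_scaling_factor * initial_context_length.
func contextLength(maxPositionEmbeddings uint32, ropeScalingFactor float32, initialContextLength uint32) uint32 {
	if maxPositionEmbeddings > 0 {
		return maxPositionEmbeddings
	}
	return uint32(ropeScalingFactor * float32(initialContextLength))
}

// expertCount returns num_experts when set, otherwise num_local_experts.
// The precedence between the two fields is an assumption.
func expertCount(numExperts, numLocalExperts uint32) uint32 {
	if numExperts > 0 {
		return numExperts
	}
	return numLocalExperts
}

func main() {
	// Illustrative values only, not taken from a real checkpoint.
	kv := map[string]any{
		"gptoss.context_length": contextLength(0, 32, 4096),
		"gptoss.expert_count":   expertCount(0, 32),
		"general.file_type":     uint32(4),
	}
	fmt.Println(kv["gptoss.context_length"], kv["gptoss.expert_count"], kv["general.file_type"])
}
```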

Special Handling

MXFP4 Expert Weight Handling

Expert weights may arrive in MXFP4 (Microscaling FP4) format as separate .blocks and .scales tensors. The converter pairs each .blocks tensor with its matching .scales tensor, rearranges the packed nibbles byte by byte (interleaving high and low 4-bit values), concatenates scales with blocks along dimension 3, and emits the result as TensorTypeMXFP4.
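
A nibble rearrangement of this kind can be sketched on a single 16-byte block (32 packed 4-bit values). The exact permutation below, which pairs byte j with byte j+8 and groups their low and high nibbles, is an assumption chosen for illustration; interleaveNibbles is a hypothetical name, and convert/convert_gptoss.go remains the authority on the real layout.

```go
package main

import "fmt"

// interleaveNibbles rewrites one 16-byte block in place.
// For each pair (a, b) = (block[j], block[j+8]) it emits two bytes:
// one holding both low nibbles, one holding both high nibbles.
func interleaveNibbles(block *[16]byte) {
	var tmp [16]byte
	for j := 0; j < 8; j++ {
		a, b := block[j], block[j+8]
		tmp[2*j] = (a & 0x0F) | (b << 4)   // low nibble of a, low nibble of b
		tmp[2*j+1] = (a >> 4) | (b & 0xF0) // high nibble of a, high nibble of b
	}
	*block = tmp
}

func main() {
	blk := [16]byte{0: 0x21, 8: 0x43}
	interleaveNibbles(&blk)
	fmt.Printf("% x\n", blk[:2])
}
```

The transformation only moves nibbles around, so the block keeps its size and no 4-bit value is lost.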

Gate-Up Expert Splitting

Interleaved gate_up_exps tensors are split into separate gate_exps and up_exps tensors by taking every other index along the interleaved dimension within each expert (even indices for gate, odd indices for up). The same split applies to MXFP4 tensors, regular float tensors, and bias tensors.
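
The even/odd split can be sketched on a flat slice. splitEvenOdd is a hypothetical helper; which tensor axis the real converter strides over, and how it expresses the stride, is not shown here.

```go
package main

import "fmt"

// splitEvenOdd returns the elements at even indices and at odd indices,
// preserving their relative order.
func splitEvenOdd[T any](xs []T) (even, odd []T) {
	for i, x := range xs {
		if i%2 == 0 {
			even = append(even, x)
		} else {
			odd = append(odd, x)
		}
	}
	return even, odd
}

func main() {
	// Placeholder labels standing in for interleaved gate/up entries.
	rows := []string{"gate0", "up0", "gate1", "up1"}
	gate, up := splitEvenOdd(rows)
	fmt.Println(gate, up)
}
```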

Custom Token IDs

The tokenizer sets specific token IDs:

BOS: 199998 (<|startoftext|>)
EOS: 199999 (<|endoftext|>)
Additional EOS tokens: 200002 (<|return|>), 200012 (<|call|>)
Both add_bos_token and add_eos_token are set to false.

Implementation Notes

The conversion is implemented in convert/convert_gptoss.go via the gptossModel struct. The mxfp4 struct implements io.WriterTo for custom serialization of MXFP4 tensors. Dual-flavor support is a conditional in Replacements() that selects the HuggingFace mapping when HuggingFace-style config keys are present (max_position_embeddings > 0) and the native mapping otherwise.

Related Pages

Implementation:Ollama_Ollama_Convert_GptOss
