

Principle:Ollama Ollama GGUF Model Conversion GptOss

From Leeroopedia
Revision as of 18:15, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/Ollama_Ollama_GGUF_Model_Conversion_GptOss.md)
Knowledge Sources
Domains: Model Conversion, GPT
Last Updated: 2025-02-15 00:00 GMT

Overview

The GPT-OSS converter transforms an open-source GPT variant into GGUF. The architecture features SwiGLU activations, YaRN RoPE scaling, Mixture-of-Experts layers with MXFP4-quantized expert weights, and sliding window attention. The converter accepts both HuggingFace-flavored and native model formats.

Core Concepts

Tensor Name Mapping

The converter supports two naming schemes depending on the model flavor:

HuggingFace flavor (when max_position_embeddings > 0):

  • lm_head -> output
  • model.embed_tokens -> token_embd
  • model.layers -> blk
  • model.norm -> output_norm
  • self_attn.{q,k,v}_proj -> attn_{q,k,v}
  • self_attn.o_proj -> attn_out
  • self_attn.sinks -> attn_sinks
  • mlp.router -> ffn_gate_inp
  • mlp.experts.gate_up_proj_ -> ffn_gate_up_exps.
  • mlp.experts.down_proj_ -> ffn_down_exps.

Native flavor:

  • block -> blk
  • embedding -> token_embd
  • unembedding -> output
  • attn.qkv -> attn_qkv
  • mlp.gate -> ffn_gate_inp
  • mlp.mlp1_ -> ffn_gate_up_exps.
  • mlp.mlp2_ -> ffn_down_exps.
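
The two mappings above can be expressed as ordered old/new string pairs. The following is a minimal sketch, not the converter's actual code: the names hfReplacements, nativeReplacements, and ggufName are hypothetical, and the use of strings.NewReplacer is an assumption made for illustration. The {q,k,v} shorthand is expanded into three explicit pairs.

```go
package main

import (
	"fmt"
	"strings"
)

// hfReplacements holds old/new pairs for the HuggingFace flavor,
// copied from the mapping listed in the text.
var hfReplacements = []string{
	"lm_head", "output",
	"model.embed_tokens", "token_embd",
	"model.layers", "blk",
	"model.norm", "output_norm",
	"self_attn.q_proj", "attn_q",
	"self_attn.k_proj", "attn_k",
	"self_attn.v_proj", "attn_v",
	"self_attn.o_proj", "attn_out",
	"self_attn.sinks", "attn_sinks",
	"mlp.router", "ffn_gate_inp",
	"mlp.experts.gate_up_proj_", "ffn_gate_up_exps.",
	"mlp.experts.down_proj_", "ffn_down_exps.",
}

// nativeReplacements holds old/new pairs for the native flavor.
var nativeReplacements = []string{
	"block", "blk",
	"unembedding", "output",
	"embedding", "token_embd",
	"attn.qkv", "attn_qkv",
	"mlp.gate", "ffn_gate_inp",
	"mlp.mlp1_", "ffn_gate_up_exps.",
	"mlp.mlp2_", "ffn_down_exps.",
}

// ggufName (hypothetical helper) rewrites a source tensor name into its
// GGUF name using the pair list for the selected flavor.
func ggufName(name string, huggingFace bool) string {
	pairs := nativeReplacements
	if huggingFace {
		pairs = hfReplacements
	}
	return strings.NewReplacer(pairs...).Replace(name)
}

func main() {
	fmt.Println(ggufName("model.layers.0.self_attn.q_proj.weight", true))
	fmt.Println(ggufName("block.0.attn.qkv.weight", false))
}
```

Note that the trailing underscore in patterns such as mlp.experts.gate_up_proj_ is replaced by a dot, so a suffix like blocks or scales becomes its own name component.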

Architecture-Specific Hyperparameters

The GGUF metadata is written under the gptoss.* namespace:

  • gptoss.context_length -- max_position_embeddings when it is set (HuggingFace flavor); otherwise rope_scaling_factor * initial_context_length
  • gptoss.expert_count -- from num_experts or num_local_experts
  • gptoss.expert_used_count -- experts per token
  • gptoss.attention.key_length / value_length -- explicit head dimension
  • gptoss.attention.sliding_window -- sliding window size
  • gptoss.rope.freq_base -- RoPE theta
  • gptoss.rope.scaling.factor / original_context_length -- YaRN scaling
  • general.file_type -- set to 4
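
The context length derivation can be sketched as follows. The config struct and its field names are hypothetical stand-ins for the parsed model configuration, and the numbers in main are illustrative values, not taken from any specific checkpoint.

```go
package main

import "fmt"

// config is a hypothetical subset of the parsed model configuration.
type config struct {
	MaxPositionEmbeddings uint32  // set by the HuggingFace flavor
	RopeScalingFactor     float32 // YaRN scaling factor
	InitialContextLength  uint32  // context length before YaRN scaling
}

// contextLength returns the value that would be written to
// gptoss.context_length: the explicit maximum when present, otherwise
// the scaled initial context length.
func contextLength(c config) uint32 {
	if c.MaxPositionEmbeddings > 0 {
		return c.MaxPositionEmbeddings
	}
	return uint32(c.RopeScalingFactor * float32(c.InitialContextLength))
}

func main() {
	hf := config{MaxPositionEmbeddings: 131072}
	native := config{RopeScalingFactor: 32, InitialContextLength: 4096}
	fmt.Println(contextLength(hf), contextLength(native))
}
```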

Special Handling

MXFP4 Expert Weight Handling

Expert weights may arrive in MXFP4 (Microscaling FP4) format as separate .blocks and .scales tensors. The converter pairs the two tensors, applies a byte-level transformation that rearranges nibbles (interleaving the high and low 4-bit values), concatenates the scales with the blocks along dimension 3, and writes the result as TensorTypeMXFP4.
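
The pairing step alone can be sketched as grouping tensor names by their shared prefix. This is an illustration under stated assumptions: mxfp4Pair and pairMXFP4 are hypothetical names, and the sketch covers only name matching, not the nibble rearrangement or the concatenation.

```go
package main

import (
	"fmt"
	"strings"
)

// mxfp4Pair records the two tensor names that together make up one
// MXFP4 weight.
type mxfp4Pair struct {
	Blocks string
	Scales string
}

// complete reports whether both halves of the pair were seen.
func (p mxfp4Pair) complete() bool {
	return p.Blocks != "" && p.Scales != ""
}

// pairMXFP4 (hypothetical helper) groups names ending in .blocks or
// .scales by their common prefix and returns all other names untouched.
func pairMXFP4(names []string) (pairs map[string]mxfp4Pair, rest []string) {
	pairs = make(map[string]mxfp4Pair)
	for _, name := range names {
		switch {
		case strings.HasSuffix(name, ".blocks"):
			base := strings.TrimSuffix(name, ".blocks")
			p := pairs[base]
			p.Blocks = name
			pairs[base] = p
		case strings.HasSuffix(name, ".scales"):
			base := strings.TrimSuffix(name, ".scales")
			p := pairs[base]
			p.Scales = name
			pairs[base] = p
		default:
			rest = append(rest, name)
		}
	}
	return pairs, rest
}

func main() {
	pairs, rest := pairMXFP4([]string{
		"blk.0.ffn_down_exps.blocks",
		"blk.0.ffn_down_exps.scales",
		"blk.0.attn_q.weight",
	})
	fmt.Println(pairs["blk.0.ffn_down_exps"].complete(), rest)
}
```

A pair for which complete() returns false signals a checkpoint with a missing half, which a converter would likely want to reject before writing output.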

Gate-Up Expert Splitting

Interleaved gate_up_exps tensors are split into separate gate_exps and up_exps by striding along the expert dimension (even indices for gate, odd for up). This applies to both MXFP4 and regular float tensors, as well as bias tensors.
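
The even/odd split can be sketched on a one-dimensional sequence. The function name splitInterleaved is hypothetical, and the sketch does not model tensor shapes or choose which axis is strided; it only shows the stride-2 selection described above.

```go
package main

import "fmt"

// splitInterleaved separates a sequence whose entries alternate
// gate, up, gate, up, ... into two sequences. Even indices go to gate,
// odd indices go to up.
func splitInterleaved[T any](xs []T) (gate, up []T) {
	gate = make([]T, 0, (len(xs)+1)/2)
	up = make([]T, 0, len(xs)/2)
	for i, x := range xs {
		if i%2 == 0 {
			gate = append(gate, x)
		} else {
			up = append(up, x)
		}
	}
	return gate, up
}

func main() {
	gate, up := splitInterleaved([]string{"g0", "u0", "g1", "u1"})
	fmt.Println(gate, up)
}
```

Because the function is generic, the same selection works for rows of float values, packed MXFP4 blocks, or bias entries.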

Custom Token IDs

The tokenizer sets specific token IDs:

  • BOS: 199998 (<|startoftext|>)
  • EOS: 199999 (<|endoftext|>)
  • Additional EOS tokens: 200002 (<|return|>), 200012 (<|call|>)
  • Both add_bos_token and add_eos_token are set to false.
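
A consumer of the converted model has to treat all three end-of-sequence IDs as stop tokens. The following sketch uses the IDs listed above; the constant names and the isEOS helper are hypothetical.

```go
package main

import "fmt"

// Token IDs as listed in the text.
const (
	bosID    uint32 = 199998 // <|startoftext|>
	eosID    uint32 = 199999 // <|endoftext|>
	returnID uint32 = 200002 // <|return|>
	callID   uint32 = 200012 // <|call|>
)

// eosIDs collects the primary and additional end-of-sequence tokens.
var eosIDs = []uint32{eosID, returnID, callID}

// isEOS reports whether id should end generation.
func isEOS(id uint32) bool {
	for _, e := range eosIDs {
		if id == e {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isEOS(callID), isEOS(bosID))
}
```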

Implementation Notes

The conversion is implemented in convert/convert_gptoss.go via the gptossModel struct. The mxfp4 struct implements io.WriterTo for custom serialization of MXFP4 tensors. The dual-flavor support uses a conditional in Replacements() based on whether HuggingFace-style config keys are present.
