Principle:Ollama Ollama GGUF Model Conversion Phi3

From Leeroopedia
Knowledge Sources
Domains: Model Conversion, Phi
Last Updated: 2025-02-15 00:00 GMT

Overview

Phi-3 conversion transforms Microsoft's Phi-3 models from HuggingFace SafeTensors to GGUF format. The converter handles three architecture-specific features: long context support via SuRoPE (Scaled Unified Rotary Position Embedding), which uses separate long and short rope scaling factors; fused QKV and gate-up projections; and sliding window attention.

Core Concepts

Tensor Name Mapping

The converter applies the following HuggingFace-to-GGUF tensor name replacements:

  • lm_head -> output
  • model.embed_tokens -> token_embd
  • model.norm -> output_norm
  • model.layers -> blk
  • input_layernorm -> attn_norm
  • self_attn.qkv_proj -> attn_qkv (fused QKV)
  • self_attn.o_proj -> attn_output
  • mlp.down_proj -> ffn_down
  • mlp.gate_up_proj -> ffn_up (fused gate+up)
  • post_attention_layernorm -> ffn_norm
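
The mapping above can be sketched with the standard library's strings.NewReplacer. This is a minimal illustration, not the converter's actual code: the names phi3Replacements and renameTensor are hypothetical, and the pair order simply follows the list above.

```go
package main

import (
	"fmt"
	"strings"
)

// phi3Replacements holds old/new pairs in the order listed in the mapping.
// The variable name is hypothetical and chosen for this sketch only.
var phi3Replacements = []string{
	"lm_head", "output",
	"model.embed_tokens", "token_embd",
	"model.norm", "output_norm",
	"model.layers", "blk",
	"input_layernorm", "attn_norm",
	"self_attn.qkv_proj", "attn_qkv",
	"self_attn.o_proj", "attn_output",
	"mlp.down_proj", "ffn_down",
	"mlp.gate_up_proj", "ffn_up",
	"post_attention_layernorm", "ffn_norm",
}

// renameTensor rewrites a HuggingFace tensor name into its GGUF form.
// Replaced text is not rescanned, so "output" is never rewritten again.
func renameTensor(name string) string {
	return strings.NewReplacer(phi3Replacements...).Replace(name)
}

func main() {
	for _, n := range []string{
		"model.embed_tokens.weight",
		"model.layers.0.self_attn.qkv_proj.weight",
		"model.layers.3.mlp.gate_up_proj.weight",
		"lm_head.weight",
	} {
		fmt.Printf("%s -> %s\n", n, renameTensor(n))
	}
}
```

Because the replacer substitutes substrings, the layer index and the trailing .weight suffix pass through unchanged.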

Architecture-Specific Hyperparameters

The GGUF metadata is written under the phi3.* namespace:

  • phi3.context_length -- max position embeddings (extended, e.g. 128K)
  • phi3.embedding_length -- hidden size
  • phi3.feed_forward_length -- intermediate size
  • phi3.block_count -- number of hidden layers
  • phi3.attention.head_count, phi3.attention.head_count_kv -- number of attention heads and number of key/value heads
  • phi3.attention.layer_norm_rms_epsilon -- RMSNorm epsilon
  • phi3.rope.dimension_count -- derived from hidden_size / num_heads
  • phi3.rope.freq_base -- RoPE theta
  • phi3.rope.scaling.original_context_length -- original context length before scaling
  • phi3.rope.scaling.attn_factor -- computed attention scaling factor
  • phi3.attention.sliding_window -- sliding window size
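
As a rough sketch of how these keys could be populated, the example below builds a key/value map from a configuration struct. The phi3Config struct, the phi3KV function, and the numeric values in main are all assumptions made for illustration; attn_factor is left out because it is computed separately.

```go
package main

import "fmt"

// phi3Config is a hypothetical, trimmed-down view of a Phi-3 config.
type phi3Config struct {
	MaxPositionEmbeddings         uint32
	OriginalMaxPositionEmbeddings uint32
	HiddenSize                    uint32
	IntermediateSize              uint32
	NumHiddenLayers               uint32
	NumAttentionHeads             uint32
	NumKeyValueHeads              uint32
	RMSNormEPS                    float32
	RopeTheta                     float32
	SlidingWindow                 uint32
}

// phi3KV maps the config fields onto the phi3.* metadata keys.
func phi3KV(c phi3Config) map[string]any {
	return map[string]any{
		"phi3.context_length":                       c.MaxPositionEmbeddings,
		"phi3.embedding_length":                     c.HiddenSize,
		"phi3.feed_forward_length":                  c.IntermediateSize,
		"phi3.block_count":                          c.NumHiddenLayers,
		"phi3.attention.head_count":                 c.NumAttentionHeads,
		"phi3.attention.head_count_kv":              c.NumKeyValueHeads,
		"phi3.attention.layer_norm_rms_epsilon":     c.RMSNormEPS,
		"phi3.rope.dimension_count":                 c.HiddenSize / c.NumAttentionHeads,
		"phi3.rope.freq_base":                       c.RopeTheta,
		"phi3.rope.scaling.original_context_length": c.OriginalMaxPositionEmbeddings,
		"phi3.attention.sliding_window":             c.SlidingWindow,
	}
}

func main() {
	// Illustrative values only; real checkpoints supply their own.
	kv := phi3KV(phi3Config{
		MaxPositionEmbeddings:         131072,
		OriginalMaxPositionEmbeddings: 4096,
		HiddenSize:                    3072,
		IntermediateSize:              8192,
		NumHiddenLayers:               32,
		NumAttentionHeads:             32,
		NumKeyValueHeads:              32,
		RMSNormEPS:                    1e-5,
		RopeTheta:                     10000,
		SlidingWindow:                 2048,
	})
	fmt.Println("rope dims:", kv["phi3.rope.dimension_count"])
}
```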

Special Handling

SuRoPE Long/Short Factors

Phi-3 uses a dual-factor RoPE scaling scheme with separate long_factor and short_factor arrays. These are stored as explicit GGUF tensors:

  • rope_factors_long.weight -- per-dimension long-context scaling factors
  • rope_factors_short.weight -- per-dimension short-context scaling factors

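These tensors are injected immediately before the first block's tensors. A sync.Once guard ensures they are added exactly once, even though every tensor of the first block triggers the check.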
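
The injection pattern can be sketched as follows. This is a simplified illustration that works on tensor names only; the injectRopeFactors function and the assumption that first-block tensors are recognized by the "blk.0." prefix are not taken from the source.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// injectRopeFactors returns names with the two rope factor tensors inserted
// just before the first tensor of block 0. Matching on the "blk.0." prefix
// is an assumption made for this sketch.
func injectRopeFactors(names []string) []string {
	var addRopeFactors sync.Once

	out := make([]string, 0, len(names)+2)
	for _, name := range names {
		if strings.HasPrefix(name, "blk.0.") {
			// Do runs the closure only on the first matching tensor.
			addRopeFactors.Do(func() {
				out = append(out,
					"rope_factors_long.weight",
					"rope_factors_short.weight",
				)
			})
		}
		out = append(out, name)
	}
	return out
}

func main() {
	in := []string{
		"token_embd.weight",
		"blk.0.attn_norm.weight",
		"blk.0.attn_qkv.weight",
		"blk.1.attn_norm.weight",
	}
	for _, n := range injectRopeFactors(in) {
		fmt.Println(n)
	}
}
```

A plain boolean flag would work equally well in a single-goroutine loop; sync.Once simply makes the "only once" intent explicit.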

Attention Factor Computation

The attn_factor is computed at conversion time based on the scaling type:

  • su/longrope: max(sqrt(1 + ln(scale) / ln(original_context)), 1.0)
  • yarn: max(0.1 * ln(scale) + 1.0, 1.0)

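where scale = max_position_embeddings / original_max_position_embeddings and original_context = original_max_position_embeddings. In both cases the factor is clamped so that it never falls below 1.0.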
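
The two formulas can be written out as a small function. This is a sketch under stated assumptions: the attnFactor name, its signature, and the choice to report unknown scaling types through a boolean are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// attnFactor evaluates the formulas above for a given scaling type.
// The second return value is false for scaling types not covered here.
func attnFactor(scalingType string, maxPos, origPos uint32) (float64, bool) {
	scale := float64(maxPos) / float64(origPos)

	switch scalingType {
	case "su", "longrope":
		f := math.Sqrt(1 + math.Log(scale)/math.Log(float64(origPos)))
		return math.Max(f, 1.0), true
	case "yarn":
		return math.Max(0.1*math.Log(scale)+1.0, 1.0), true
	default:
		return 0, false
	}
}

func main() {
	// Illustrative context lengths: 4096 extended to 131072 (scale = 32).
	for _, t := range []string{"su", "longrope", "yarn"} {
		f, _ := attnFactor(t, 131072, 4096)
		fmt.Printf("%-8s attn_factor = %.4f\n", t, f)
	}
}
```

With scale = 32 and an original context of 4096, ln(32)/ln(4096) is exactly 5/12, so the su/longrope factor is sqrt(17/12), roughly 1.19.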

Fused Projections

Phi-3 uses fused QKV projections (attn_qkv instead of separate Q/K/V) and fused gate-up projections (ffn_up containing both gate and up). These are stored as single tensors in GGUF without splitting.
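
To make the fused layout concrete, the sketch below computes the expected row counts of the two fused tensors from the model dimensions. The helper names are hypothetical, and the sizes follow the usual Phi-3 layout in which Q, K, and V (respectively gate and up) are concatenated along the output dimension; treat that layout as an assumption rather than something this page states.

```go
package main

import "fmt"

// fusedQKVRows returns the output dimension of a fused QKV projection:
// Q rows for every attention head plus K and V rows for every KV head.
func fusedQKVRows(hiddenSize, numHeads, numKVHeads uint32) uint32 {
	headDim := hiddenSize / numHeads
	return numHeads*headDim + 2*numKVHeads*headDim
}

// fusedGateUpRows returns the output dimension of a fused gate+up projection.
func fusedGateUpRows(intermediateSize uint32) uint32 {
	return 2 * intermediateSize
}

func main() {
	// Illustrative dimensions only.
	fmt.Println("attn_qkv rows:", fusedQKVRows(3072, 32, 32))
	fmt.Println("ffn_up rows:  ", fusedGateUpRows(8192))
}
```

A check like this can help confirm that a tensor was carried over whole rather than split during conversion.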

Rope Factor as Custom Type

The ropeFactor type implements io.WriterTo so that a float32 array can be serialized directly to binary. It is used by Phi-3 and by other converters that reference this type.
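
A type of this shape might look like the following. This is a minimal sketch, not the converter's definition: the little-endian byte order, the returned byte count, and the encode helper are assumptions made for illustration.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// ropeFactor is a slice of per-dimension scaling factors.
type ropeFactor []float32

// WriteTo serializes the factors as raw float32 values.
// Little-endian order is assumed in this sketch.
func (r ropeFactor) WriteTo(w io.Writer) (int64, error) {
	if err := binary.Write(w, binary.LittleEndian, []float32(r)); err != nil {
		return 0, err
	}
	return int64(len(r)) * 4, nil
}

// Compile-time check that ropeFactor satisfies io.WriterTo.
var _ io.WriterTo = ropeFactor(nil)

// encode is a hypothetical helper that collects the serialized bytes.
func encode(r ropeFactor) ([]byte, error) {
	var buf bytes.Buffer
	if _, err := r.WriteTo(&buf); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	b, err := encode(ropeFactor{1.0, 1.5})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d bytes: % x\n", len(b), b)
}
```

Implementing io.WriterTo lets the value be handed to any writer-based serializer without first copying it into an intermediate byte slice.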

Implementation Notes

The conversion is implemented in convert/convert_phi3.go via the phi3Model struct. The ropeFactor type defined here is reused by other converters (such as Qwen3) that need similar rope factor tensor support. The converter supports multiple config key aliases (e.g., n_layers vs num_hidden_layers) for compatibility with different Phi-3 checkpoint formats.
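
Alias handling of this kind can be sketched by decoding both keys and taking the first one that is set. The layerCountConfig struct, the firstNonZero helper, and the precedence of num_hidden_layers over n_layers are assumptions for this example, not details confirmed by the source.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// layerCountConfig decodes both spellings of the layer count key.
type layerCountConfig struct {
	NumHiddenLayers uint32 `json:"num_hidden_layers"`
	NLayers         uint32 `json:"n_layers"`
}

// firstNonZero returns the first non-zero value, or 0 if all are zero.
func firstNonZero(vals ...uint32) uint32 {
	for _, v := range vals {
		if v != 0 {
			return v
		}
	}
	return 0
}

// blockCount resolves the layer count from whichever alias is present.
func blockCount(raw []byte) (uint32, error) {
	var c layerCountConfig
	if err := json.Unmarshal(raw, &c); err != nil {
		return 0, err
	}
	return firstNonZero(c.NumHiddenLayers, c.NLayers), nil
}

func main() {
	for _, cfg := range []string{
		`{"num_hidden_layers": 32}`,
		`{"n_layers": 40}`,
	} {
		n, err := blockCount([]byte(cfg))
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s -> block_count %d\n", cfg, n)
	}
}
```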
