Principle:Ollama Ollama GGUF Model Conversion Phi3
| Knowledge Sources | |
|---|---|
| Domains | Model Conversion, Phi |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
Phi-3 conversion transforms Microsoft's Phi-3 models from HuggingFace SafeTensors to GGUF format. It handles three features of the architecture: long context support via SuRoPE (Scaled Unified Rotary Position Embedding), which uses separate long and short rope scaling factors; fused QKV and gate-up projections; and sliding window attention.
Core Concepts
Tensor Name Mapping
The converter applies the following HuggingFace-to-GGUF tensor name replacements:
- lm_head -> output
- model.embed_tokens -> token_embd
- model.norm -> output_norm
- model.layers -> blk
- input_layernorm -> attn_norm
- self_attn.qkv_proj -> attn_qkv (fused QKV)
- self_attn.o_proj -> attn_output
- mlp.down_proj -> ffn_down
- mlp.gate_up_proj -> ffn_up (fused gate+up)
- post_attention_layernorm -> ffn_norm
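The mapping above can be sketched as a plain substring replacement. This is a minimal illustration, not the converter's actual code: `renameTensor` and `phi3Replacements` are hypothetical names, and using `strings.NewReplacer` is an assumption about how such a table might be applied.

```go
package main

import (
	"fmt"
	"strings"
)

// phi3Replacements holds the HuggingFace-to-GGUF pairs listed above as
// alternating old/new arguments, the form strings.NewReplacer expects.
var phi3Replacements = []string{
	"lm_head", "output",
	"model.embed_tokens", "token_embd",
	"model.norm", "output_norm",
	"model.layers", "blk",
	"input_layernorm", "attn_norm",
	"self_attn.qkv_proj", "attn_qkv",
	"self_attn.o_proj", "attn_output",
	"mlp.down_proj", "ffn_down",
	"mlp.gate_up_proj", "ffn_up",
	"post_attention_layernorm", "ffn_norm",
}

// renameTensor is a hypothetical helper that rewrites one HuggingFace
// tensor name into its GGUF form. Parts of the name that match no pair
// (layer index, ".weight" suffix) pass through unchanged.
func renameTensor(name string) string {
	return strings.NewReplacer(phi3Replacements...).Replace(name)
}

func main() {
	fmt.Println(renameTensor("model.layers.0.self_attn.qkv_proj.weight"))
	fmt.Println(renameTensor("model.layers.3.mlp.gate_up_proj.weight"))
	fmt.Println(renameTensor("lm_head.weight"))
}
```

Because the replacer scans the input once and does not rescan its own output, a replacement such as `output` is never rewritten a second time.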
Architecture-Specific Hyperparameters
The GGUF metadata is written under the phi3.* namespace:
- phi3.context_length -- max position embeddings (extended, e.g. 128K)
- phi3.embedding_length -- hidden size
- phi3.feed_forward_length -- intermediate size
- phi3.block_count -- number of hidden layers
- phi3.attention.head_count, phi3.attention.head_count_kv
- phi3.attention.layer_norm_rms_epsilon -- RMSNorm epsilon
- phi3.rope.dimension_count -- derived from hidden_size / num_heads
- phi3.rope.freq_base -- RoPE theta
- phi3.rope.scaling.original_context_length -- original context length before scaling
- phi3.rope.scaling.attn_factor -- computed attention scaling factor
- phi3.attention.sliding_window -- sliding window size
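As a rough sketch of how these keys could be populated, the following builds a metadata map from a trimmed-down config. The `phi3Config` struct, its field names, and the `phi3KV` function are hypothetical, and the numbers in `main` are illustrative values only. The attn_factor key is left out here because it depends on the scaling type.

```go
package main

import "fmt"

// phi3Config is a hypothetical, reduced view of the checkpoint config.
type phi3Config struct {
	MaxPositionEmbeddings         uint32
	HiddenSize                    uint32
	IntermediateSize              uint32
	NumHiddenLayers               uint32
	NumAttentionHeads             uint32
	NumKeyValueHeads              uint32
	RMSNormEps                    float32
	RopeTheta                     float32
	OriginalMaxPositionEmbeddings uint32
	SlidingWindow                 uint32
}

// phi3KV maps config fields onto the phi3.* metadata keys listed above.
// It assumes NumAttentionHeads is non-zero; a real converter would
// validate the config before dividing.
func phi3KV(c phi3Config) map[string]any {
	return map[string]any{
		"phi3.context_length":                       c.MaxPositionEmbeddings,
		"phi3.embedding_length":                     c.HiddenSize,
		"phi3.feed_forward_length":                  c.IntermediateSize,
		"phi3.block_count":                          c.NumHiddenLayers,
		"phi3.attention.head_count":                 c.NumAttentionHeads,
		"phi3.attention.head_count_kv":              c.NumKeyValueHeads,
		"phi3.attention.layer_norm_rms_epsilon":     c.RMSNormEps,
		"phi3.rope.dimension_count":                 c.HiddenSize / c.NumAttentionHeads,
		"phi3.rope.freq_base":                       c.RopeTheta,
		"phi3.rope.scaling.original_context_length": c.OriginalMaxPositionEmbeddings,
		"phi3.attention.sliding_window":             c.SlidingWindow,
	}
}

func main() {
	// Illustrative values only, not read from a real checkpoint.
	cfg := phi3Config{
		MaxPositionEmbeddings:         131072,
		HiddenSize:                    3072,
		IntermediateSize:              8192,
		NumHiddenLayers:               32,
		NumAttentionHeads:             32,
		NumKeyValueHeads:              32,
		RMSNormEps:                    1e-5,
		RopeTheta:                     10000,
		OriginalMaxPositionEmbeddings: 4096,
		SlidingWindow:                 262144,
	}
	kv := phi3KV(cfg)
	fmt.Println("rope dims:", kv["phi3.rope.dimension_count"])
}
```

With a hidden size of 3072 and 32 heads, the derived rope dimension count is 3072 / 32 = 96.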
Special Handling
SuRoPE Long/Short Factors
Phi-3 uses a dual-factor RoPE scaling scheme with separate long_factor and short_factor arrays. These are stored as explicit GGUF tensors:
- rope_factors_long.weight -- per-dimension long-context scaling factors
- rope_factors_short.weight -- per-dimension short-context scaling factors
These tensors are injected before the first block's tensors; a sync.Once guard ensures they are added exactly once.
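The injection pattern might look roughly like the sketch below. The `tensor` struct and `injectRopeFactors` function are hypothetical stand-ins, and detecting the first block by the `blk.0.` name prefix is an assumption made for illustration.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// tensor is a hypothetical, minimal stand-in for a converted tensor.
type tensor struct {
	Name string
	Data []float32
}

// injectRopeFactors copies the input tensors and inserts the two rope
// factor tensors in front of the first tensor that belongs to block 0.
// The sync.Once guard keeps the insertion from repeating for the
// remaining block-0 tensors.
func injectRopeFactors(in []tensor, long, short []float32) []tensor {
	var once sync.Once
	out := make([]tensor, 0, len(in)+2)
	for _, t := range in {
		if strings.HasPrefix(t.Name, "blk.0.") {
			once.Do(func() {
				out = append(out,
					tensor{Name: "rope_factors_long.weight", Data: long},
					tensor{Name: "rope_factors_short.weight", Data: short},
				)
			})
		}
		out = append(out, t)
	}
	return out
}

func main() {
	in := []tensor{
		{Name: "token_embd.weight"},
		{Name: "blk.0.attn_norm.weight"},
		{Name: "blk.0.attn_qkv.weight"},
		{Name: "blk.1.attn_norm.weight"},
	}
	for _, t := range injectRopeFactors(in, []float32{1, 2}, []float32{1, 1}) {
		fmt.Println(t.Name)
	}
}
```

A boolean flag would work equally well in a single-goroutine loop; sync.Once simply makes the "only once" intent explicit.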
Attention Factor Computation
The attn_factor is computed at conversion time based on the scaling type:
- su/longrope: max(sqrt(1 + ln(scale) / ln(original_context)), 1.0)
- yarn: max(0.1 * ln(scale) + 1.0, 1.0)
where scale = max_position_embeddings / original_max_position_embeddings and original_context is original_max_position_embeddings.
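The two formulas can be written out as a small function. This is a sketch under stated assumptions: `attnFactor` is a hypothetical name, returning an error for an unrecognized scaling type is an illustrative choice, and original_context is taken to be original_max_position_embeddings.

```go
package main

import (
	"fmt"
	"math"
)

// attnFactor evaluates the attention scaling factor for the given rope
// scaling type. maxPos and origMaxPos correspond to
// max_position_embeddings and original_max_position_embeddings.
func attnFactor(scalingType string, maxPos, origMaxPos uint32) (float32, error) {
	scale := float64(maxPos) / float64(origMaxPos)
	switch scalingType {
	case "su", "longrope":
		f := math.Sqrt(1 + math.Log(scale)/math.Log(float64(origMaxPos)))
		return float32(math.Max(f, 1.0)), nil
	case "yarn":
		return float32(math.Max(0.1*math.Log(scale)+1.0, 1.0)), nil
	default:
		return 0, fmt.Errorf("unknown rope scaling type %q", scalingType)
	}
}

func main() {
	// Extending a 4096-token context to 131072 tokens gives scale = 32.
	for _, typ := range []string{"longrope", "yarn"} {
		f, err := attnFactor(typ, 131072, 4096)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s: %.4f\n", typ, f)
	}
}
```

For scale = 32 and an original context of 4096, ln(32) / ln(4096) = 5/12, so the su/longrope factor is sqrt(17/12), roughly 1.19. When the context is not extended (scale = 1), both branches clamp to 1.0.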
Fused Projections
Phi-3 uses fused QKV projections (attn_qkv instead of separate Q/K/V) and fused gate-up projections (ffn_up containing both gate and up). These are stored as single tensors in GGUF without splitting.
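To make "fused" concrete, the sketch below computes the output dimension of each fused tensor. The layout is an assumption taken from the HuggingFace Phi-3 reference model rather than from the converter: Q, K and V are concatenated along the output dimension, as are gate and up. The function names are hypothetical.

```go
package main

import "fmt"

// fusedQKVRows returns the output dimension of a fused QKV projection:
// one head_dim-sized slice per query head plus one each for the key and
// value heads.
func fusedQKVRows(numHeads, numKVHeads, headDim uint32) uint32 {
	return (numHeads + 2*numKVHeads) * headDim
}

// fusedGateUpRows returns the output dimension of a fused gate-up
// projection, which stacks the gate and up matrices.
func fusedGateUpRows(intermediateSize uint32) uint32 {
	return 2 * intermediateSize
}

func main() {
	// Illustrative sizes only: 32 heads, 32 KV heads, head_dim 96,
	// intermediate size 8192.
	fmt.Println("attn_qkv rows:", fusedQKVRows(32, 32, 96))
	fmt.Println("ffn_up rows:", fusedGateUpRows(8192))
}
```

Since the tensors are stored unsplit, any code that consumes the GGUF file has to slice these rows apart itself.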
Rope Factor as Custom Type
The ropeFactor type implements io.WriterTo, which lets a float32 array be serialized directly to binary. Phi-3 uses it, as do other converters that reference the type.
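A type of this shape could be sketched as follows. The little-endian byte order and the returned byte count are assumptions made for illustration, and `encodeRopeFactor` is a hypothetical helper added only to exercise the method.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// ropeFactor is a slice of float32 scaling factors that knows how to
// write itself out as raw binary.
type ropeFactor []float32

// WriteTo serializes the factors as consecutive 4-byte floats.
// Little-endian order is an assumption in this sketch.
func (r ropeFactor) WriteTo(w io.Writer) (int64, error) {
	if err := binary.Write(w, binary.LittleEndian, []float32(r)); err != nil {
		return 0, err
	}
	return int64(len(r) * 4), nil
}

// Compile-time check that ropeFactor satisfies io.WriterTo.
var _ io.WriterTo = ropeFactor(nil)

// encodeRopeFactor is a hypothetical helper that returns the encoded
// bytes, mainly for inspection and testing.
func encodeRopeFactor(r ropeFactor) ([]byte, error) {
	var buf bytes.Buffer
	if _, err := r.WriteTo(&buf); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	b, err := encodeRopeFactor(ropeFactor{1.0, 1.5, 2.0})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%d bytes: % x\n", len(b), b)
}
```

Implementing io.WriterTo means any writer-based pipeline can stream the factors without first copying them into an intermediate buffer.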
Implementation Notes
The conversion is implemented in convert/convert_phi3.go via the phi3Model struct. The ropeFactor type defined here is reused by other converters (such as Qwen3) that need similar rope factor tensor support. The converter supports multiple config key aliases (e.g., n_layers vs num_hidden_layers) for compatibility with different Phi-3 checkpoint formats.
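One way to support such aliases is to decode both keys and take the first non-zero value. The sketch below is an illustration, not the converter's actual code: `phi3Params`, `firstNonZero` and `blockCount` are hypothetical names, and the precedence (n_layers before num_hidden_layers) is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// phi3Params is a hypothetical, reduced config struct that declares
// both spellings of the layer-count key.
type phi3Params struct {
	NumHiddenLayers uint32 `json:"num_hidden_layers"`
	NLayers         uint32 `json:"n_layers"`
}

// firstNonZero returns the first argument that is not zero, or zero if
// every argument is zero.
func firstNonZero(vals ...uint32) uint32 {
	for _, v := range vals {
		if v != 0 {
			return v
		}
	}
	return 0
}

// blockCount parses a config document and resolves the layer count
// from whichever alias is present.
func blockCount(configJSON []byte) (uint32, error) {
	var p phi3Params
	if err := json.Unmarshal(configJSON, &p); err != nil {
		return 0, err
	}
	return firstNonZero(p.NLayers, p.NumHiddenLayers), nil
}

func main() {
	for _, doc := range []string{`{"n_layers": 32}`, `{"num_hidden_layers": 40}`} {
		n, err := blockCount([]byte(doc))
		fmt.Println(doc, "->", n, err)
	}
}
```

Keys missing from the JSON leave their fields at zero, which is what lets the fallback chain pick up the alias that is actually present.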