Principle:Ollama Ollama GGUF Model Conversion Phi3

Knowledge Sources	Ollama
Domains	Model Conversion, Phi
Last Updated	2025-02-15 00:00 GMT

Overview

Phi-3 conversion transforms Microsoft's Phi-3 models from HuggingFace SafeTensors to GGUF format. The converter handles three architecture-specific features: long context support via SuRoPE (Scaled Unified Rotary Position Embedding), which uses separate long and short rope scaling factors; fused QKV and gate-up projections; and sliding window attention.

Core Concepts

Tensor Name Mapping

The converter applies the following HuggingFace-to-GGUF tensor name replacements:

lm_head -> output
model.embed_tokens -> token_embd
model.norm -> output_norm
model.layers -> blk
input_layernorm -> attn_norm
self_attn.qkv_proj -> attn_qkv (fused QKV)
self_attn.o_proj -> attn_output
mlp.down_proj -> ffn_down
mlp.gate_up_proj -> ffn_up (fused gate+up)
post_attention_layernorm -> ffn_norm
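
The mapping above amounts to an ordered set of substring replacements. The following is a minimal sketch using strings.NewReplacer from the Go standard library; the names phi3Replacer and ggufName are hypothetical and chosen for illustration, not taken from the converter.

```go
package main

import (
	"fmt"
	"strings"
)

// phi3Replacer is a hypothetical helper. The old/new pairs are copied from
// the mapping table; the parenthetical notes in the table are not part of
// the replacement strings.
var phi3Replacer = strings.NewReplacer(
	"lm_head", "output",
	"model.embed_tokens", "token_embd",
	"model.norm", "output_norm",
	"model.layers", "blk",
	"input_layernorm", "attn_norm",
	"self_attn.qkv_proj", "attn_qkv",
	"self_attn.o_proj", "attn_output",
	"mlp.down_proj", "ffn_down",
	"mlp.gate_up_proj", "ffn_up",
	"post_attention_layernorm", "ffn_norm",
)

// ggufName rewrites a HuggingFace tensor name into its GGUF form.
func ggufName(hfName string) string {
	return phi3Replacer.Replace(hfName)
}

func main() {
	for _, name := range []string{
		"model.embed_tokens.weight",
		"model.layers.0.self_attn.qkv_proj.weight",
		"model.layers.3.mlp.gate_up_proj.weight",
		"model.norm.weight",
		"lm_head.weight",
	} {
		fmt.Printf("%s -> %s\n", name, ggufName(name))
	}
}
```

Because the layer index sits between two replaced substrings, a name such as model.layers.0.self_attn.qkv_proj.weight becomes blk.0.attn_qkv.weight in a single pass.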

Architecture-Specific Hyperparameters

The GGUF metadata is written under the phi3.* namespace:

phi3.context_length -- max position embeddings (extended, e.g. 128K)
phi3.embedding_length -- hidden size
phi3.feed_forward_length -- intermediate size
phi3.block_count -- number of hidden layers
phi3.attention.head_count, phi3.attention.head_count_kv -- number of attention heads and number of key/value heads
phi3.attention.layer_norm_rms_epsilon -- RMSNorm epsilon
phi3.rope.dimension_count -- derived from hidden_size / num_heads
phi3.rope.freq_base -- RoPE theta
phi3.rope.scaling.original_context_length -- original context length before scaling
phi3.rope.scaling.attn_factor -- computed attention scaling factor
phi3.attention.sliding_window -- sliding window size
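
As a rough illustration of how these keys relate to the checkpoint configuration, the sketch below builds a key/value map from a trimmed-down config struct. The struct phi3Config, the function phi3KV, and the numeric values in main are assumptions made for the example. The computed attn_factor key is left out here because it depends on the scaling type.

```go
package main

import "fmt"

// phi3Config is a hypothetical, trimmed-down view of a Phi-3 config.json.
type phi3Config struct {
	MaxPositionEmbeddings         uint32
	OriginalMaxPositionEmbeddings uint32
	HiddenSize                    uint32
	IntermediateSize              uint32
	NumHiddenLayers               uint32
	NumAttentionHeads             uint32
	NumKeyValueHeads              uint32
	RMSNormEps                    float32
	RopeTheta                     float32
	SlidingWindow                 uint32
}

// phi3KV maps config fields onto the phi3.* metadata keys listed above.
// NumAttentionHeads must be non-zero, since it is used as a divisor.
func phi3KV(c phi3Config) map[string]any {
	return map[string]any{
		"phi3.context_length":                        c.MaxPositionEmbeddings,
		"phi3.embedding_length":                      c.HiddenSize,
		"phi3.feed_forward_length":                   c.IntermediateSize,
		"phi3.block_count":                           c.NumHiddenLayers,
		"phi3.attention.head_count":                  c.NumAttentionHeads,
		"phi3.attention.head_count_kv":               c.NumKeyValueHeads,
		"phi3.attention.layer_norm_rms_epsilon":      c.RMSNormEps,
		"phi3.rope.dimension_count":                  c.HiddenSize / c.NumAttentionHeads,
		"phi3.rope.freq_base":                        c.RopeTheta,
		"phi3.rope.scaling.original_context_length":  c.OriginalMaxPositionEmbeddings,
		"phi3.attention.sliding_window":              c.SlidingWindow,
	}
}

func main() {
	// Illustrative values only; real checkpoints may differ.
	kv := phi3KV(phi3Config{
		MaxPositionEmbeddings:         131072,
		OriginalMaxPositionEmbeddings: 4096,
		HiddenSize:                    3072,
		IntermediateSize:              8192,
		NumHiddenLayers:               32,
		NumAttentionHeads:             32,
		NumKeyValueHeads:              32,
		RMSNormEps:                    1e-5,
		RopeTheta:                     10000,
		SlidingWindow:                 2047,
	})
	fmt.Println("rope dimension count:", kv["phi3.rope.dimension_count"])
	fmt.Println("context length:", kv["phi3.context_length"])
}
```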

Special Handling

SuRoPE Long/Short Factors

Phi-3 uses a dual-factor RoPE scaling scheme with separate long_factor and short_factor arrays. These are stored as explicit GGUF tensors:

rope_factors_long.weight -- per-dimension long-context scaling factors
rope_factors_short.weight -- per-dimension short-context scaling factors

These tensors are injected before the first block's tensors using a sync.Once guard.
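
The injection pattern can be sketched as follows. To stay self-contained, this version works on tensor names only; injectRopeFactors is a hypothetical helper, and it assumes that first-block tensors are recognized by the blk.0. name prefix.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// injectRopeFactors copies names to the output and inserts the two rope
// factor tensors once, immediately before the first "blk.0." tensor.
func injectRopeFactors(names []string) []string {
	var once sync.Once

	out := make([]string, 0, len(names)+2)
	for _, name := range names {
		if strings.HasPrefix(name, "blk.0.") {
			// Do runs the closure on the first call only, so later
			// blk.0.* tensors do not trigger a second insertion.
			once.Do(func() {
				out = append(out, "rope_factors_long.weight", "rope_factors_short.weight")
			})
		}
		out = append(out, name)
	}
	return out
}

func main() {
	names := []string{
		"token_embd.weight",
		"blk.0.attn_norm.weight",
		"blk.0.attn_qkv.weight",
		"blk.1.attn_norm.weight",
	}
	for _, n := range injectRopeFactors(names) {
		fmt.Println(n)
	}
}
```

A plain boolean flag would work equally well in a single-goroutine loop; sync.Once simply makes the "at most once" intent explicit.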

Attention Factor Computation

The attn_factor is computed at conversion time based on the scaling type:

su/longrope: max(sqrt(1 + ln(scale) / ln(original_context)), 1.0)
yarn: max(0.1 * ln(scale) + 1.0, 1.0)

where scale = max_position_embeddings / original_max_position_embeddings and original_context is original_max_position_embeddings.
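
The two formulas can be written out as a small Go function. This is a sketch under stated assumptions: attnFactor is a hypothetical name, and returning false for any other scaling type is a choice made for this example rather than documented converter behavior.

```go
package main

import (
	"fmt"
	"math"
)

// attnFactor evaluates the formulas above. The boolean result reports
// whether the scaling type is one of the types covered by the formulas.
func attnFactor(scalingType string, maxPos, origPos uint32) (float64, bool) {
	scale := float64(maxPos) / float64(origPos)

	switch scalingType {
	case "su", "longrope":
		return math.Max(math.Sqrt(1+math.Log(scale)/math.Log(float64(origPos))), 1.0), true
	case "yarn":
		return math.Max(0.1*math.Log(scale)+1.0, 1.0), true
	default:
		return 0, false
	}
}

func main() {
	// Illustrative values: a 4K model extended to 128K, so scale = 32.
	for _, typ := range []string{"su", "yarn"} {
		f, _ := attnFactor(typ, 131072, 4096)
		fmt.Printf("%s: %.4f\n", typ, f)
	}
}
```

With scale = 32 and an original context of 4096, the su/longrope branch evaluates sqrt(1 + ln 32 / ln 4096) = sqrt(1 + 5/12), roughly 1.19. When the context is not extended (scale = 1), both branches return 1.0.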

Fused Projections

Phi-3 uses fused QKV projections (attn_qkv instead of separate Q/K/V) and fused gate-up projections (ffn_up containing both gate and up). These are stored as single tensors in GGUF without splitting.
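
Since the fused tensors are stored unsplit, a consumer has to know where each part begins. The sketch below computes the row ranges, assuming the HuggingFace Phi-3 layout in which qkv_proj stacks query, key, and value rows in that order and gate_up_proj stacks gate rows followed by up rows. That layout and the helper names are assumptions for illustration and are not stated in this page.

```go
package main

import "fmt"

// rowRange is a half-open [Start, End) range of rows in a fused weight.
type rowRange struct{ Start, End uint32 }

// qkvRanges returns the row ranges of Q, K, and V inside a fused QKV
// weight, assuming Q, K, V ordering and head_dim = hidden / heads.
func qkvRanges(hidden, heads, kvHeads uint32) (q, k, v rowRange) {
	headDim := hidden / heads
	qEnd := heads * headDim
	kEnd := qEnd + kvHeads*headDim
	vEnd := kEnd + kvHeads*headDim
	return rowRange{0, qEnd}, rowRange{qEnd, kEnd}, rowRange{kEnd, vEnd}
}

// gateUpRanges returns the row ranges of the gate and up halves inside a
// fused gate-up weight, assuming gate comes first.
func gateUpRanges(intermediate uint32) (gate, up rowRange) {
	return rowRange{0, intermediate}, rowRange{intermediate, 2 * intermediate}
}

func main() {
	// Illustrative sizes only.
	q, k, v := qkvRanges(3072, 32, 32)
	gate, up := gateUpRanges(8192)
	fmt.Println("q:", q, "k:", k, "v:", v)
	fmt.Println("gate:", gate, "up:", up)
}
```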

Rope Factor as Custom Type

The ropeFactor type implements io.WriterTo so that a float32 array can be serialized directly to its binary form. It is used by the Phi-3 converter and by other converters that need to emit rope factor tensors.
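
A type of this shape can be sketched in a few lines. The version below is an approximation rather than the converter's source: it assumes little-endian byte order and reports the number of bytes written, and the encode helper exists only to make the example easy to run.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// ropeFactor is a slice of float32 scaling factors that can write itself.
type ropeFactor []float32

// WriteTo serializes the factors as raw little-endian float32 values
// (byte order is an assumption of this sketch).
func (r ropeFactor) WriteTo(w io.Writer) (int64, error) {
	if err := binary.Write(w, binary.LittleEndian, []float32(r)); err != nil {
		return 0, err
	}
	return int64(4 * len(r)), nil
}

// Compile-time check that ropeFactor satisfies io.WriterTo.
var _ io.WriterTo = ropeFactor(nil)

// encode is a hypothetical convenience wrapper returning the raw bytes.
func encode(r ropeFactor) ([]byte, error) {
	var buf bytes.Buffer
	if _, err := r.WriteTo(&buf); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	b, err := encode(ropeFactor{1.0, 2.0})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d bytes: % x\n", len(b), b)
}
```

Implementing io.WriterTo lets a tensor writer stream the data without first copying it into an intermediate byte slice.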

Implementation Notes

The conversion is implemented in convert/convert_phi3.go via the phi3Model struct. The ropeFactor type defined here is reused by other converters (such as Qwen3) that need similar rope factor tensor support. The converter supports multiple config key aliases (e.g., n_layers vs num_hidden_layers) for compatibility with different Phi-3 checkpoint formats.
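
Alias handling of this kind can be sketched by decoding both spellings and taking the first non-zero value. The struct phi3Aliases and the helpers firstNonZero and blockCount are hypothetical; the standard library's cmp.Or (Go 1.22 and later) offers the same first-non-zero behavior.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// phi3Aliases decodes both spellings of the layer-count key.
type phi3Aliases struct {
	NumHiddenLayers uint32 `json:"num_hidden_layers"`
	NLayers         uint32 `json:"n_layers"`
}

// firstNonZero returns the first non-zero argument, or 0 if all are zero.
func firstNonZero(vals ...uint32) uint32 {
	for _, v := range vals {
		if v != 0 {
			return v
		}
	}
	return 0
}

// blockCount resolves the layer count from whichever alias is present,
// preferring num_hidden_layers when both are set.
func blockCount(configJSON []byte) (uint32, error) {
	var c phi3Aliases
	if err := json.Unmarshal(configJSON, &c); err != nil {
		return 0, err
	}
	return firstNonZero(c.NumHiddenLayers, c.NLayers), nil
}

func main() {
	for _, cfg := range []string{
		`{"num_hidden_layers": 32}`,
		`{"n_layers": 40}`,
	} {
		n, err := blockCount([]byte(cfg))
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s -> block_count %d\n", cfg, n)
	}
}
```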

Related Pages

Implementation:Ollama_Ollama_Convert_Phi3
