Principle:Ollama Ollama GGUF Model Conversion DeepSeek2

Knowledge Sources	Ollama
Domains	Model Conversion, DeepSeek
Last Updated	2025-02-15 00:00 GMT

Overview

The DeepSeek2 converter transforms DeepSeek-V2/V3 architecture models from HuggingFace SafeTensors checkpoints into GGUF format. Three areas need architecture-specific handling: Multi-head Latent Attention (MLA) projections, Mixture-of-Experts (MoE) routing with expert tensor merging, and YaRN RoPE scaling parameters.

Core Concepts

Tensor Name Mapping

The converter applies the following HuggingFace-to-GGUF tensor name replacements:

lm_head -> output
model.embed_tokens -> token_embd
model.norm -> output_norm
language_model. -> (stripped)
model.layers -> blk
input_layernorm -> attn_norm
self_attn.kv_a_proj_with_mqa -> attn_kv_a_mqa
self_attn.kv_a_layernorm -> attn_kv_a_norm
self_attn.kv_b_proj -> attn_kv_b
self_attn.q_a_proj -> attn_q_a
self_attn.q_a_layernorm -> attn_q_a_norm
self_attn.q_b_proj -> attn_q_b
self_attn.o_proj -> attn_output
post_attention_layernorm -> ffn_norm
mlp.shared_experts.down_proj -> ffn_down_shexp
mlp.shared_experts.gate_proj -> ffn_gate_shexp
mlp.shared_experts.up_proj -> ffn_up_shexp
mlp.gate.e_score_correction_bias -> exp_probs_b.bias
mlp.gate -> ffn_gate_inp
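
The mapping above can be sketched as a single substring replacer. This is a minimal sketch, assuming the pairs are applied with Go's strings.NewReplacer; the renameTensor helper is a hypothetical name used only for illustration. When two patterns match at the same position, strings.NewReplacer prefers the pair listed earlier, which is why mlp.gate.e_score_correction_bias has to come before the shorter mlp.gate.

```go
package main

import (
	"fmt"
	"strings"
)

// replacer holds the HuggingFace-to-GGUF pairs in the order listed on this page.
// Earlier pairs win when several patterns match at the same position.
var replacer = strings.NewReplacer(
	"lm_head", "output",
	"model.embed_tokens", "token_embd",
	"model.norm", "output_norm",
	"language_model.", "",
	"model.layers", "blk",
	"input_layernorm", "attn_norm",
	"self_attn.kv_a_proj_with_mqa", "attn_kv_a_mqa",
	"self_attn.kv_a_layernorm", "attn_kv_a_norm",
	"self_attn.kv_b_proj", "attn_kv_b",
	"self_attn.q_a_proj", "attn_q_a",
	"self_attn.q_a_layernorm", "attn_q_a_norm",
	"self_attn.q_b_proj", "attn_q_b",
	"self_attn.o_proj", "attn_output",
	"post_attention_layernorm", "ffn_norm",
	"mlp.shared_experts.down_proj", "ffn_down_shexp",
	"mlp.shared_experts.gate_proj", "ffn_gate_shexp",
	"mlp.shared_experts.up_proj", "ffn_up_shexp",
	"mlp.gate.e_score_correction_bias", "exp_probs_b.bias",
	"mlp.gate", "ffn_gate_inp",
)

// renameTensor is a hypothetical helper that maps one tensor name.
func renameTensor(name string) string {
	return replacer.Replace(name)
}

func main() {
	for _, name := range []string{
		"lm_head.weight",
		"model.layers.0.self_attn.kv_a_proj_with_mqa.weight",
		"model.layers.3.mlp.gate.e_score_correction_bias",
		"model.layers.3.mlp.gate.weight",
	} {
		fmt.Printf("%s -> %s\n", name, renameTensor(name))
	}
}
```

Per-expert tensors such as model.layers.N.mlp.experts.E.gate_proj.weight only have their model.layers prefix rewritten at this stage; they are dealt with by the expert merging step described later.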

Architecture-Specific Hyperparameters

The GGUF metadata is written under the deepseek2.* namespace:

deepseek2.block_count -- number of hidden layers
deepseek2.attention.head_count / head_count_kv -- Q and KV head counts
deepseek2.attention.key_length -- qk_nope_head_dim + qk_rope_head_dim
deepseek2.attention.kv_lora_rank -- KV LoRA rank for MLA compression
deepseek2.attention.q_lora_rank -- Q LoRA rank
deepseek2.attention.value_length -- V head dimension
deepseek2.expert_count / expert_used_count / expert_shared_count -- number of routed experts, experts selected per token, and shared experts
deepseek2.expert_gating_func -- 1 for softmax, 2 for sigmoid
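deepseek2.expert_weights_norm / expert_weights_scale -- whether the selected experts' routing weights are normalized, and the scaling factor applied to them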
deepseek2.leading_dense_block_count -- number of initial dense (non-MoE) layers
deepseek2.rope.dimension_count -- equals qk_rope_head_dim
deepseek2.rope.freq_base -- defaults to 10000.0
deepseek2.rope.scaling.* -- YaRN scaling parameters including yarn_log_multiplier
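
A few of the derived values above can be sketched as follows. This is a minimal sketch, not the converter's actual code: the hfConfig struct, its field names, the scoring-function strings, and the numeric values in main are all assumptions made for illustration. Only the key names, the key_length sum, the 1/2 gating codes, and the 10000.0 default come from the list above.

```go
package main

import "fmt"

// hfConfig is a hypothetical subset of the source model's configuration.
type hfConfig struct {
	NumHiddenLayers    uint32
	QKNopeHeadDim      uint32
	QKRopeHeadDim      uint32
	VHeadDim           uint32
	KVLoraRank         uint32
	QLoraRank          uint32
	FirstKDenseReplace uint32
	ScoringFunc        string  // assumed to be "softmax" or "sigmoid"
	RopeTheta          float32 // 0 means "not set"
}

// gatingFunc maps the scoring function name to the GGUF code: 1 softmax, 2 sigmoid.
// Returning 0 for anything else is an assumption of this sketch.
func gatingFunc(scoring string) uint32 {
	switch scoring {
	case "softmax":
		return 1
	case "sigmoid":
		return 2
	default:
		return 0
	}
}

// buildKV fills in a handful of deepseek2.* metadata keys.
func buildKV(c hfConfig) map[string]any {
	ropeBase := c.RopeTheta
	if ropeBase == 0 {
		ropeBase = 10000.0 // documented default
	}
	return map[string]any{
		"deepseek2.block_count":               c.NumHiddenLayers,
		"deepseek2.attention.key_length":      c.QKNopeHeadDim + c.QKRopeHeadDim,
		"deepseek2.attention.value_length":    c.VHeadDim,
		"deepseek2.attention.kv_lora_rank":    c.KVLoraRank,
		"deepseek2.attention.q_lora_rank":     c.QLoraRank,
		"deepseek2.leading_dense_block_count": c.FirstKDenseReplace,
		"deepseek2.expert_gating_func":        gatingFunc(c.ScoringFunc),
		"deepseek2.rope.dimension_count":      c.QKRopeHeadDim,
		"deepseek2.rope.freq_base":            ropeBase,
	}
}

func main() {
	// Illustrative values only.
	kv := buildKV(hfConfig{
		NumHiddenLayers: 61, QKNopeHeadDim: 128, QKRopeHeadDim: 64,
		VHeadDim: 128, KVLoraRank: 512, QLoraRank: 1536,
		FirstKDenseReplace: 3, ScoringFunc: "sigmoid",
	})
	fmt.Println(kv["deepseek2.attention.key_length"]) // 192
	fmt.Println(kv["deepseek2.expert_gating_func"])   // 2
}
```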

Special Handling

Expert Tensor Merging

Individual expert weight tensors (pattern: blk.N.mlp.experts.*.{gate,up,down}_proj.weight) are merged into stacked tensors (blk.N.ffn_{gate,up,down}_exps.weight). Each MoE layer therefore ends up with three merged tensors, one each for the gate, up, and down projections, in place of one tensor per expert per projection.

Multi-Token Prediction Layer Skipping

Layers with block indices >= num_hidden_layers are skipped during conversion, as they represent Multi-Token Prediction heads not needed for standard inference.
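
The filter can be sketched with a small regular expression. This is a minimal sketch, assuming the check runs on already-renamed blk.N.* tensor names; the skipLayer name and the exact pattern are illustrative rather than copied from the converter.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// blockRE captures the block index from names of the form "blk.N.<rest>".
var blockRE = regexp.MustCompile(`^blk\.(\d+)\.`)

// skipLayer reports whether a tensor belongs to a block at or beyond
// numHiddenLayers, i.e. to a Multi-Token Prediction head.
func skipLayer(name string, numHiddenLayers int) bool {
	m := blockRE.FindStringSubmatch(name)
	if m == nil {
		return false // not a per-block tensor (e.g. token_embd.weight)
	}
	n, err := strconv.Atoi(m[1])
	if err != nil {
		return false
	}
	return n >= numHiddenLayers
}

func main() {
	const numHiddenLayers = 61 // illustrative value
	for _, name := range []string{
		"token_embd.weight",
		"blk.60.attn_norm.weight",
		"blk.61.attn_norm.weight",
	} {
		fmt.Println(name, skipLayer(name, numHiddenLayers))
	}
}
```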

Tokenizer

The tokenizer's pre-tokenizer setting (tokenizer.ggml.pre) is set to deepseek-v3.

Implementation Notes

The conversion is implemented in convert/convert_deepseek2.go via the deepseek2Model struct. The expert merging logic uses the mergeTensors utility with glob-style patterns to match and stack individual expert tensors. A regex-based layer skipping function filters out extraneous prediction heads beyond the declared layer count.

Related Pages

Implementation:Ollama_Ollama_Convert_DeepSeek2
