Principle:Ollama Ollama GGUF Model Conversion DeepSeekOcr

Knowledge Sources	Ollama
Domains	Model Conversion, OCR
Last Updated	2025-02-15 00:00 GMT

Overview

The DeepSeek OCR conversion variant handles a multimodal architecture that pairs a DeepSeek-style mixture-of-experts (MoE) language model with two vision encoders (CLIP and SAM) for optical character recognition. It converts the complete pipeline from HuggingFace SafeTensors to GGUF format.

Core Concepts

Tensor Name Mapping

The converter applies the following HuggingFace-to-GGUF tensor name replacements:

Language model:

model.embed_tokens -> token_embd
model.layers -> blk
model.norm -> output_norm
lm_head -> output
self_attn.q_proj -> attn_q
self_attn.k_proj -> attn_k
self_attn.v_proj -> attn_v
self_attn.o_proj -> attn_output
mlp.shared_experts.{gate,up,down}_proj -> ffn_{gate,up,down}_shexp
mlp.gate -> ffn_gate_inp

Vision encoder:

model.vision_model -> v
embeddings.patch_embedding -> patch_embd
embeddings.class_embedding -> class_embd
embeddings.position_embedding -> position_embd
transformer.layers -> blk

SAM encoder:

model.sam_model.patch_embed.proj -> s.patch_embd
model.sam_model.pos_embed -> s.position_embd
model.sam_model.blocks -> s.blk
model.sam_model.neck -> s.neck

Projector:

model.projector -> mm
model.image_newline -> mm.image_newline
model.view_seperator -> mm.view_seperator (upstream misspelling preserved)
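
The replacement table above can be sketched with Go's strings.NewReplacer, which scans the input left to right and substitutes non-overlapping matches. This is a minimal illustration, not the converter's actual code: the names tensorNameReplacer and mapTensorName are hypothetical, and the input names in main are assumed examples built from the listed prefixes plus a ".weight" suffix.

```go
package main

import (
	"fmt"
	"strings"
)

// tensorNameReplacer is a hypothetical helper holding the old/new pairs
// listed in this section. Pairs are given as alternating old, new strings.
var tensorNameReplacer = strings.NewReplacer(
	// Language model
	"model.embed_tokens", "token_embd",
	"model.layers", "blk",
	"model.norm", "output_norm",
	"lm_head", "output",
	"self_attn.q_proj", "attn_q",
	"self_attn.k_proj", "attn_k",
	"self_attn.v_proj", "attn_v",
	"self_attn.o_proj", "attn_output",
	"mlp.shared_experts.gate_proj", "ffn_gate_shexp",
	"mlp.shared_experts.up_proj", "ffn_up_shexp",
	"mlp.shared_experts.down_proj", "ffn_down_shexp",
	"mlp.gate", "ffn_gate_inp",
	// CLIP vision encoder
	"model.vision_model", "v",
	"embeddings.patch_embedding", "patch_embd",
	"embeddings.class_embedding", "class_embd",
	"embeddings.position_embedding", "position_embd",
	"transformer.layers", "blk",
	// SAM encoder
	"model.sam_model.patch_embed.proj", "s.patch_embd",
	"model.sam_model.pos_embed", "s.position_embd",
	"model.sam_model.blocks", "s.blk",
	"model.sam_model.neck", "s.neck",
	// Projector (the "seperator" spelling is preserved on purpose)
	"model.projector", "mm",
	"model.image_newline", "mm.image_newline",
	"model.view_seperator", "mm.view_seperator",
)

// mapTensorName rewrites a HuggingFace tensor name into its GGUF form.
func mapTensorName(hf string) string {
	return tensorNameReplacer.Replace(hf)
}

func main() {
	// Assumed example inputs; real checkpoints may use other suffixes.
	for _, name := range []string{
		"model.layers.0.self_attn.q_proj.weight",
		"model.vision_model.embeddings.patch_embedding.weight",
		"model.sam_model.blocks.2.weight",
		"model.view_seperator",
	} {
		fmt.Printf("%s -> %s\n", name, mapTensorName(name))
	}
}
```

Because several fragments can apply to one name (a prefix such as model.layers and a suffix such as self_attn.q_proj), a single pass with one replacer is enough to rewrite the whole name.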

Architecture-Specific Hyperparameters

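The GGUF metadata is written under the deepseekocr architecture in three groups: language-model keys with no extra prefix, CLIP keys under the vision. prefix, and SAM keys under the sam. prefix.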

Language:

block_count, context_length, embedding_length, feed_forward_length
attention.head_count, attention.head_count_kv
expert_count, expert_used_count, leading_dense_block_count

CLIP vision:

vision.block_count, vision.embedding_length, vision.head_count
vision.image_size, vision.patch_size

SAM:

sam.block_count, sam.embedding_length, sam.head_count
sam.global_attention_indexes -- indices of layers using global (non-windowed) attention
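
As a rough illustration of how these names become full metadata keys, the sketch below prefixes each one with the architecture name, following the usual GGUF "<architecture>.<key>" convention. The helper metadataKeys is hypothetical, and the exact key strings the converter emits are an assumption here.

```go
package main

import (
	"fmt"
	"sort"
)

// arch is the architecture name used as the metadata key prefix.
const arch = "deepseekocr"

// metadataKeys is a hypothetical helper that prefixes each hyperparameter
// name with the architecture name and returns the keys in sorted order.
func metadataKeys(names []string) []string {
	keys := make([]string, 0, len(names))
	for _, n := range names {
		keys = append(keys, arch+"."+n)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	// One representative name from each of the three groups.
	names := []string{
		"block_count",
		"attention.head_count_kv",
		"vision.patch_size",
		"sam.global_attention_indexes",
	}
	for _, k := range metadataKeys(names) {
		fmt.Println(k)
	}
}
```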

Special Handling

Dual Vision Encoder

The model has two separate vision encoders: a CLIP-L-14-224 encoder for semantic features and a SAM ViT-B encoder for spatial/structural features. Both are mapped to distinct GGUF namespace prefixes (v. and s.).

Expert Tensor Merging

As with DeepSeek2, individual expert tensors are merged into stacked tensors per layer for gate, up, and down projections.
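
The merging step can be pictured as concatenating per-expert weights in expert-index order, so that a layer with N experts of shape [rows, cols] yields one tensor of shape [N, rows, cols]. The sketch below works on flat float32 slices only; stackExperts is a hypothetical name, and the real converter operates on its own tensor types rather than raw slices.

```go
package main

import "fmt"

// stackExperts concatenates equally sized per-expert weight buffers in
// expert-index order. It is a hypothetical helper for illustration.
func stackExperts(experts [][]float32) ([]float32, error) {
	if len(experts) == 0 {
		return nil, fmt.Errorf("no expert tensors to stack")
	}
	size := len(experts[0])
	out := make([]float32, 0, size*len(experts))
	for i, e := range experts {
		if len(e) != size {
			return nil, fmt.Errorf("expert %d has %d elements, want %d", i, len(e), size)
		}
		out = append(out, e...)
	}
	return out, nil
}

func main() {
	// Two experts, each a 2x2 gate projection stored row-major.
	gate := [][]float32{
		{1, 2, 3, 4},
		{5, 6, 7, 8},
	}
	stacked, err := stackExperts(gate)
	if err != nil {
		panic(err)
	}
	// The stacked buffer holds 2*2*2 = 8 values with logical shape [2, 2, 2].
	fmt.Println(len(stacked), stacked)
}
```

The same routine would run once per layer for each of the gate, up, and down projections.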

Nested Configuration

The HuggingFace config uses nested language_config and vision_config structures, with the vision config further nesting per-encoder parameters under named keys.
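
A nested config of this shape maps naturally onto nested Go structs decoded with encoding/json. The sketch below is illustrative only: language_config and vision_config come from the description above, while the inner field names (num_hidden_layers, hidden_size, layers, width, heads), the encoder keys "clip" and "sam", and all numeric values are placeholders rather than the real config contents.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// encoderConfig holds per-encoder parameters. Field names are assumptions.
type encoderConfig struct {
	Layers int `json:"layers"`
	Width  int `json:"width"`
	Heads  int `json:"heads"`
}

// ocrConfig mirrors the nested layout: a language block plus a vision block
// whose entries are keyed by encoder name.
type ocrConfig struct {
	LanguageConfig struct {
		HiddenLayers int `json:"num_hidden_layers"`
		HiddenSize   int `json:"hidden_size"`
	} `json:"language_config"`
	VisionConfig map[string]encoderConfig `json:"vision_config"`
}

// sampleConfig is a made-up document used only to exercise the decoder.
const sampleConfig = `{
  "language_config": {"num_hidden_layers": 12, "hidden_size": 1280},
  "vision_config": {
    "clip": {"layers": 24, "width": 1024, "heads": 16},
    "sam":  {"layers": 12, "width": 768,  "heads": 12}
  }
}`

// parseConfig decodes a config document into the nested struct.
func parseConfig(data []byte) (ocrConfig, error) {
	var cfg ocrConfig
	err := json.Unmarshal(data, &cfg)
	return cfg, err
}

func main() {
	cfg, err := parseConfig([]byte(sampleConfig))
	if err != nil {
		panic(err)
	}
	fmt.Println("language layers:", cfg.LanguageConfig.HiddenLayers)
	fmt.Println("sam layers:", cfg.VisionConfig["sam"].Layers)
}
```

Keeping the Go struct hierarchy parallel to the JSON hierarchy means each hyperparameter can be read directly from the field that corresponds to its config path.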

Implementation Notes

The conversion is implemented in convert/convert_deepseekocr.go via the deepseekocr struct. The struct uses nested Go structs matching the HuggingFace config hierarchy. The upstream misspelling of "view_seperator" is intentionally preserved to maintain model compatibility.

Related Pages

Implementation:Ollama_Ollama_Convert_DeepSeekOcr
