Principle: Ollama GGUF Model Conversion (GLM OCR)
| Knowledge Sources | |
|---|---|
| Domains | Model Conversion, OCR |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
GLM OCR conversion handles the ChatGLM-based multimodal OCR model. It transforms a vision-language architecture that uses M-RoPE (Multi-dimensional Rotary Position Embedding), splitting fused gate-up projections and temporal patch embeddings and permuting Q/K weights into NeoX-style rotation order for compatibility with the GGML runtime.
Core Concepts
Tensor Name Mapping
The converter applies the following HuggingFace-to-GGUF tensor name replacements:
Vision encoder:
- model.visual.patch_embed.proj -> v.patch_embd
- model.visual.patch_embed.proj_1 -> v.patch_embd_1
- model.visual.blocks -> v.blk
- model.visual.post_layernorm -> v.post_ln
- model.visual.downsample -> mm.patch_merger
- attn.qkv -> attn_qkv
- attn.proj -> attn_out
Merger (multimodal projector):
- model.visual.merger.proj -> mm.model.fc
- model.visual.merger.post_projection_norm -> mm.post_norm
- model.visual.merger.gate_proj -> mm.gate
Language model:
- model.language_model.embed_tokens -> token_embd
- model.language_model.layers -> blk
- self_attn.o_proj -> attn_out
- mlp.gate_up_proj -> ffn_gate_up (then split)
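Taken together, the replacements above can be sketched as an ordered strings.NewReplacer table in Go. This is a simplified illustration, not the converter's exact code; note that the longer patch_embed.proj_1 pattern must be listed before patch_embed.proj so the more specific match wins.

```go
package main

import (
	"fmt"
	"strings"
)

// renameTensor maps a HuggingFace tensor name to its GGUF equivalent by
// applying substring replacements; strings.NewReplacer tries patterns in
// declaration order at each position, so longer variants come first.
var renameTensor = strings.NewReplacer(
	"model.visual.patch_embed.proj_1", "v.patch_embd_1",
	"model.visual.patch_embed.proj", "v.patch_embd",
	"model.visual.blocks", "v.blk",
	"model.visual.post_layernorm", "v.post_ln",
	"model.visual.downsample", "mm.patch_merger",
	"model.visual.merger.proj", "mm.model.fc",
	"model.visual.merger.post_projection_norm", "mm.post_norm",
	"model.visual.merger.gate_proj", "mm.gate",
	"model.language_model.embed_tokens", "token_embd",
	"model.language_model.layers", "blk",
	"self_attn.o_proj", "attn_out",
	"mlp.gate_up_proj", "ffn_gate_up",
	"attn.qkv", "attn_qkv",
	"attn.proj", "attn_out",
)

func main() {
	fmt.Println(renameTensor.Replace("model.visual.blocks.3.attn.qkv.weight"))
	// v.blk.3.attn_qkv.weight
}
```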
Architecture-Specific Hyperparameters
The GGUF metadata is written under the glmocr.* namespace:
Text:
- glmocr.block_count, embedding_length, feed_forward_length, context_length
- glmocr.attention.head_count, head_count_kv, key_length, value_length
- glmocr.rope.freq_base, partial_rotary_factor, mrope_section
Vision:
- glmocr.vision.block_count, embedding_length, out_hidden_size, intermediate_size
- glmocr.vision.image_size, patch_size, spatial_merge_size, temporal_patch_size
- glmocr.vision.min_pixels, max_pixels, image_mean, image_std
Special tokens:
- glmocr.image_token_id, image_start_token_id, image_end_token_id
- glmocr.video_token_id, video_start_token_id, video_end_token_id
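As a shape-only illustration, this metadata could be assembled as a key-value map before the GGUF header is written. Every value below is a placeholder chosen for illustration, not a real GLM OCR hyperparameter; the actual converter reads these from the model's config.json.

```go
package main

import "fmt"

// kv sketches the glmocr.* metadata namespace. All values here are
// placeholders; real values come from the model's configuration.
var kv = map[string]any{
	"glmocr.block_count":          40,
	"glmocr.embedding_length":     4096,
	"glmocr.attention.head_count": 32,
	"glmocr.rope.mrope_section":   []int32{16, 24, 24},
	"glmocr.vision.patch_size":    14,
	"glmocr.vision.image_mean":    []float32{0.5, 0.5, 0.5},
	"glmocr.image_token_id":       int32(100000),
}

func main() {
	for k, v := range kv {
		fmt.Printf("%s = %v\n", k, v)
	}
}
```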
Special Handling
Fused Gate-Up Splitting
The ffn_gate_up tensor is split along dimension 0 into separate ffn_gate and ffn_up tensors using the splitDim utility.
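As an illustration of that split, the following sketch halves a flat row-major [2*rows, cols] buffer along dimension 0, assuming gate rows are stored before up rows; the real converter performs this through its splitDim utility on tensor metadata rather than on raw slices.

```go
package main

import "fmt"

// splitGateUp splits a fused [2*rows, cols] gate-up weight along dimension 0
// into gate and up halves. Assumes gate occupies the first rows block and up
// the second, per the convention described above.
func splitGateUp(fused []float32, rows, cols int) (gate, up []float32) {
	half := rows * cols
	return fused[:half], fused[half : 2*half]
}

func main() {
	// 2 gate rows and 2 up rows of 3 columns each, stored row-major.
	fused := []float32{1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4}
	gate, up := splitGateUp(fused, 2, 3)
	fmt.Println(gate) // [1 1 1 2 2 2]
	fmt.Println(up)   // [3 3 3 4 4 4]
}
```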
Temporal Patch Embedding Splitting
5D patch embedding weights with shape [out, in, 2, H, W] are split along the temporal dimension into two separate 4D tensors (patch_embd_0 and patch_embd_1). Pre-split variants with .0. and .1. suffixes are also handled.
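The temporal split can be sketched on a flat row-major buffer as follows. The layout and loop structure are illustrative of the [out, in, 2, H, W] shape described above, not the converter's exact code:

```go
package main

import "fmt"

// splitTemporal splits a row-major [out, in, 2, h, w] patch-embedding weight
// along the temporal axis into two [out, in, h, w] tensors: for each
// (out, in) pair, the first h*w plane goes to t0 and the second to t1.
func splitTemporal(src []float32, out, in, h, w int) (t0, t1 []float32) {
	plane := h * w
	t0 = make([]float32, 0, out*in*plane)
	t1 = make([]float32, 0, out*in*plane)
	for i := 0; i < out*in; i++ {
		base := i * 2 * plane
		t0 = append(t0, src[base:base+plane]...)
		t1 = append(t1, src[base+plane:base+2*plane]...)
	}
	return t0, t1
}

func main() {
	// out=1, in=2, temporal=2, h=1, w=2: each value encodes its temporal slot.
	src := []float32{0, 0, 1, 1, 0, 0, 1, 1}
	t0, t1 := splitTemporal(src, 1, 2, 1, 2)
	fmt.Println(t0, t1) // [0 0 0 0] [1 1 1 1]
}
```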
Q/K Weight Permutation for M-RoPE
When M-RoPE sections are present, Q and K weight tensors are permuted from interleaved (LLaMA-style) to NeoX ordering using the normalToNeoXRepacker. This reorders rotary dimensions from [0,1,2,3,4,5...] to [0,2,4...,1,3,5...] so that GGML's NeoX-style M-RoPE kernel operates correctly.
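A minimal sketch of the reordering for a single head's rotary dimensions; the real normalToNeoXRepacker additionally handles 2D weights, per-head strides, and partial rotary factors (leaving non-rotary dimensions untouched):

```go
package main

import "fmt"

// normalToNeoX reorders one head's rotary dimensions from interleaved
// (LLaMA-style) order [0,1,2,3,...] to NeoX order [0,2,4,...,1,3,5,...].
// It operates on a 1D slice per head, e.g. a bias segment.
func normalToNeoX(head []float32) []float32 {
	d := len(head)
	out := make([]float32, d)
	for i := 0; i < d/2; i++ {
		out[i] = head[2*i]       // even source dims come first
		out[d/2+i] = head[2*i+1] // odd source dims follow
	}
	return out
}

func main() {
	head := []float32{0, 1, 2, 3, 4, 5}
	fmt.Println(normalToNeoX(head)) // [0 2 4 1 3 5]
}
```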
Preprocessor Config
The converter reads preprocessor_config.json to extract image normalization parameters (mean, std) and size constraints (shortest_edge, longest_edge).
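A hedged sketch of that read, assuming the usual HuggingFace field names; the struct, helper, and sample numbers below are illustrative, not the converter's exact type or the model's real constants:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// preprocessorConfig holds the fields read from preprocessor_config.json.
// Field names follow common HuggingFace conventions.
type preprocessorConfig struct {
	ImageMean []float32 `json:"image_mean"`
	ImageStd  []float32 `json:"image_std"`
	Size      struct {
		ShortestEdge int `json:"shortest_edge"`
		LongestEdge  int `json:"longest_edge"`
	} `json:"size"`
}

func parsePreprocessor(raw []byte) (preprocessorConfig, error) {
	var cfg preprocessorConfig
	err := json.Unmarshal(raw, &cfg)
	return cfg, err
}

// exampleConfig uses placeholder numbers, not GLM OCR's real constants.
var exampleConfig = []byte(`{
	"image_mean": [0.5, 0.5, 0.5],
	"image_std": [0.5, 0.5, 0.5],
	"size": {"shortest_edge": 12544, "longest_edge": 9633792}
}`)

func main() {
	cfg, err := parsePreprocessor(exampleConfig)
	if err != nil {
		panic(err)
	}
	fmt.Println(cfg.ImageMean, cfg.Size.ShortestEdge, cfg.Size.LongestEdge)
	// [0.5 0.5 0.5] 12544 9633792
}
```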
Multi-Token Prediction Layer Skipping
Layers with indices at or beyond num_hidden_layers (the extra multi-token-prediction layers appended after the main stack) are skipped during conversion.
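A sketch of the skip check, assuming the layer index can be parsed from the HuggingFace tensor name; the helper name here is hypothetical:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// skipLayer reports whether a tensor belongs to a layer at or beyond
// numHiddenLayers, i.e. one of the trailing multi-token-prediction layers.
func skipLayer(name string, numHiddenLayers int) bool {
	parts := strings.Split(name, ".")
	for i, p := range parts {
		if p == "layers" && i+1 < len(parts) {
			if n, err := strconv.Atoi(parts[i+1]); err == nil {
				return n >= numHiddenLayers
			}
		}
	}
	return false
}

func main() {
	fmt.Println(skipLayer("model.language_model.layers.40.mlp.gate_up_proj.weight", 40)) // true
	fmt.Println(skipLayer("model.language_model.layers.12.self_attn.o_proj.weight", 40)) // false
}
```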
Implementation Notes
The conversion is implemented in convert/convert_glmocr.go via the glmOcrModel struct which satisfies both ModelConverter and moreParser interfaces. The normalToNeoXRepacker is a standalone function that performs per-head rotary dimension reordering, handling both weight (2D) and bias (1D) tensors with support for partial rotary factors.