Principle: Ollama GGUF Model Conversion (GLM OCR)
| Knowledge Sources | |
|---|---|
| Domains | Model Conversion, OCR |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
GLM OCR conversion handles the ChatGLM-based multimodal OCR model. It transforms a vision-language architecture that uses M-RoPE (Multi-dimensional Rotary Position Embedding), splitting fused gate-up projections and temporal patch embeddings and permuting Q/K weights into NeoX-style rotation order for compatibility with the GGML runtime.
Core Concepts
Tensor Name Mapping
The converter applies the following HuggingFace-to-GGUF tensor name replacements:
Vision encoder:
- model.visual.patch_embed.proj -> v.patch_embd
- model.visual.patch_embed.proj_1 -> v.patch_embd_1
- model.visual.blocks -> v.blk
- model.visual.post_layernorm -> v.post_ln
- model.visual.downsample -> mm.patch_merger
- attn.qkv -> attn_qkv
- attn.proj -> attn_out
Merger (multimodal projector):
- model.visual.merger.proj -> mm.model.fc
- model.visual.merger.post_projection_norm -> mm.post_norm
- model.visual.merger.gate_proj -> mm.gate
Language model:
- model.language_model.embed_tokens -> token_embd
- model.language_model.layers -> blk
- self_attn.o_proj -> attn_out
- mlp.gate_up_proj -> ffn_gate_up (then split)
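Taken together, the replacements above can be sketched as an ordered strings.NewReplacer table in Go. This is a simplified illustration, not the converter's exact code; note that the longer patch_embed.proj_1 pattern must be listed before patch_embed.proj so the more specific match wins.

```go
package main

import (
	"fmt"
	"strings"
)

// renameTensor maps a HuggingFace tensor name to its GGUF equivalent by
// applying substring replacements; strings.NewReplacer tries patterns in
// declaration order at each position, so longer variants come first.
var renameTensor = strings.NewReplacer(
	"model.visual.patch_embed.proj_1", "v.patch_embd_1",
	"model.visual.patch_embed.proj", "v.patch_embd",
	"model.visual.blocks", "v.blk",
	"model.visual.post_layernorm", "v.post_ln",
	"model.visual.downsample", "mm.patch_merger",
	"model.visual.merger.proj", "mm.model.fc",
	"model.visual.merger.post_projection_norm", "mm.post_norm",
	"model.visual.merger.gate_proj", "mm.gate",
	"model.language_model.embed_tokens", "token_embd",
	"model.language_model.layers", "blk",
	"self_attn.o_proj", "attn_out",
	"mlp.gate_up_proj", "ffn_gate_up",
	"attn.qkv", "attn_qkv",
	"attn.proj", "attn_out",
)

func main() {
	fmt.Println(renameTensor.Replace("model.visual.blocks.3.attn.qkv.weight"))
	// v.blk.3.attn_qkv.weight
}
```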
Architecture-Specific Hyperparameters
The GGUF metadata is written under the glmocr.* namespace:
Text:
- glmocr.block_count, embedding_length, feed_forward_length, context_length
- glmocr.attention.head_count, head_count_kv, key_length, value_length
- glmocr.rope.freq_base, partial_rotary_factor, mrope_section
Vision:
- glmocr.vision.block_count, embedding_length, out_hidden_size, intermediate_size
- glmocr.vision.image_size, patch_size, spatial_merge_size, temporal_patch_size
- glmocr.vision.min_pixels, max_pixels, image_mean, image_std
Special tokens:
- glmocr.image_token_id, image_start_token_id, image_end_token_id
- glmocr.video_token_id, video_start_token_id, video_end_token_id
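As a shape-only illustration, this metadata could be assembled as a key-value map before the GGUF header is written. Every value below is a placeholder chosen for illustration, not a real GLM OCR hyperparameter; the actual converter reads these from the model's config.json.

```go
package main

import "fmt"

// kv sketches the glmocr.* metadata namespace. All values here are
// placeholders; real values come from the model's configuration.
var kv = map[string]any{
	"glmocr.block_count":          40,
	"glmocr.embedding_length":     4096,
	"glmocr.attention.head_count": 32,
	"glmocr.rope.mrope_section":   []int32{16, 24, 24},
	"glmocr.vision.patch_size":    14,
	"glmocr.vision.image_mean":    []float32{0.5, 0.5, 0.5},
	"glmocr.image_token_id":       int32(100000),
}

func main() {
	for k, v := range kv {
		fmt.Printf("%s = %v\n", k, v)
	}
}
```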
Special Handling
Fused Gate-Up Splitting
The ffn_gate_up tensor is split along dimension 0 into separate ffn_gate and ffn_up tensors using the splitDim utility.
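As an illustration of that split, the following sketch halves a flat row-major [2*rows, cols] buffer along dimension 0, assuming gate rows are stored before up rows; the real converter performs this through its splitDim utility on tensor metadata rather than on raw slices.

```go
package main

import "fmt"

// splitGateUp splits a fused [2*rows, cols] gate-up weight along dimension 0
// into gate and up halves. Assumes gate occupies the first rows block and up
// the second, per the convention described above.
func splitGateUp(fused []float32, rows, cols int) (gate, up []float32) {
	half := rows * cols
	return fused[:half], fused[half : 2*half]
}

func main() {
	// 2 gate rows and 2 up rows of 3 columns each, stored row-major.
	fused := []float32{1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4}
	gate, up := splitGateUp(fused, 2, 3)
	fmt.Println(gate) // [1 1 1 2 2 2]
	fmt.Println(up)   // [3 3 3 4 4 4]
}
```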
Temporal Patch Embedding Splitting
5D patch embedding weights with shape [out, in, 2, H, W] are split along the temporal dimension into two separate 4D tensors (patch_embd_0 and patch_embd_1). Pre-split variants with .0. and .1. suffixes are also handled.
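The temporal split can be sketched on a flat row-major buffer as follows. The layout and loop structure are illustrative of the [out, in, 2, H, W] shape described above, not the converter's exact code:

```go
package main

import "fmt"

// splitTemporal splits a row-major [out, in, 2, h, w] patch-embedding weight
// along the temporal axis into two [out, in, h, w] tensors: for each
// (out, in) pair, the first h*w plane goes to t0 and the second to t1.
func splitTemporal(src []float32, out, in, h, w int) (t0, t1 []float32) {
	plane := h * w
	t0 = make([]float32, 0, out*in*plane)
	t1 = make([]float32, 0, out*in*plane)
	for i := 0; i < out*in; i++ {
		base := i * 2 * plane
		t0 = append(t0, src[base:base+plane]...)
		t1 = append(t1, src[base+plane:base+2*plane]...)
	}
	return t0, t1
}

func main() {
	// out=1, in=2, temporal=2, h=1, w=2: each value encodes its temporal slot.
	src := []float32{0, 0, 1, 1, 0, 0, 1, 1}
	t0, t1 := splitTemporal(src, 1, 2, 1, 2)
	fmt.Println(t0, t1) // [0 0 0 0] [1 1 1 1]
}
```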
Q/K Weight Permutation for M-RoPE
When M-RoPE sections are present, Q and K weight tensors are permuted from interleaved (LLaMA-style) to NeoX ordering using the normalToNeoXRepacker. This reorders rotary dimensions from [0,1,2,3,4,5...] to [0,2,4...,1,3,5...] so that GGML's NeoX-style M-RoPE kernel operates correctly.
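A minimal sketch of the reordering for a single head's rotary dimensions; the real normalToNeoXRepacker additionally handles 2D weights, per-head strides, and partial rotary factors (leaving non-rotary dimensions untouched):

```go
package main

import "fmt"

// normalToNeoX reorders one head's rotary dimensions from interleaved
// (LLaMA-style) order [0,1,2,3,...] to NeoX order [0,2,4,...,1,3,5,...].
// It operates on a 1D slice per head, e.g. a bias segment.
func normalToNeoX(head []float32) []float32 {
	d := len(head)
	out := make([]float32, d)
	for i := 0; i < d/2; i++ {
		out[i] = head[2*i]       // even source dims come first
		out[d/2+i] = head[2*i+1] // odd source dims follow
	}
	return out
}

func main() {
	head := []float32{0, 1, 2, 3, 4, 5}
	fmt.Println(normalToNeoX(head)) // [0 2 4 1 3 5]
}
```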
Preprocessor Config
The converter reads preprocessor_config.json to extract image normalization parameters (mean, std) and size constraints (shortest_edge, longest_edge).
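A hedged sketch of that read, assuming the usual HuggingFace field names; the struct, helper, and sample numbers below are illustrative, not the converter's exact type or the model's real constants:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// preprocessorConfig holds the fields read from preprocessor_config.json.
// Field names follow common HuggingFace conventions.
type preprocessorConfig struct {
	ImageMean []float32 `json:"image_mean"`
	ImageStd  []float32 `json:"image_std"`
	Size      struct {
		ShortestEdge int `json:"shortest_edge"`
		LongestEdge  int `json:"longest_edge"`
	} `json:"size"`
}

func parsePreprocessor(raw []byte) (preprocessorConfig, error) {
	var cfg preprocessorConfig
	err := json.Unmarshal(raw, &cfg)
	return cfg, err
}

// exampleConfig uses placeholder numbers, not GLM OCR's real constants.
var exampleConfig = []byte(`{
	"image_mean": [0.5, 0.5, 0.5],
	"image_std": [0.5, 0.5, 0.5],
	"size": {"shortest_edge": 12544, "longest_edge": 9633792}
}`)

func main() {
	cfg, err := parsePreprocessor(exampleConfig)
	if err != nil {
		panic(err)
	}
	fmt.Println(cfg.ImageMean, cfg.Size.ShortestEdge, cfg.Size.LongestEdge)
	// [0.5 0.5 0.5] 12544 9633792
}
```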
Multi-Token Prediction Layer Skipping
Layers with indices at or beyond num_hidden_layers (the extra multi-token-prediction layers appended after the main stack) are skipped during conversion.
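A sketch of the skip check, assuming the layer index can be parsed from the HuggingFace tensor name; the helper name here is hypothetical:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// skipLayer reports whether a tensor belongs to a layer at or beyond
// numHiddenLayers, i.e. one of the trailing multi-token-prediction layers.
func skipLayer(name string, numHiddenLayers int) bool {
	parts := strings.Split(name, ".")
	for i, p := range parts {
		if p == "layers" && i+1 < len(parts) {
			if n, err := strconv.Atoi(parts[i+1]); err == nil {
				return n >= numHiddenLayers
			}
		}
	}
	return false
}

func main() {
	fmt.Println(skipLayer("model.language_model.layers.40.mlp.gate_up_proj.weight", 40)) // true
	fmt.Println(skipLayer("model.language_model.layers.12.self_attn.o_proj.weight", 40)) // false
}
```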
Implementation Notes
The conversion is implemented in convert/convert_glmocr.go via the glmOcrModel struct which satisfies both ModelConverter and moreParser interfaces. The normalToNeoXRepacker is a standalone function that performs per-head rotary dimension reordering, handling both weight (2D) and bias (1D) tensors with support for partial rotary factors.