Implementation:Ollama Ollama Mtmd GLM4V
| Knowledge Sources | |
|---|---|
| Domains | Multimodal, VisionEncoder |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
Multimodal graph builder for the GLM-4V vision model, implementing dual convolution patch embedding, M-RoPE, and a patch merger projector.
Description
clip_graph_glm4v::build() constructs a ggml computation graph for the GLM-4V vision encoder. The encoder stage uses a dual conv2d patch embedding (the outputs of two convolution layers are summed), pixel unshuffling to merge spatial dimensions into the embedding dimension, patch bias addition, RMS normalization, bicubic-interpolated position embeddings (via resize_position_embeddings), and M-RoPE with 4-section rotary position encoding. The projector stage applies a conv2d-based patch merger for spatial downsampling, then a fully connected projection layer with LayerNorm and GELU-ERF activation, and finally an FFN block (up/gate/down) that produces language-model-compatible embeddings.
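The pixel unshuffling step mentioned above can be pictured as a space-to-depth rearrangement: each r x r spatial block is folded into the channel dimension. The following is a minimal sketch on a plain row-major [C, H, W] array, using the channel ordering of the common PixelUnshuffle convention. The helper name pixel_unshuffle and the memory layout are assumptions made for illustration; the actual graph builder operates on ggml tensors and may order elements differently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative space-to-depth (pixel unshuffle) on a dense row-major
// [C, H, W] tensor. Hypothetical helper, not the ggml code path.
// Output shape: [C * r * r, H / r, W / r].
std::vector<float> pixel_unshuffle(const std::vector<float> & in,
                                   int C, int H, int W, int r) {
    assert(r > 0 && H % r == 0 && W % r == 0);
    assert(in.size() == static_cast<std::size_t>(C) * H * W);

    const int Ho = H / r;  // output height
    const int Wo = W / r;  // output width
    std::vector<float> out(in.size());

    for (int c = 0; c < C; ++c) {
        for (int i = 0; i < r; ++i) {        // row offset inside the block
            for (int j = 0; j < r; ++j) {    // column offset inside the block
                const int co = c * r * r + i * r + j;  // output channel
                for (int h = 0; h < Ho; ++h) {
                    for (int w = 0; w < Wo; ++w) {
                        out[(co * Ho + h) * Wo + w] =
                            in[(c * H + h * r + i) * W + w * r + j];
                    }
                }
            }
        }
    }
    return out;
}
```

With r = 2, a single-channel 2 x 4 input becomes four channels of size 1 x 2, so the element count is preserved while the spatial resolution is halved in each direction.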
Usage
Automatically selected when the loaded CLIP model uses the PROJECTOR_TYPE_GLM4V projector type.
Code Reference
Source Location
- Repository: Ollama
- File: llama/llama.cpp/tools/mtmd/models/glm4v.cpp
- Lines: 1-120
Signature
struct clip_graph_glm4v : public clip_graph {
    ggml_cgraph * build() override;
};
Import
#include "models.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | clip_model & | Yes | Loaded GLM-4V CLIP model with weights |
| img | clip_image_f32 & | Yes | Preprocessed float image (must have dimensions divisible by patch_size * 2) |
Outputs
| Name | Type | Description |
|---|---|---|
| (return value) | ggml_cgraph * | Computation graph producing spatially downsampled LLM embeddings |
Usage Examples
// Instantiated internally by clip.cpp for GLM-4V models
clip_graph_glm4v graph(ctx, img);
ggml_cgraph * gf = graph.build();
// Output: [n_mmproj_embd, n_patches / merge^2] embeddings
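The output shape in the comment above follows from simple arithmetic on the image size. The sketch below works through it, assuming a merge factor of 2 (inferred from the "divisible by patch_size * 2" input constraint) and an example patch size of 14. Both values, along with the names glm4v_shape_sketch and glm4v_output_shape, are hypothetical and not taken from the source file; the real values come from the loaded model's hyperparameters.

```cpp
#include <cassert>

// Hypothetical shape summary for one preprocessed image.
struct glm4v_shape_sketch {
    int patches_x;  // patches along the width
    int patches_y;  // patches along the height
    int n_patches;  // patches_x * patches_y
    int n_tokens;   // n_patches / merge^2, the second output dimension
};

// Hypothetical helper: derives patch and token counts from the image size.
// Requires both dimensions to be divisible by patch_size * merge.
glm4v_shape_sketch glm4v_output_shape(int width, int height,
                                      int patch_size, int merge) {
    assert(patch_size > 0 && merge > 0);
    const int block = patch_size * merge;
    assert(width % block == 0 && height % block == 0);

    glm4v_shape_sketch s;
    s.patches_x = width / patch_size;
    s.patches_y = height / patch_size;
    s.n_patches = s.patches_x * s.patches_y;
    s.n_tokens  = s.n_patches / (merge * merge);
    return s;
}
```

For a 336 x 336 image with the assumed patch size of 14 and merge factor of 2, this gives a 24 x 24 patch grid (576 patches) and 144 output embeddings.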