Implementation:Ollama Ollama Mtmd CogVLM
| Knowledge Sources | |
|---|---|
| Domains | Multimodal, VisionEncoder |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
Multimodal graph builder for the CogVLM vision model. It constructs the computation graph for the ViT encoder and the SwiGLU-based projector.
Description
Implements clip_graph_cogvlm::build(), which constructs a ggml computation graph for the CogVLM vision encoder. The encoder concatenates a class (CLS) embedding with the patch embeddings and adds learned position embeddings. Each transformer layer then uses fused QKV attention followed by post-attention layer normalization, and a SiLU-gated feed-forward block followed by post-FFN normalization. The projector stage removes the CLS token, applies a linear projection, and then applies post-FC normalization and GELU. A SwiGLU gate follows, in which the h_to_4h and gate branches are merged via ggml_swiglu_split, and a final down-projection completes the projector. Beginning-of-image (BOI) and end-of-image (EOI) tokens are then concatenated to the output.
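The SwiGLU merge in the projector can be illustrated without ggml. The following is a minimal sketch in plain C++, assuming that ggml_swiglu_split(ctx, a, b) computes silu(a) * b element-wise. Both the helper name swiglu_split and the choice of which branch receives the SiLU are assumptions made for illustration, not details confirmed by this page.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// SiLU activation: x * sigmoid(x), written as x / (1 + exp(-x)).
inline float silu(float x) {
    return x / (1.0f + std::exp(-x));
}

// Hypothetical stand-in for the SwiGLU merge of two equally sized branches:
//   out[i] = silu(gate[i]) * up[i]
// Assumption: this mirrors the element-wise behaviour of ggml_swiglu_split;
// it operates on flat float vectors instead of ggml tensors.
std::vector<float> swiglu_split(const std::vector<float> & gate,
                                const std::vector<float> & up) {
    assert(gate.size() == up.size());
    std::vector<float> out(gate.size());
    for (std::size_t i = 0; i < gate.size(); ++i) {
        out[i] = silu(gate[i]) * up[i];
    }
    return out;
}
```

In the real graph, the two inputs would be the results of the gate and h_to_4h matrix multiplications on the same normalized activations, and the merged result would feed the down-projection.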
Usage
This graph builder is selected automatically when the loaded CLIP model uses the PROJECTOR_TYPE_COGVLM projector type.
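Selection by projector type can be pictured as a small factory. The sketch below is self-contained and uses hypothetical stand-in types (projector_type, graph_builder, make_builder); it does not reproduce the actual dispatch code in clip.cpp, which this page does not show.

```cpp
#include <memory>
#include <string>

// Hypothetical stand-ins for illustration only; not the real clip.cpp types.
enum class projector_type { mlp, cogvlm };

struct graph_builder {
    virtual ~graph_builder() = default;
    virtual std::string name() const = 0;
};

struct generic_builder : graph_builder {
    std::string name() const override { return "generic"; }
};

struct cogvlm_builder : graph_builder {
    std::string name() const override { return "cogvlm"; }
};

// Pick a builder from the projector type recorded in the loaded model.
std::unique_ptr<graph_builder> make_builder(projector_type type) {
    switch (type) {
        case projector_type::cogvlm:
            return std::make_unique<cogvlm_builder>();
        default:
            return std::make_unique<generic_builder>();
    }
}
```

The point of the pattern is that callers never name the CogVLM builder directly; the projector type stored with the model decides which graph is built.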
Code Reference
Source Location
- Repository: Ollama
- File: llama/llama.cpp/tools/mtmd/models/cogvlm.cpp
- Lines: 1-98
Signature
struct clip_graph_cogvlm : public clip_graph {
    ggml_cgraph * build() override;
};
Import
#include "models.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | clip_model & | Yes | Loaded CogVLM CLIP model with weights |
| img | clip_image_f32 & | Yes | Preprocessed float image tensor |
Outputs
| Name | Type | Description |
|---|---|---|
| (return value) | ggml_cgraph * | Computation graph producing LLM-compatible embeddings with BOI/EOI tokens |
Usage Examples
// Instantiated internally by clip.cpp for CogVLM models
clip_graph_cogvlm graph(ctx, img);
ggml_cgraph * gf = graph.build();
// Produces embeddings: [BOI, patch_1, ..., patch_N, EOI]
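The token bookkeeping implied by the description can be checked with a small sketch: the encoder sees the patches plus one CLS token, the projector drops the CLS token, and BOI and EOI are added around the projected patches. The helper names below are hypothetical and operate on labels rather than tensors.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Number of positions inside the ViT encoder: all patches plus one CLS token.
std::size_t encoder_positions(std::size_t n_patches) {
    return n_patches + 1;
}

// Hypothetical helper: labels of the output rows after the projector,
// in order [BOI, patch_1, ..., patch_N, EOI]. The CLS token is not present
// because the projector stage removes it before projection.
std::vector<std::string> cogvlm_output_layout(std::size_t n_patches) {
    std::vector<std::string> rows;
    rows.reserve(n_patches + 2);
    rows.push_back("BOI");
    for (std::size_t i = 1; i <= n_patches; ++i) {
        rows.push_back("patch_" + std::to_string(i));
    }
    rows.push_back("EOI");
    return rows;
}
```

For an image split into N patches, the language model therefore receives N + 2 embedding rows from this graph, while the encoder internally processes N + 1 positions.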