Implementation:Ollama Ollama Clip Graph
| Knowledge Sources | |
|---|---|
| Domains | Multimodal, CLIP |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
Header declaring the abstract base class for CLIP vision and audio encoder computation graphs. It provides the building blocks shared by all multimodal model graph builders.
Description
Defines the clip_graph struct, the base for all vision and audio encoder graph builders. It holds references to the model, hyperparameters, and image data, along with the patch dimensions and the ggml context. Subclasses override the pure virtual build() method to construct model-specific computation graphs. The struct also provides shared utility methods:
- build_vit: generic Vision Transformer graph
- build_inp_raw: raw input tensor for the image
- build_inp: patch embeddings obtained by applying conv2d to the raw input
- build_attn: multi-head attention sub-graph
- build_ffn: feed-forward sub-graph
- build_norm: layer or RMS normalization
- build_rope_2d: 2D rotary position embedding
- build_patch_merge_permute: pixel shuffle/unshuffle
- build_stack: audio frame stacking
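To make the layer versus RMS distinction behind build_norm concrete, here is a minimal sketch that works on a plain float vector rather than ggml tensors. The names normalize_row and NormKind are hypothetical, and the scale-then-shift order is an assumption based on the mw and mb parameters in the signature, not a copy of the actual implementation.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the norm_type selector.
enum class NormKind { Layer, Rms };

// Normalize one embedding row, then apply an optional per-channel scale (w)
// and shift (b). An empty vector means "no weight" or "no bias".
std::vector<float> normalize_row(const std::vector<float> & x,
                                 const std::vector<float> & w,
                                 const std::vector<float> & b,
                                 NormKind kind, float eps) {
    assert(!x.empty());
    assert(w.empty() || w.size() == x.size());
    assert(b.empty() || b.size() == x.size());

    const float n = static_cast<float>(x.size());

    // Layer norm centers on the mean; RMS norm leaves the values uncentered.
    float mean = 0.0f;
    if (kind == NormKind::Layer) {
        for (float v : x) mean += v;
        mean /= n;
    }

    // Variance around the mean (layer) or mean square (RMS, since mean == 0).
    float sq = 0.0f;
    for (float v : x) sq += (v - mean) * (v - mean);
    const float inv = 1.0f / std::sqrt(sq / n + eps);

    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = (x[i] - mean) * inv;
        if (!w.empty()) y[i] *= w[i];
        if (!b.empty()) y[i] += b[i];
    }
    return y;
}
```

With NormKind::Layer the output row has zero mean; with NormKind::Rms only the magnitude is rescaled, so the sign and relative proportions of the inputs are preserved.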
Usage
Subclassed by all model-specific graph builders (llava, qwen2vl, qwen3vl, siglip, whisper-enc, cogvlm, glm4v, llama4, minicpmv).
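This page does not show how calling code picks a subclass. One plausible arrangement, given that the struct carries a proj_type field, is a factory that switches on the projector type. The sketch below uses stand-in types only; Graph, Projector, GraphBuilder, the concrete builders, and make_builder are all hypothetical names, not part of the header.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Stand-in for ggml_cgraph: here a graph is just a label.
struct Graph { std::string name; };

// Stand-in for projector_type; the enumerators are illustrative.
enum class Projector { Siglip, Qwen2vl, WhisperEnc };

// Stand-in for clip_graph: one pure virtual build() per model family.
struct GraphBuilder {
    Graph gf;
    virtual ~GraphBuilder() = default;
    virtual Graph * build() = 0;
};

struct SiglipBuilder : GraphBuilder {
    Graph * build() override { gf.name = "siglip"; return &gf; }
};

struct Qwen2vlBuilder : GraphBuilder {
    Graph * build() override { gf.name = "qwen2vl"; return &gf; }
};

struct WhisperEncBuilder : GraphBuilder {
    Graph * build() override { gf.name = "whisper-enc"; return &gf; }
};

// Pick the concrete builder for a projector type. The caller only ever
// sees the base-class interface.
std::unique_ptr<GraphBuilder> make_builder(Projector p) {
    switch (p) {
        case Projector::Siglip:     return std::make_unique<SiglipBuilder>();
        case Projector::Qwen2vl:    return std::make_unique<Qwen2vlBuilder>();
        case Projector::WhisperEnc: return std::make_unique<WhisperEncBuilder>();
    }
    assert(false && "unknown projector type");
    return nullptr;
}
```

The virtual destructor in the base matters here: the builder is owned through a base-class pointer, so destruction has to dispatch to the derived type.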
Code Reference
Source Location
- Repository: Ollama
- File: llama/llama.cpp/tools/mtmd/clip-graph.h
- Lines: 1-121
Signature
struct clip_graph {
    const clip_model & model;
    const clip_hparams & hparams;
    projector_type proj_type;
    const clip_image_f32 & img;
    const int patch_size, n_patches_x, n_patches_y, n_patches;
    const int n_embd, n_head, d_head, n_layer, n_mmproj_embd;
    const float eps, kq_scale;
    ggml_context * ctx0;
    ggml_cgraph * gf;

    clip_graph(clip_ctx * ctx, const clip_image_f32 & img);
    virtual ~clip_graph() = default;
    virtual ggml_cgraph * build() = 0;

    ggml_tensor * build_vit(ggml_tensor * inp, int64_t n_pos,
                            norm_type norm_t, ffn_op_type ffn_t,
                            ggml_tensor * learned_pos_embd,
                            std::function<ggml_tensor *(ggml_tensor *, const clip_layer &)> add_pos);
    ggml_tensor * build_inp();
    ggml_tensor * build_inp_raw(int channels = 3);
    ggml_tensor * build_norm(ggml_tensor * cur, ggml_tensor * mw, ggml_tensor * mb,
                             norm_type type, float norm_eps, int il) const;
    ggml_tensor * build_ffn(ggml_tensor * cur, ...);
    ggml_tensor * build_attn(ggml_tensor * wo, ggml_tensor * wo_b,
                             ggml_tensor * q_cur, ggml_tensor * k_cur, ggml_tensor * v_cur,
                             ggml_tensor * kq_mask, float kq_scale, int il) const;
    ggml_tensor * build_rope_2d(ggml_context * ctx0, ggml_tensor * cur,
                                ggml_tensor * pos_a, ggml_tensor * pos_b,
                                const float freq_base, const bool interleave_freq);
    ggml_tensor * build_patch_merge_permute(ggml_tensor * cur, int scale_factor);
    ggml_tensor * build_stack(ggml_tensor * cur, int32_t stack_factor, int32_t n_embed);
};
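The const integer and float members are fixed at construction time. The sketch below shows how such values are conventionally related in a Vision Transformer: the patch grid comes from the image size divided by the patch size, the per-head width from the embedding width divided by the head count, and the attention scale from the inverse square root of the head width. These relationships are an assumption about the constructor, which is not shown here; GraphDims and derive_dims are hypothetical names.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in that reuses the field names from clip_graph.
struct GraphDims {
    int patch_size, n_patches_x, n_patches_y, n_patches;
    int n_embd, n_head, d_head;
    float kq_scale;
};

// Derive the dependent dimensions from the image size and hyperparameters.
GraphDims derive_dims(int img_w, int img_h, int patch_size, int n_embd, int n_head) {
    assert(patch_size > 0 && n_head > 0);
    assert(n_embd % n_head == 0);  // heads must split the embedding evenly

    GraphDims d{};
    d.patch_size  = patch_size;
    d.n_patches_x = img_w / patch_size;            // patches per row
    d.n_patches_y = img_h / patch_size;            // patches per column
    d.n_patches   = d.n_patches_x * d.n_patches_y;
    d.n_embd      = n_embd;
    d.n_head      = n_head;
    d.d_head      = n_embd / n_head;               // width of one attention head
    d.kq_scale    = 1.0f / std::sqrt(static_cast<float>(d.d_head));
    return d;
}
```

For a 336x336 image with 14-pixel patches, an embedding width of 1024, and 16 heads, this gives a 24x24 grid of 576 patches, a head width of 64, and an attention scale of 0.125.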
Import
#include "clip-graph.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| ctx | clip_ctx * | Yes | Initialized CLIP context with model and backends |
| img | const clip_image_f32 & | Yes | Preprocessed float image to encode |
Outputs
| Name | Type | Description |
|---|---|---|
| build() return value | ggml_cgraph * | Built computation graph ready for backend evaluation |
Usage Examples
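// Subclass pattern for a new model
struct clip_graph_my_model : public clip_graph {
    // Inherit the (ctx, img) constructor from the base class.
    using clip_graph::clip_graph;

    ggml_cgraph * build() override {
        // Patch embeddings for the input image.
        ggml_tensor * inp = build_inp();

        // Generic ViT encoder with learned position embeddings and no
        // per-layer position hook (add_pos is nullptr).
        ggml_tensor * cur = build_vit(inp, n_patches,
                                      NORM_TYPE_NORMAL, FFN_GELU,
                                      model.position_embeddings, nullptr);

        // Apply the projection weights to the encoder output.
        cur = ggml_mul_mat(ctx0, model.projection, cur);

        ggml_build_forward_expand(gf, cur);
        return gf;
    }
};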
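The example above passes nullptr for the add_pos parameter of build_vit. A std::function constructed from nullptr is empty and converts to false, so a builder can test it before calling. The sketch below shows that pattern with stand-in types; Tensor, Layer, apply_pos, and shift_by_offset are hypothetical, and where exactly build_vit invokes the hook is not specified on this page.

```cpp
#include <cassert>
#include <functional>
#include <vector>

struct Tensor { std::vector<float> v; };  // stand-in for ggml_tensor
struct Layer  { float offset; };          // stand-in for clip_layer

// Same shape as the add_pos parameter, with the stand-in types.
using AddPos = std::function<Tensor *(Tensor *, const Layer &)>;

// Hypothetical per-layer step: call the hook only when one was supplied.
Tensor * apply_pos(Tensor * cur, const Layer & layer, const AddPos & add_pos) {
    assert(cur != nullptr);
    return add_pos ? add_pos(cur, layer) : cur;
}

// Example hook: shifts every element by a per-layer offset, in place.
Tensor * shift_by_offset(Tensor * cur, const Layer & layer) {
    for (float & x : cur->v) x += layer.offset;
    return cur;
}
```

Passing nullptr leaves the tensor untouched, while passing shift_by_offset (or a lambda with the same signature) applies the per-layer transform. A model that needs rotary embeddings could supply a lambda that wraps its position-encoding step in the same way.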