Implementation:Ggml org Llama cpp CLIP Graph
| Knowledge Sources | |
|---|---|
| Domains | Multimodal, Vision |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Defines the abstract base class `clip_graph` for building CLIP vision encoder computation graphs.
Description
The `clip_graph` struct holds references to the model, its hyperparameters, and a single input image. It also stores values precomputed at construction: patch size and patch counts, embedding size, head count and per-head dimension, layer count, projector embedding size, the normalization epsilon, the attention scale, and the flash attention type. The struct owns a ggml context and computation graph, and declares a pure virtual `build()` method that subclasses implement for model-specific graph construction.
Utility methods declared on the struct:
- `build_vit()`: builds a standard Vision Transformer graph
- `build_inp()` and `build_inp_raw()`: construct the patch input and the raw image input (three channels by default)
- `build_norm()` and `build_ffn()`: common normalization and feed-forward layer operations
- `build_attn()`: attention
- `build_rope_2d()`: 2D rotary position embeddings
- `resize_position_embeddings()`: adapts position embeddings for dynamic resolution
- `build_patch_merge_permute()`: pixel shuffle operations
- `build_stack()`: stacks embeddings by a given `stack_factor`
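The precomputed values can be read as simple functions of the image size and the hyperparameters. The following is a minimal sketch under stated assumptions, not the constructor from `clip-graph.h`: `vit_dims` and `compute_vit_dims` are hypothetical names, and the formulas (integer division for patch counts, `1/sqrt(d_head)` for the attention scale) are the conventional ViT definitions, assumed here rather than copied from the source.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for the precomputed members of clip_graph.
struct vit_dims {
    int   n_patches_x, n_patches_y, n_patches;
    int   d_head;
    float kq_scale;
};

// Assumption: patch counts come from integer division of the image size by
// the patch size, and the attention scale is 1/sqrt(d_head).
vit_dims compute_vit_dims(int img_w, int img_h, int patch_size, int n_embd, int n_head) {
    assert(patch_size > 0 && n_head > 0);
    assert(n_embd % n_head == 0);
    vit_dims d{};
    d.n_patches_x = img_w / patch_size;
    d.n_patches_y = img_h / patch_size;
    d.n_patches   = d.n_patches_x * d.n_patches_y;
    d.d_head      = n_embd / n_head;
    d.kq_scale    = 1.0f / std::sqrt((float) d.d_head);
    return d;
}
```

For example, a 336x336 image with `patch_size = 14`, `n_embd = 1024`, and `n_head = 16` gives a 24x24 patch grid (576 patches), `d_head = 64`, and `kq_scale = 0.125`.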
Usage
Use this header when implementing or extending vision model architectures (SigLIP, Pixtral, Qwen2VL, etc.) within the multimodal pipeline. Subclass `clip_graph` and override the `build()` method for architecture-specific graph construction, or use the provided `build_vit()` utility for standard ViT models.
Code Reference
Source Location
- Repository: Ggml_org_Llama_cpp
- File: tools/mtmd/clip-graph.h
- Lines: 1-117
Signature
struct clip_graph {
    const clip_model & model;
    const clip_hparams & hparams;
    projector_type proj_type;
    const clip_image_f32 & img;

    // precomputed values
    const int patch_size, n_patches_x, n_patches_y, n_patches;
    const int n_embd, n_head, d_head, n_layer, n_mmproj_embd;
    const float eps, kq_scale;
    const clip_flash_attn_type flash_attn_type;

    ggml_context_ptr ctx0_ptr;
    ggml_context * ctx0;
    ggml_cgraph * gf;

    clip_graph(clip_ctx * ctx, const clip_image_f32 & img);
    virtual ~clip_graph() = default;
    virtual ggml_cgraph * build() = 0;

    ggml_tensor * build_vit(ggml_tensor * inp, int64_t n_pos, norm_type norm_t, ffn_op_type ffn_t, ggml_tensor * learned_pos_embd, std::function<ggml_tensor *(ggml_tensor *, const clip_layer &)> add_pos);
    ggml_tensor * build_inp();
    ggml_tensor * build_inp_raw(int channels = 3);
    ggml_tensor * build_norm(ggml_tensor * cur, ggml_tensor * mw, ggml_tensor * mb, norm_type type, float norm_eps, int il) const;
    ggml_tensor * build_ffn(ggml_tensor * cur, ggml_tensor * up, ggml_tensor * up_b, ggml_tensor * gate, ggml_tensor * gate_b, ggml_tensor * down, ggml_tensor * down_b, ffn_op_type type_op, int il) const;
    ggml_tensor * build_attn(ggml_tensor * wo, ggml_tensor * wo_b, ggml_tensor * q_cur, ggml_tensor * k_cur, ggml_tensor * v_cur, ggml_tensor * kq_mask, float kq_scale, int il) const;
    ggml_tensor * resize_position_embeddings(uint32_t interpolation_mode = DEFAULT_INTERPOLATION_MODE);
    ggml_tensor * build_rope_2d(ggml_context * ctx0, ggml_tensor * cur, ggml_tensor * pos_a, ggml_tensor * pos_b, float freq_base, bool interleave_freq);
    ggml_tensor * build_patch_merge_permute(ggml_tensor * cur, int scale_factor);
    ggml_tensor * build_stack(ggml_tensor * cur, int32_t stack_factor, int32_t n_embed);
};
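To illustrate the idea behind `build_rope_2d()`, which takes two position tensors (`pos_a`, `pos_b`), here is a generic axial 2D RoPE on a single head vector. This is a sketch under assumptions, not the ggml implementation: `rope_1d` and `rope_2d` are hypothetical helpers, the adjacent-pair rotation layout is one common convention chosen for illustration, and the `interleave_freq` option is not modeled.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Rotate adjacent pairs of v[begin, begin + n) by angles pos * theta_i,
// with theta_i = freq_base^(-2i/n). Pair layout is an assumption.
static void rope_1d(std::vector<float> & v, int begin, int n, float pos, float freq_base) {
    for (int i = 0; i < n / 2; ++i) {
        const float theta = pos * std::pow(freq_base, -2.0f * i / n);
        const float c  = std::cos(theta);
        const float s  = std::sin(theta);
        const float x0 = v[begin + 2 * i];
        const float x1 = v[begin + 2 * i + 1];
        v[begin + 2 * i]     = x0 * c - x1 * s;
        v[begin + 2 * i + 1] = x0 * s + x1 * c;
    }
}

// Axial 2D RoPE: the first half of the head is rotated by the position along
// one axis (pos_a), the second half by the position along the other (pos_b).
void rope_2d(std::vector<float> & head, float pos_a, float pos_b, float freq_base) {
    assert(head.size() % 4 == 0);
    const int half = (int) head.size() / 2;
    rope_1d(head, 0,    half, pos_a, freq_base);
    rope_1d(head, half, half, pos_b, freq_base);
}
```

Because each pair is only rotated, the vector norm is preserved, and a patch at position (0, 0) is left unchanged.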
Import
#include "clip-graph.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| ctx | clip_ctx * | Yes | CLIP context containing the model and configuration |
| img | const clip_image_f32 & | Yes | Single preprocessed input image in float32 format |
Outputs
| Name | Type | Description |
|---|---|---|
| build() | ggml_cgraph * | Constructed computation graph for the vision encoder forward pass |
| build_vit() | ggml_tensor * | Output tensor from a standard ViT graph |
| build_inp() | ggml_tensor * | Tensor of shape [n_embd, n_patches] after conv2d patch extraction |
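The `[n_embd, n_patches]` layout produced by `build_inp()` is the kind of input a pixel shuffle step such as `build_patch_merge_permute()` operates on. The sketch below shows the shape bookkeeping on a plain array: each `s x s` block of patch embeddings is merged into one token, so the token count drops by `s*s` while the per-token width grows by `s*s`. `merged_tokens` and `patch_merge` are hypothetical names, and the ordering of features inside a merged token is an assumption of this sketch, not taken from llama.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container: a row-major grid of h x w tokens with c features
// each, stored as data[(y * w + x) * c + k].
struct merged_tokens {
    int w, h, c;
    std::vector<float> data;
};

// Merge each s x s block of tokens into one token of width c * s * s.
merged_tokens patch_merge(const std::vector<float> & tokens, int w, int h, int c, int s) {
    assert(s > 0 && w % s == 0 && h % s == 0);
    assert(tokens.size() == (size_t) w * h * c);
    merged_tokens out{w / s, h / s, c * s * s, std::vector<float>(tokens.size())};
    for (int oy = 0; oy < out.h; ++oy) {
        for (int ox = 0; ox < out.w; ++ox) {
            float * dst = &out.data[((size_t) oy * out.w + ox) * out.c];
            for (int dy = 0; dy < s; ++dy) {
                for (int dx = 0; dx < s; ++dx) {
                    const int iy = oy * s + dy;
                    const int ix = ox * s + dx;
                    const float * src = &tokens[((size_t) iy * w + ix) * c];
                    // Block-local order (dy, dx) is an assumption of this sketch.
                    for (int k = 0; k < c; ++k) {
                        dst[(dy * s + dx) * c + k] = src[k];
                    }
                }
            }
        }
    }
    return out;
}
```

With a 4x4 grid, `c = 2`, and `s = 2`, the result is a 2x2 grid of tokens with 8 features each; the total number of values is unchanged.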
Usage Examples
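// Subclass clip_graph for a custom vision model
struct my_clip_graph : clip_graph {
    my_clip_graph(clip_ctx * ctx, const clip_image_f32 & img)
        : clip_graph(ctx, img) {}

    ggml_cgraph * build() override {
        ggml_tensor * inp = build_inp();
        // Use build_vit for a standard ViT architecture.
        // NORM_TYPE_NORMAL selects LayerNorm; no add_pos callback is supplied.
        ggml_tensor * output = build_vit(inp, n_patches, NORM_TYPE_NORMAL, FFN_GELU,
                                         model.position_embeddings, nullptr);
        // Add the output and the nodes it depends on to the graph before returning it.
        ggml_build_forward_expand(gf, output);
        return gf;
    }
};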