Implementation:Ggml org Llama cpp CLIP Graph

From Leeroopedia
Knowledge Sources
Domains Multimodal, Vision
Last Updated 2026-02-15 00:00 GMT

Overview

Defines the abstract base class `clip_graph` for building CLIP vision encoder computation graphs.

Description

The `clip_graph` struct holds references to the model, its hyperparameters, and a single input image, along with values precomputed at construction: patch size and patch counts, embedding size, head count and per-head dimension, layer count, normalization epsilon, attention scale, and the flash attention type. It owns a ggml context and computation graph, and declares a pure virtual `build()` method that subclasses implement for model-specific graph construction. Utility methods cover the common building blocks: `build_vit()` for standard Vision Transformer graphs, `build_inp()` for input patch construction, `build_norm()` and `build_ffn()` for normalization and feed-forward layers, `build_attn()` for attention, `build_rope_2d()` for 2D rotary position embeddings, `resize_position_embeddings()` for dynamic resolution, and `build_patch_merge_permute()` for pixel shuffle operations. The struct also declares `build_inp_raw()`, which takes a channel count (3 by default), and `build_stack()`.
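
The precomputed members can be pictured as simple functions of the image size and the hyperparameters. The following is a standalone sketch rather than the header's own code: `vit_dims` and `derive_dims` are hypothetical names, and the formulas (integer division of the image size by the patch size, `d_head = n_embd / n_head`, `kq_scale = 1 / sqrt(d_head)`) are the conventional ViT choices, assumed here rather than taken from the source.

```cpp
#include <cmath>

// Hypothetical stand-in for some of the values clip_graph precomputes.
struct vit_dims {
    int   n_patches_x;
    int   n_patches_y;
    int   n_patches;
    int   d_head;
    float kq_scale;
};

// Assumes the image size is a multiple of patch_size and that
// n_embd is a multiple of n_head (both are assumptions of this sketch).
inline vit_dims derive_dims(int img_w, int img_h, int patch_size, int n_embd, int n_head) {
    vit_dims d{};
    d.n_patches_x = img_w / patch_size;             // patches per row
    d.n_patches_y = img_h / patch_size;             // patches per column
    d.n_patches   = d.n_patches_x * d.n_patches_y;  // total patch tokens
    d.d_head      = n_embd / n_head;                // channels per attention head
    d.kq_scale    = 1.0f / std::sqrt(static_cast<float>(d.d_head));
    return d;
}
```

Under these assumptions, a 224x224 image with 14-pixel patches, an embedding size of 768, and 12 heads gives a 16x16 grid (256 patches), `d_head = 64`, and `kq_scale = 0.125`.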

Usage

Use this header when implementing or extending vision model architectures (SigLIP, Pixtral, Qwen2VL, etc.) within the multimodal pipeline. Subclass `clip_graph` and override the `build()` method for architecture-specific graph construction, or use the provided `build_vit()` utility for standard ViT models.

Code Reference

Source Location

Signature

struct clip_graph {
    const clip_model & model;
    const clip_hparams & hparams;
    projector_type proj_type;
    const clip_image_f32 & img;

    // precomputed values
    const int patch_size, n_patches_x, n_patches_y, n_patches;
    const int n_embd, n_head, d_head, n_layer, n_mmproj_embd;
    const float eps, kq_scale;
    const clip_flash_attn_type flash_attn_type;

    ggml_context_ptr ctx0_ptr;
    ggml_context * ctx0;
    ggml_cgraph * gf;

    clip_graph(clip_ctx * ctx, const clip_image_f32 & img);
    virtual ~clip_graph() = default;
    virtual ggml_cgraph * build() = 0;

    ggml_tensor * build_vit(ggml_tensor * inp, int64_t n_pos, norm_type norm_t, ffn_op_type ffn_t, ggml_tensor * learned_pos_embd, std::function<ggml_tensor *(ggml_tensor *, const clip_layer &)> add_pos);
    ggml_tensor * build_inp();
    ggml_tensor * build_inp_raw(int channels = 3);
    ggml_tensor * build_norm(ggml_tensor * cur, ggml_tensor * mw, ggml_tensor * mb, norm_type type, float norm_eps, int il) const;
    ggml_tensor * build_ffn(ggml_tensor * cur, ggml_tensor * up, ggml_tensor * up_b, ggml_tensor * gate, ggml_tensor * gate_b, ggml_tensor * down, ggml_tensor * down_b, ffn_op_type type_op, int il) const;
    ggml_tensor * build_attn(ggml_tensor * wo, ggml_tensor * wo_b, ggml_tensor * q_cur, ggml_tensor * k_cur, ggml_tensor * v_cur, ggml_tensor * kq_mask, float kq_scale, int il) const;
    ggml_tensor * resize_position_embeddings(uint32_t interpolation_mode = DEFAULT_INTERPOLATION_MODE);
    ggml_tensor * build_rope_2d(ggml_context * ctx0, ggml_tensor * cur, ggml_tensor * pos_a, ggml_tensor * pos_b, float freq_base, bool interleave_freq);
    ggml_tensor * build_patch_merge_permute(ggml_tensor * cur, int scale_factor);
    ggml_tensor * build_stack(ggml_tensor * cur, int32_t stack_factor, int32_t n_embed);
};
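
`build_rope_2d()` takes two position tensors, `pos_a` and `pos_b`. A common formulation of 2D rotary embeddings rotates the first half of each head's channels by one coordinate and the second half by the other. The sketch below illustrates that idea on a plain vector; it is not the ggml implementation. `rope_1d` and `rope_2d` are hypothetical helpers, the adjacent-pair layout and frequency schedule are assumptions, and the `interleave_freq` option is not modeled.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Rotate adjacent channel pairs of v[begin, begin + len) in place.
// Pair starting at even offset i uses theta = pos * freq_base^(-i / len).
// Hypothetical helper: pair layout and frequency schedule are assumptions.
inline void rope_1d(std::vector<float> & v, std::size_t begin, std::size_t len,
                    float pos, float freq_base) {
    for (std::size_t i = 0; i + 1 < len; i += 2) {
        const float theta = pos * std::pow(freq_base, -static_cast<float>(i) / static_cast<float>(len));
        const float c  = std::cos(theta);
        const float s  = std::sin(theta);
        const float x0 = v[begin + i];
        const float x1 = v[begin + i + 1];
        v[begin + i]     = x0 * c - x1 * s;
        v[begin + i + 1] = x0 * s + x1 * c;
    }
}

// Apply 2D RoPE to one head vector whose size is a multiple of 4:
// the first half is rotated by pos_a, the second half by pos_b.
inline std::vector<float> rope_2d(std::vector<float> head, float pos_a, float pos_b,
                                  float freq_base) {
    const std::size_t half = head.size() / 2;
    rope_1d(head, 0, half, pos_a, freq_base);
    rope_1d(head, half, half, pos_b, freq_base);
    return head;
}
```

Because each pair is only rotated, the vector's norm is unchanged, and a patch at position (0, 0) is left as-is.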

Import

#include "clip-graph.h"

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| ctx | `clip_ctx *` | Yes | CLIP context containing the model and configuration |
| img | `const clip_image_f32 &` | Yes | Single preprocessed input image in float32 format |

Outputs

| Name | Type | Description |
|------|------|-------------|
| build() | `ggml_cgraph *` | Constructed computation graph for the vision encoder forward pass |
| build_vit() | `ggml_tensor *` | Output tensor from a standard ViT graph |
| build_inp() | `ggml_tensor *` | Tensor of shape [n_embd, n_patches] after conv2d patch extraction |
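
To make the `[n_embd, n_patches]` shape concrete: a convolution whose kernel size and stride both equal the patch size is equivalent to taking a dot product between each non-overlapping patch and one weight vector per embedding channel. The sketch below shows this for a single-channel image using plain vectors. `patch_embed` is a hypothetical helper, and the row-major layouts are assumptions of the sketch, not the layout `build_inp()` uses internally.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical single-channel patch embedding (kernel size == stride == patch_size).
// img:    row-major [img_h][img_w]
// weight: row-major [n_embd][patch_size * patch_size]
// result: n_patches rows of n_embd values (n_embd is the fastest-varying dimension)
inline std::vector<float> patch_embed(const std::vector<float> & img, int img_w, int img_h,
                                      int patch_size, const std::vector<float> & weight,
                                      int n_embd) {
    const int npx = img_w / patch_size;  // patches per row
    const int npy = img_h / patch_size;  // patches per column
    std::vector<float> out(static_cast<std::size_t>(npx) * npy * n_embd, 0.0f);
    for (int py = 0; py < npy; ++py) {
        for (int px = 0; px < npx; ++px) {
            const int patch = py * npx + px;
            for (int e = 0; e < n_embd; ++e) {
                float acc = 0.0f;
                for (int ky = 0; ky < patch_size; ++ky) {
                    for (int kx = 0; kx < patch_size; ++kx) {
                        const float pixel = img[(py * patch_size + ky) * img_w + (px * patch_size + kx)];
                        const float w     = weight[(e * patch_size + ky) * patch_size + kx];
                        acc += pixel * w;
                    }
                }
                out[static_cast<std::size_t>(patch) * n_embd + e] = acc;
            }
        }
    }
    return out;
}
```

The output holds one `n_embd`-sized vector per patch, which matches the two-dimensional shape listed in the table when `n_embd` is read as the contiguous dimension.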

Usage Examples

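#include "clip-graph.h"

// Subclass clip_graph for a custom vision model
struct my_clip_graph : clip_graph {
    my_clip_graph(clip_ctx * ctx, const clip_image_f32 & img)
        : clip_graph(ctx, img) {}

    ggml_cgraph * build() override {
        // Patch embeddings for the input image
        ggml_tensor * inp = build_inp();
        // Use build_vit for a standard ViT architecture with learned position embeddings
        ggml_tensor * output = build_vit(inp, n_patches, NORM_TYPE_NORMAL, FFN_GELU, model.position_embeddings, nullptr);
        // Register the output so the graph contains every node needed to compute it
        ggml_build_forward_expand(gf, output);
        return gf;
    }
};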
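
For architectures that merge neighboring patches, the pixel shuffle performed by `build_patch_merge_permute()` can be pictured as a space-to-depth rearrangement: each `scale_factor` x `scale_factor` block of patch tokens becomes one wider token. The sketch below illustrates this on plain vectors. `merge_patches` is a hypothetical helper, the ordering of neighbors inside a merged token is an assumption, and the grid is assumed to be divisible by the scale factor (no padding is handled).

```cpp
#include <cstddef>
#include <vector>

// Hypothetical space-to-depth merge on a token grid.
// tokens: grid_h * grid_w tokens in row-major order, each with n_embd values.
// Returns (grid_h / s) * (grid_w / s) tokens, each with s * s * n_embd values,
// formed by concatenating the s x s neighbours in row-major order (assumed ordering).
inline std::vector<float> merge_patches(const std::vector<float> & tokens,
                                        int grid_w, int grid_h, int n_embd, int s) {
    const int out_w = grid_w / s;
    const int out_h = grid_h / s;
    std::vector<float> out;
    out.reserve(tokens.size());
    for (int oy = 0; oy < out_h; ++oy) {
        for (int ox = 0; ox < out_w; ++ox) {
            for (int dy = 0; dy < s; ++dy) {
                for (int dx = 0; dx < s; ++dx) {
                    // Offset of the source token at grid position (oy*s + dy, ox*s + dx)
                    const std::size_t src =
                        (static_cast<std::size_t>(oy * s + dy) * grid_w + (ox * s + dx)) * n_embd;
                    const auto first = tokens.begin() + static_cast<std::ptrdiff_t>(src);
                    out.insert(out.end(), first, first + n_embd);
                }
            }
        }
    }
    return out;
}
```

With a scale factor of 2, the token count drops by a factor of 4 while the per-token width grows by the same factor, so the total number of values is unchanged.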
