Implementation:Ollama Ollama Clip Graph

Knowledge Sources	Ollama
Domains	Multimodal, CLIP
Last Updated	2025-02-15 00:00 GMT

Overview

Abstract base class header for CLIP vision/audio encoder computation graphs, providing shared building blocks for all multimodal model graph builders.

Description

Defines the clip_graph struct, the base for all vision and audio encoder graph builders. It holds references to the model, hyperparameters, and image data, along with the patch dimensions and the ggml context and graph being built. Its pure virtual build() method is implemented by each subclass to construct a model-specific computation graph.

The struct also provides shared utility methods: build_vit (generic Vision Transformer graph), build_inp / build_inp_raw (input tensor construction, including the conv2d patch embedding), build_attn (multi-head attention sub-graph), build_ffn (feed-forward sub-graph), build_norm (layer or RMS normalization), build_rope_2d (2D rotary position embedding), build_patch_merge_permute (pixel shuffle/unshuffle), and build_stack (audio frame stacking).
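The const members listed in the signature (patch counts, head dimension, attention scale) are presumably derived once in the constructor from the hyperparameters and the image size. The following is a minimal sketch of that idea, not the actual constructor: mock_hparams, mock_image, and mock_graph_dims are hypothetical stand-ins, and the formulas (integer division of image size by patch size, d_head = n_embd / n_head, kq_scale = 1/sqrt(d_head)) are assumptions based on common Vision Transformer conventions.

```cpp
#include <cmath>

// Hypothetical stand-ins for clip_hparams and clip_image_f32.
struct mock_hparams {
    int patch_size;
    int n_embd;
    int n_head;
};

struct mock_image {
    int nx; // width in pixels
    int ny; // height in pixels
};

// Mirrors the shape of clip_graph's const members.
// The derivations below are assumptions, not copied from clip-graph.h.
struct mock_graph_dims {
    const int   patch_size;
    const int   n_patches_x;
    const int   n_patches_y;
    const int   n_patches;
    const int   n_embd;
    const int   n_head;
    const int   d_head;
    const float kq_scale;

    // Members are initialized in declaration order, so n_patches can use
    // n_patches_x / n_patches_y and kq_scale can use d_head.
    mock_graph_dims(const mock_hparams & hp, const mock_image & img)
        : patch_size(hp.patch_size),
          n_patches_x(img.nx / hp.patch_size),
          n_patches_y(img.ny / hp.patch_size),
          n_patches(n_patches_x * n_patches_y),
          n_embd(hp.n_embd),
          n_head(hp.n_head),
          d_head(hp.n_embd / hp.n_head),
          kq_scale(1.0f / std::sqrt(static_cast<float>(d_head))) {}
};
```

With a 336x336 image, a patch size of 14, and 1024-wide embeddings split over 16 heads, this sketch yields a 24x24 patch grid and a head dimension of 64.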

Usage

Subclassed by all model-specific graph builders (llava, qwen2vl, qwen3vl, siglip, whisper-enc, cogvlm, glm4v, llama4, minicpmv).
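Because every builder shares the same build() entry point, a caller can pick a subclass at runtime and treat it uniformly afterwards. The sketch below illustrates that dispatch pattern with mock types only: mock_proj_type, mock_graph, the two mock subclasses, and make_builder are all hypothetical names, and a std::string stands in for the ggml_cgraph * that the real build() returns.

```cpp
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for projector_type.
enum class mock_proj_type { SIGLIP, WHISPER_ENC };

// Hypothetical stand-in for clip_graph: one pure virtual build().
struct mock_graph {
    virtual ~mock_graph() = default;
    virtual std::string build() = 0; // stands in for ggml_cgraph * build()
};

struct mock_graph_siglip : mock_graph {
    std::string build() override { return "siglip graph"; }
};

struct mock_graph_whisper_enc : mock_graph {
    std::string build() override { return "whisper-enc graph"; }
};

// Select a model-specific builder; callers only see the base interface.
std::unique_ptr<mock_graph> make_builder(mock_proj_type t) {
    switch (t) {
        case mock_proj_type::SIGLIP:
            return std::make_unique<mock_graph_siglip>();
        case mock_proj_type::WHISPER_ENC:
            return std::make_unique<mock_graph_whisper_enc>();
    }
    throw std::invalid_argument("unsupported projector type");
}
```

The virtual destructor in the base matters here: the builder is owned and destroyed through a pointer to the base type.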

Code Reference

Source Location

Repository: Ollama
File: llama/llama.cpp/tools/mtmd/clip-graph.h
Lines: 1-121

Signature

struct clip_graph {
    const clip_model & model;
    const clip_hparams & hparams;
    projector_type proj_type;
    const clip_image_f32 & img;
    const int patch_size, n_patches_x, n_patches_y, n_patches;
    const int n_embd, n_head, d_head, n_layer, n_mmproj_embd;
    const float eps, kq_scale;

    ggml_context * ctx0;
    ggml_cgraph * gf;

    clip_graph(clip_ctx * ctx, const clip_image_f32 & img);
    virtual ~clip_graph() = default;
    virtual ggml_cgraph * build() = 0;

    ggml_tensor * build_vit(ggml_tensor * inp, int64_t n_pos,
        norm_type norm_t, ffn_op_type ffn_t,
        ggml_tensor * learned_pos_embd,
        std::function<ggml_tensor *(ggml_tensor *, const clip_layer &)> add_pos);
    ggml_tensor * build_inp();
    ggml_tensor * build_inp_raw(int channels = 3);
    ggml_tensor * build_norm(ggml_tensor * cur, ggml_tensor * mw, ggml_tensor * mb,
        norm_type type, float norm_eps, int il) const;
    ggml_tensor * build_ffn(ggml_tensor * cur, ...);
    ggml_tensor * build_attn(ggml_tensor * wo, ggml_tensor * wo_b,
        ggml_tensor * q_cur, ggml_tensor * k_cur, ggml_tensor * v_cur,
        ggml_tensor * kq_mask, float kq_scale, int il) const;
    ggml_tensor * build_rope_2d(ggml_context * ctx0, ggml_tensor * cur,
        ggml_tensor * pos_a, ggml_tensor * pos_b,
        const float freq_base, const bool interleave_freq);
    ggml_tensor * build_patch_merge_permute(ggml_tensor * cur, int scale_factor);
    ggml_tensor * build_stack(ggml_tensor * cur, int32_t stack_factor, int32_t n_embed);
};
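build_norm takes a weight tensor (mw), a bias tensor (mb), and a norm_type. As a rough illustration of the two normalization variants it covers, here is a plain-vector sketch over a single non-empty row. It does not use ggml: norm_row and mock_norm_type are hypothetical names, and treating the weight and bias as optional (null pointer means "skip") is an assumption about how the real method behaves.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for norm_type.
enum class mock_norm_type { NORMAL, RMS };

// Normalize one non-empty row.
//   NORMAL: (x - mean) / sqrt(var + eps)
//   RMS:    x / sqrt(mean(x^2) + eps)
// Then apply the optional elementwise weight and bias.
std::vector<float> norm_row(const std::vector<float> & x,
                            const std::vector<float> * w,
                            const std::vector<float> * b,
                            mock_norm_type type, float eps) {
    const float n = static_cast<float>(x.size());

    // RMS norm skips mean-centering, so the mean stays at zero.
    float mean = 0.0f;
    if (type == mock_norm_type::NORMAL) {
        for (float v : x) mean += v;
        mean /= n;
    }

    // Sum of squared deviations (from zero in the RMS case).
    float sq = 0.0f;
    for (float v : x) sq += (v - mean) * (v - mean);
    const float inv = 1.0f / std::sqrt(sq / n + eps);

    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = (x[i] - mean) * inv;
        if (w) y[i] *= (*w)[i];
        if (b) y[i] += (*b)[i];
    }
    return y;
}
```

Layer normalization output sums to zero before the weight and bias are applied, while RMS normalization preserves the sign and relative scale of every input element.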

Import

#include "clip-graph.h"

I/O Contract

Inputs

Name	Type	Required	Description
ctx	clip_ctx *	Yes	Initialized CLIP context with model and backends
img	const clip_image_f32 &	Yes	Preprocessed float image to encode

Outputs

Name	Type	Description
build() return value	ggml_cgraph *	Built computation graph ready for backend evaluation

Usage Examples

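// Subclass pattern for a new model
struct clip_graph_my_model : public clip_graph {
    using clip_graph::clip_graph; // inherit the (ctx, img) constructor
    ggml_cgraph * build() override {
        // Input tensor built from the image
        ggml_tensor * inp = build_inp();
        // Generic ViT stack with learned position embeddings;
        // no add_pos callback is supplied
        ggml_tensor * cur = build_vit(inp, n_patches,
            NORM_TYPE_NORMAL, FFN_GELU,
            model.position_embeddings, nullptr);
        // Apply the model's projection matrix to the encoder output
        cur = ggml_mul_mat(ctx0, model.projection, cur);
        // Add the final tensor and its dependencies to the graph
        ggml_build_forward_expand(gf, cur);
        return gf;
    }
};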

Related Pages

Principle:Ollama_Ollama_MultimodalPipeline
