

Implementation:Ollama Ollama Clip Model

From Leeroopedia
Knowledge Sources
Domains: Multimodal, CLIP
Last Updated: 2025-02-15 00:00 GMT

Overview

Defines the CLIP model data structures including hyperparameters, transformer layers, and the overall model for vision and audio encoders.

Description

This header defines three structures:

  • clip_hparams: all vision and audio encoder hyperparameters (image size, patch size, embedding dimensions, attention heads, layer count, normalization epsilon, RoPE theta, mel bins for audio, image resolution candidates).
  • clip_layer: per-layer weight tensors for attention (Q/K/V/output), feed-forward (up/gate/down), layer norms, layer scales, and Qwen3-VL deepstack merger weights.
  • clip_model: the full model, with embedding tensors, projection weights for the various projector types, conv1d weights for audio, and an array of clip_layer objects.
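To make the role of the hyperparameters concrete, the sketch below derives two quantities that encoder code commonly needs: the patch count for a square image and the per-head attention width. The struct `hparams_sketch` and the helper names are hypothetical stand-ins that mirror a few clip_hparams fields; they are not part of clip-model.h, and the divisibility checks are assumptions rather than documented requirements.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in mirroring a subset of clip_hparams (not the real header).
struct hparams_sketch {
    int32_t image_size = 0;
    int32_t patch_size = 0;
    int32_t n_embd     = 0;
    int32_t n_head     = 0;
};

// Number of patches along one side of a square input image.
inline int32_t patches_per_side(const hparams_sketch & hp) {
    // Assumption: the image size is an exact multiple of the patch size.
    assert(hp.patch_size > 0 && hp.image_size % hp.patch_size == 0);
    return hp.image_size / hp.patch_size;
}

// Total number of patches for a square input image.
inline int32_t n_patches(const hparams_sketch & hp) {
    const int32_t side = patches_per_side(hp);
    return side * side;
}

// Width of a single attention head.
inline int32_t head_dim(const hparams_sketch & hp) {
    // Assumption: the embedding width divides evenly across heads.
    assert(hp.n_head > 0 && hp.n_embd % hp.n_head == 0);
    return hp.n_embd / hp.n_head;
}
```

With image_size = 336 and patch_size = 14, for instance, each side holds 24 patches and the image holds 576 in total.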

Usage

Used by all CLIP graph builders and the model loading code in clip.cpp to represent the in-memory model loaded from GGUF files.
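The Signature section lists both separate q_w/k_w/v_w tensors and a fused qkv_w tensor on clip_layer. A graph builder therefore plausibly needs to check which set was loaded for a given layer. The following is a minimal sketch of such a check, assuming unloaded tensors are left as null pointers; `ggml_tensor_stub`, `layer_sketch`, `qkv_layout`, and `detect_qkv_layout` are hypothetical names for illustration and do not appear in the Ollama sources.

```cpp
// Hypothetical placeholder for ggml_tensor; the real type comes from ggml.
struct ggml_tensor_stub {};

// Hypothetical stand-in mirroring the attention weight pointers of clip_layer.
struct layer_sketch {
    ggml_tensor_stub * q_w   = nullptr;
    ggml_tensor_stub * k_w   = nullptr;
    ggml_tensor_stub * v_w   = nullptr;
    ggml_tensor_stub * qkv_w = nullptr;
};

enum class qkv_layout { fused, split, missing };

// Classify a layer by which attention weights are present.
// Assumption: a tensor absent from the model file stays nullptr.
inline qkv_layout detect_qkv_layout(const layer_sketch & layer) {
    if (layer.qkv_w != nullptr) {
        return qkv_layout::fused;   // one combined Q/K/V projection
    }
    if (layer.q_w != nullptr && layer.k_w != nullptr && layer.v_w != nullptr) {
        return qkv_layout::split;   // three separate projections
    }
    return qkv_layout::missing;     // incomplete attention weights
}
```

Returning an explicit `missing` state instead of asserting lets the caller decide whether an incomplete layer is a loading error or an expected case for a given architecture.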

Code Reference

Source Location

  • Repository: Ollama
  • File: llama/llama.cpp/tools/mtmd/clip-model.h
  • Lines: 1-300

Signature

struct clip_hparams {
    int32_t image_size = 0;
    int32_t patch_size = 0;
    int32_t n_embd = 0;
    int32_t n_ff = 0;
    int32_t projection_dim = 0;
    int32_t n_head = 0;
    int32_t n_layer = 0;
    float image_mean[3];
    float image_std[3];
    ffn_op_type ffn_op = FFN_GELU;
    float eps = 1e-6;
    float rope_theta = 0.0;
    int32_t n_mel_bins = 0;
    void set_limit_image_tokens(int n_tokens_min, int n_tokens_max);
    void set_warmup_n_tokens(int n_tokens);
};

struct clip_layer {
    ggml_tensor * k_w, * k_b, * q_w, * q_b, * v_w, * v_b;
    ggml_tensor * qkv_w, * qkv_b;
    ggml_tensor * o_w, * o_b;
    ggml_tensor * ln_1_w, * ln_1_b, * ln_2_w, * ln_2_b;
    ggml_tensor * ff_up_w, * ff_gate_w, * ff_down_w;
    bool has_deepstack() const;
};

struct clip_model {
    clip_modality modality = CLIP_MODALITY_VISION;
    projector_type proj_type = PROJECTOR_TYPE_MLP;
    clip_hparams hparams;
    std::vector<clip_layer> layers;
    ggml_tensor * position_embeddings;
    ggml_tensor * projection;
};

Import

#include "clip-model.h"

I/O Contract

Inputs

Name             Type    Required  Description
GGUF model file  binary  Yes       GGUF file containing model weights and hyperparameters

Outputs

Name          Type    Description
clip_hparams  struct  All vision/audio encoder hyperparameters
clip_layer    struct  Per-layer weight tensors for transformer blocks
clip_model    struct  Complete model with embeddings, layers, and projectors

Usage Examples

// `model` is assumed to be a clip_model already populated by the loader.

// Access model hyperparameters
const auto & hparams = model.hparams;

// Patch count for a square image: patches per side, squared
int n_patches = (hparams.image_size / hparams.patch_size) *
                (hparams.image_size / hparams.patch_size);

// Iterate over transformer layers (read-only access)
for (int il = 0; il < hparams.n_layer; il++) {
    const auto & layer = model.layers[il];
    // Use layer.q_w, layer.k_w, layer.v_w for attention
}
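The loop above indexes model.layers using hparams.n_layer, which is only safe when the two agree. As a hedged illustration, the sketch below checks that agreement before iterating and then walks the layers with a range-based loop. The types `tensor_stub`, `clip_layer_sketch`, and `clip_model_sketch` are hypothetical stand-ins for the real structures, and whether the real loader already guarantees this invariant is not stated on this page.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical placeholder for ggml_tensor.
struct tensor_stub {};

// Hypothetical stand-ins mirroring a few fields of clip_layer and clip_model.
struct clip_layer_sketch {
    tensor_stub * q_w = nullptr;
    tensor_stub * q_b = nullptr;
};

struct clip_model_sketch {
    int32_t n_layer = 0;                      // mirrors hparams.n_layer
    std::vector<clip_layer_sketch> layers;    // mirrors clip_model::layers
};

// True when the declared layer count matches the number of stored layers.
inline bool layers_consistent(const clip_model_sketch & m) {
    return m.n_layer >= 0 &&
           m.layers.size() == static_cast<std::size_t>(m.n_layer);
}

// Count layers that carry a query bias; iterating the vector directly
// avoids any out-of-range index even if n_layer is wrong.
inline int count_layers_with_q_bias(const clip_model_sketch & m) {
    int n = 0;
    for (const auto & layer : m.layers) {
        if (layer.q_b != nullptr) {
            ++n;
        }
    }
    return n;
}
```

Calling `layers_consistent` once after loading keeps later index-based loops free of per-iteration bounds checks.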
