Implementation:Ollama Ollama Clip Model

Knowledge Sources	Ollama
Domains	Multimodal, CLIP
Last Updated	2025-02-15 00:00 GMT

Overview

Defines the in-memory data structures for the CLIP vision and audio encoders: hyperparameters (clip_hparams), per-layer transformer weights (clip_layer), and the complete model (clip_model).

Description

This header defines three structures. clip_hparams holds the vision and audio encoder hyperparameters: image size, patch size, embedding dimensions, attention head count, layer count, normalization epsilon, RoPE theta, mel bin count for audio, and image resolution candidates. clip_layer holds the per-layer weight tensors: attention (Q/K/V/output), feed-forward (up/gate/down), layer norms, layer scales, and Qwen3-VL deepstack merger weights. clip_model holds the full model: embedding tensors, projection weights for the supported projector types, conv1d weights for audio, and a vector of clip_layer objects.
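Several of these hyperparameters are usually combined into derived values. As a minimal sketch, the per-head attention width follows from n_embd and n_head. The hparams_sketch struct and the head_dim helper below are hypothetical stand-ins for illustration and are not part of clip-model.h.

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-in for two clip_hparams fields (not the real struct).
struct hparams_sketch {
    int32_t n_embd = 0;  // embedding width
    int32_t n_head = 0;  // number of attention heads
};

// Hypothetical helper: width of each attention head.
// Assumes n_embd is an exact multiple of n_head.
inline int32_t head_dim(const hparams_sketch & hp) {
    assert(hp.n_head > 0 && hp.n_embd % hp.n_head == 0);
    return hp.n_embd / hp.n_head;
}
```

With n_embd = 1024 and n_head = 16, head_dim returns 64.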

Usage

Used by all CLIP graph builders and the model loading code in clip.cpp to represent the in-memory model loaded from GGUF files.
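The relationship between the loader and these structures can be sketched as follows: once n_layer is known, the layer vector is sized to match before per-layer tensors are attached. This is an illustrative sketch under stated assumptions, not the loading code in clip.cpp. The mini_* types and both helper functions are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in types so the sketch compiles without ggml headers.
struct tensor_stub {};

struct mini_hparams {
    int32_t n_layer = 0;
};

struct mini_layer {
    tensor_stub * q_w = nullptr;  // left unset until weights are attached
};

struct mini_model {
    mini_hparams hparams;
    std::vector<mini_layer> layers;
};

// Hypothetical helper: size the layer vector from the hyperparameters.
inline void allocate_layers(mini_model & model) {
    assert(model.hparams.n_layer >= 0);
    model.layers.resize(static_cast<std::size_t>(model.hparams.n_layer));
}

// Hypothetical check a graph builder could run before indexing layers.
inline bool layer_count_matches(const mini_model & model) {
    return model.hparams.n_layer >= 0 &&
           model.layers.size() == static_cast<std::size_t>(model.hparams.n_layer);
}
```

Keeping the vector length equal to n_layer is what makes a loop such as `for (il = 0; il < hparams.n_layer; il++)` safe to index with `model.layers[il]`.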

Code Reference

Source Location

Repository: Ollama
File: llama/llama.cpp/tools/mtmd/clip-model.h
Lines: 1-300

Signature

struct clip_hparams {
    int32_t image_size = 0;
    int32_t patch_size = 0;
    int32_t n_embd = 0;
    int32_t n_ff = 0;
    int32_t projection_dim = 0;
    int32_t n_head = 0;
    int32_t n_layer = 0;
    float image_mean[3];
    float image_std[3];
    ffn_op_type ffn_op = FFN_GELU;
    float eps = 1e-6;
    float rope_theta = 0.0;
    int32_t n_mel_bins = 0;
    void set_limit_image_tokens(int n_tokens_min, int n_tokens_max);
    void set_warmup_n_tokens(int n_tokens);
};

struct clip_layer {
    ggml_tensor * k_w, * k_b, * q_w, * q_b, * v_w, * v_b;
    ggml_tensor * qkv_w, * qkv_b;
    ggml_tensor * o_w, * o_b;
    ggml_tensor * ln_1_w, * ln_1_b, * ln_2_w, * ln_2_b;
    ggml_tensor * ff_up_w, * ff_gate_w, * ff_down_w;
    bool has_deepstack() const;
};

struct clip_model {
    clip_modality modality = CLIP_MODALITY_VISION;
    projector_type proj_type = PROJECTOR_TYPE_MLP;
    clip_hparams hparams;
    std::vector<clip_layer> layers;
    ggml_tensor * position_embeddings;
    ggml_tensor * projection;
};
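
clip_layer lists both separate q_w/k_w/v_w tensors and a fused qkv_w tensor. A minimal sketch of how calling code might tell the two layouts apart is shown below. It assumes that tensors absent from the model file are left as nullptr. The stand-in types and the detect_attn_layout helper are hypothetical and not part of the header.

```cpp
// Stand-in for ggml_tensor so the sketch compiles without ggml headers.
struct tensor_stub {};

// Simplified mirror of the attention fields of clip_layer.
// Assumption: the loader leaves missing tensors as nullptr.
struct layer_sketch {
    tensor_stub * q_w   = nullptr;
    tensor_stub * k_w   = nullptr;
    tensor_stub * v_w   = nullptr;
    tensor_stub * qkv_w = nullptr;
    tensor_stub * o_w   = nullptr;
};

enum class attn_layout { fused_qkv, separate_qkv, missing };

// Hypothetical helper: report which attention weight layout a layer carries.
inline attn_layout detect_attn_layout(const layer_sketch & layer) {
    if (layer.qkv_w != nullptr) {
        return attn_layout::fused_qkv;
    }
    if (layer.q_w != nullptr && layer.k_w != nullptr && layer.v_w != nullptr) {
        return attn_layout::separate_qkv;
    }
    return attn_layout::missing;
}
```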

Import

#include "clip-model.h"

I/O Contract

Inputs

Name	Type	Required	Description
GGUF model file	binary	Yes	GGUF file containing model weights and hyperparameters

Outputs

Name	Type	Description
clip_hparams	struct	All vision/audio encoder hyperparameters
clip_layer	struct	Per-layer weight tensors for transformer blocks
clip_model	struct	Complete model with embeddings, layers, and projectors

Usage Examples

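// `model` is assumed to be a clip_model already populated by the loader.
// Access model hyperparameters
const auto & hparams = model.hparams;
// Patch count for a square input; integer division drops any remainder
const int patches_per_side = hparams.image_size / hparams.patch_size;
const int n_patches = patches_per_side * patches_per_side;

// Iterate over transformer layers
for (int il = 0; il < hparams.n_layer; il++) {
    const auto & layer = model.layers[il];
    // Use layer.q_w, layer.k_w, layer.v_w for attention
}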
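
The patch-count arithmetic above can be checked in isolation. The following is a self-contained sketch that assumes a square input image. The count_patches function is a hypothetical helper, not part of clip-model.h.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of patches for a square image.
// Integer division means pixels beyond the last full patch are ignored.
inline int32_t count_patches(int32_t image_size, int32_t patch_size) {
    assert(image_size > 0 && patch_size > 0);
    const int32_t per_side = image_size / patch_size;
    return per_side * per_side;
}
```

For example, image_size = 224 with patch_size = 14 gives 16 patches per side and 256 patches in total.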

Related Pages

Principle:Ollama_Ollama_MultimodalPipeline
