Implementation:Ollama Ollama Clip Impl

Knowledge Sources	Ollama
Domains	Multimodal, CLIP
Last Updated	2025-02-15 00:00 GMT

Overview

Internal implementation header for the CLIP multimodal system, defining constants, data structures, enumerations, and the clip context used throughout the vision/audio encoder pipeline.

Description

This header defines GGUF key constants for reading model metadata (embedding sizes, attention parameters, image/audio configuration), tensor name format strings for locating weights in GGUF files, and the projector_type enumeration covering all supported projector architectures (MLP, LDP, MiniCPM-V, Qwen2VL, Gemma3, Pixtral, Ultravox, CogVLM, GLM4V, and more). It also defines image data structures (clip_image_u8 for raw RGB images, clip_image_f32 for preprocessed float images), batch types for processing multiple images, smart pointer deleters, and logging infrastructure. Finally, it provides utility functions for string formatting and GGUF data parsing, along with the main clip_ctx context struct.
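
The printf-style tensor name patterns described above can be expanded with a small formatting helper. The following is a minimal sketch, not the header's actual code: format_name is a hypothetical stand-in for the string-formatting utility, and EXAMPLE_TN_PATTERN is an assumed pattern used only for illustration.

```cpp
#include <cstdarg>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical stand-in for a printf-style string-formatting utility.
static std::string format_name(const char * fmt, ...) {
    va_list ap;
    va_list ap_copy;
    va_start(ap, fmt);
    va_copy(ap_copy, ap);
    // First pass: measure the formatted length without writing anything.
    const int size = std::vsnprintf(nullptr, 0, fmt, ap);
    if (size < 0) {
        va_end(ap_copy);
        va_end(ap);
        return std::string();
    }
    // Second pass: write into a buffer with room for the trailing NUL.
    std::vector<char> buf(static_cast<size_t>(size) + 1);
    std::vsnprintf(buf.data(), buf.size(), fmt, ap_copy);
    va_end(ap_copy);
    va_end(ap);
    return std::string(buf.data(), static_cast<size_t>(size));
}

// Assumed pattern, for illustration only: prefix, block index, suffix.
static const char * const EXAMPLE_TN_PATTERN = "%s.blk.%d.attn_qkv.%s";
```

Calling format_name(EXAMPLE_TN_PATTERN, "v", 0, "weight") with this assumed pattern yields "v.blk.0.attn_qkv.weight". The two-pass approach avoids guessing a fixed buffer size for names of varying length.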

Usage

Included internally by clip.cpp, clip-model.h, clip-graph.h, and multimodal model graph builders. Not intended for use outside the mtmd subsystem.

Code Reference

Source Location

Repository: Ollama
File: llama/llama.cpp/tools/mtmd/clip-impl.h
Lines: 1-511

Signature

enum projector_type {
    PROJECTOR_TYPE_MLP,
    PROJECTOR_TYPE_MLP_NORM,
    PROJECTOR_TYPE_LDP,
    PROJECTOR_TYPE_LDPV2,
    PROJECTOR_TYPE_MINICPMV,
    PROJECTOR_TYPE_GLM_EDGE,
    PROJECTOR_TYPE_QWEN2VL,
    PROJECTOR_TYPE_QWEN3VL,
    PROJECTOR_TYPE_GEMMA3,
    PROJECTOR_TYPE_IDEFICS3,
    PROJECTOR_TYPE_PIXTRAL,
    PROJECTOR_TYPE_ULTRAVOX,
    PROJECTOR_TYPE_INTERNVL,
    PROJECTOR_TYPE_LLAMA4,
    PROJECTOR_TYPE_COGVLM,
    PROJECTOR_TYPE_GLM4V,
    PROJECTOR_TYPE_UNKNOWN,
};

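// Raw RGB image: buf holds nx * ny * 3 bytes
struct clip_image_u8 {
    int nx; // width in pixels
    int ny; // height in pixels
    std::vector<uint8_t> buf;
};

// Preprocessed float image: buf holds nx * ny * channels floats
struct clip_image_f32 {
    int nx;
    int ny;
    std::vector<float> buf;
};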

struct clip_image_f32_batch {
    std::vector<clip_image_f32_ptr> entries;
    bool is_audio = false;
    int grid_x = 0;
    int grid_y = 0;
};
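
The batch struct above stores its entries as clip_image_f32_ptr, which this excerpt does not define. The following is a minimal sketch of how such an owning pointer type could be built, assuming it is a std::unique_ptr with a custom deleter. All example_* names and the g_freed counter are hypothetical and exist only to make the ownership behavior observable.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Local mirror of the float image container, so the sketch compiles alone.
struct example_image_f32 { int nx = 0; int ny = 0; std::vector<float> buf; };

static int g_freed = 0; // counts deleter calls, for demonstration only

// Custom deleter: invoked automatically when the owning pointer is destroyed.
struct example_image_f32_deleter {
    void operator()(example_image_f32 * img) const { ++g_freed; delete img; }
};
using example_image_f32_ptr =
    std::unique_ptr<example_image_f32, example_image_f32_deleter>;

// Same field layout as the batch struct shown in the signature.
struct example_image_f32_batch {
    std::vector<example_image_f32_ptr> entries;
    bool is_audio = false;
    int grid_x = 0;
    int grid_y = 0;
};

// Allocates a zero-filled image with 3 floats per pixel (assumed layout).
static example_image_f32_ptr make_image(int nx, int ny) {
    example_image_f32_ptr img(new example_image_f32);
    img->nx = nx;
    img->ny = ny;
    img->buf.assign(static_cast<size_t>(nx) * static_cast<size_t>(ny) * 3, 0.0f);
    return img;
}
```

Because the vector owns its entries through the smart pointer, destroying the batch releases every image without an explicit cleanup loop.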

Import

#include "clip-impl.h"

I/O Contract

Inputs

Name	Type	Required	Description
GGUF keys	string constants	Yes	Metadata keys for reading model hyperparameters from GGUF files
Tensor name patterns	format strings	Yes	Printf-style patterns for locating weight tensors in GGUF

Outputs

Name	Type	Description
projector_type	enum	Identifies the projector architecture for a loaded model
clip_image_u8	struct	Raw RGB image container (nx * ny * 3 bytes)
clip_image_f32	struct	Preprocessed float image container (nx * ny * channels)
clip_image_f32_batch	struct	Batch of preprocessed images with optional grid layout
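
The relationship between the raw and preprocessed containers can be sketched as a byte-to-float normalization pass. This is an illustration under stated assumptions, not the pipeline's actual preprocessing: normalize_u8_to_f32 is a hypothetical helper, the pixel layout is assumed to be interleaved RGB, and the per-channel mean and standard deviation are supplied by the caller.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Local mirrors of the containers listed above, so the sketch compiles alone.
struct image_u8  { int nx = 0; int ny = 0; std::vector<uint8_t> buf; };
struct image_f32 { int nx = 0; int ny = 0; std::vector<float>   buf; };

// Hypothetical helper: scale each byte to [0, 1], then apply per-channel
// (value - mean) / std_dev. Returns false if the buffer is not nx * ny * 3.
static bool normalize_u8_to_f32(const image_u8 & src, image_f32 & dst,
                                const float mean[3], const float std_dev[3]) {
    if (src.nx <= 0 || src.ny <= 0) {
        return false;
    }
    const size_t n = static_cast<size_t>(src.nx) * static_cast<size_t>(src.ny) * 3;
    if (src.buf.size() != n) {
        return false;
    }
    dst.nx = src.nx;
    dst.ny = src.ny;
    dst.buf.resize(n);
    for (size_t i = 0; i < n; ++i) {
        const size_t c = i % 3; // channel index, assuming interleaved RGB
        dst.buf[i] = (static_cast<float>(src.buf[i]) / 255.0f - mean[c]) / std_dev[c];
    }
    return true;
}
```

Checking the buffer length against nx * ny * 3 before indexing guards against a container whose dimensions and data have drifted apart.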

Usage Examples

// Access a projector type name
std::string name = PROJECTOR_TYPE_NAMES[PROJECTOR_TYPE_GEMMA3]; // "gemma3"

// Create a raw image container (3 bytes per pixel, RGB)
clip_image_u8 img;
img.nx = 224;
img.ny = 224;
img.buf.resize(static_cast<size_t>(img.nx) * img.ny * 3);

// Use tensor name format
std::string tensor_name = string_format(TN_ATTN_QKV, "v", 0, "weight");
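
The name lookup in the first example runs from enum value to string. Going the other way, from a name stored in model metadata back to an enum value, can be sketched as a reverse scan over the same kind of map. This is a minimal sketch: the example_* identifiers are hypothetical, the table holds only two entries, and the "mlp" string is an assumption ("gemma3" is taken from the example above).

```cpp
#include <map>
#include <string>

// Reduced stand-in for the projector enumeration.
enum example_projector_type { EXAMPLE_MLP, EXAMPLE_GEMMA3, EXAMPLE_UNKNOWN };

// Enum-to-name table; UNKNOWN deliberately has no entry.
static const std::map<example_projector_type, std::string> EXAMPLE_NAMES = {
    { EXAMPLE_MLP,    "mlp"    }, // assumed name
    { EXAMPLE_GEMMA3, "gemma3" },
};

// Hypothetical reverse lookup: returns EXAMPLE_UNKNOWN for unrecognized names.
static example_projector_type example_type_from_name(const std::string & name) {
    for (const auto & kv : EXAMPLE_NAMES) {
        if (kv.second == name) {
            return kv.first;
        }
    }
    return EXAMPLE_UNKNOWN;
}
```

Because this sketch declares the table const, forward lookups use EXAMPLE_NAMES.at(EXAMPLE_GEMMA3) rather than operator[], which is unavailable on a const std::map.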

Related Pages

Principle:Ollama_Ollama_MultimodalPipeline
