

Implementation:Ollama Ollama Clip Impl

From Leeroopedia
Knowledge Sources
Domains: Multimodal, CLIP
Last Updated: 2025-02-15 00:00 GMT

Overview

Internal implementation header for the CLIP multimodal system, defining constants, data structures, enumerations, and the clip context used throughout the vision/audio encoder pipeline.

Description

This header defines the GGUF key constants used to read model metadata (embedding sizes, attention parameters, image and audio configuration) and the tensor name format strings used to locate weights in GGUF files. It declares the projector_type enumeration, which covers the supported projector architectures (MLP, LDP, MiniCPM-V, Qwen2VL, Gemma3, Pixtral, Ultravox, CogVLM, GLM4V, and others; see the signature for the values listed on this page). It also declares the image data structures (clip_image_u8 for raw RGB images, clip_image_f32 for preprocessed float images), batch types for processing multiple images, smart pointer deleters, and logging infrastructure. Finally, it provides utility functions for string formatting and GGUF data parsing, as well as the main clip_ctx context struct.

Usage

Included internally by clip.cpp, clip-model.h, clip-graph.h, and multimodal model graph builders. Not intended for use outside the mtmd subsystem.

Code Reference

Source Location

  • Repository: Ollama
  • File: llama/llama.cpp/tools/mtmd/clip-impl.h
  • Lines: 1-511

Signature

enum projector_type {
    PROJECTOR_TYPE_MLP,
    PROJECTOR_TYPE_MLP_NORM,
    PROJECTOR_TYPE_LDP,
    PROJECTOR_TYPE_LDPV2,
    PROJECTOR_TYPE_MINICPMV,
    PROJECTOR_TYPE_GLM_EDGE,
    PROJECTOR_TYPE_QWEN2VL,
    PROJECTOR_TYPE_QWEN3VL,
    PROJECTOR_TYPE_GEMMA3,
    PROJECTOR_TYPE_IDEFICS3,
    PROJECTOR_TYPE_PIXTRAL,
    PROJECTOR_TYPE_ULTRAVOX,
    PROJECTOR_TYPE_INTERNVL,
    PROJECTOR_TYPE_LLAMA4,
    PROJECTOR_TYPE_COGVLM,
    PROJECTOR_TYPE_GLM4V,
    PROJECTOR_TYPE_UNKNOWN,
};

struct clip_image_u8 {
    int nx;
    int ny;
    std::vector<uint8_t> buf;
};

struct clip_image_f32 {
    int nx;
    int ny;
    std::vector<float> buf;
};

struct clip_image_f32_batch {
    std::vector<clip_image_f32_ptr> entries;
    bool is_audio = false;
    int grid_x = 0;
    int grid_y = 0;
};
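
The two image structs hold pixels in a flat buffer, so converting a raw image into a preprocessed one comes down to index arithmetic. The following is a minimal sketch of that conversion, not the code from the header. It assumes an interleaved, row-major RGB layout (3 values per pixel) and a per-channel mean/std normalization. The structs are local copies so the snippet compiles on its own, and normalize_image and pixel_offset are hypothetical helper names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Local stand-ins mirroring the structs in the signature above, so this
// sketch compiles without clip-impl.h.
struct clip_image_u8 {
    int nx;
    int ny;
    std::vector<uint8_t> buf;
};

struct clip_image_f32 {
    int nx;
    int ny;
    std::vector<float> buf;
};

// Hypothetical helper (not from the header): offset of channel c of pixel
// (x, y), assuming interleaved RGB stored row by row.
inline size_t pixel_offset(const clip_image_u8 & img, int x, int y, int c) {
    assert(x >= 0 && x < img.nx && y >= 0 && y < img.ny && c >= 0 && c < 3);
    return (static_cast<size_t>(y) * img.nx + x) * 3 + c;
}

// Hypothetical helper (not from the header): scale bytes to [0, 1], then
// subtract the per-channel mean and divide by the per-channel std.
inline clip_image_f32 normalize_image(const clip_image_u8 & src,
                                      const std::array<float, 3> & mean,
                                      const std::array<float, 3> & stdev) {
    assert(src.buf.size() == static_cast<size_t>(src.nx) * src.ny * 3);
    clip_image_f32 dst;
    dst.nx = src.nx;
    dst.ny = src.ny;
    dst.buf.resize(src.buf.size());
    for (size_t i = 0; i < src.buf.size(); ++i) {
        const size_t c = i % 3; // channel index: 0 = R, 1 = G, 2 = B
        dst.buf[i] = (src.buf[i] / 255.0f - mean[c]) / stdev[c];
    }
    return dst;
}
```

The mean and std values depend on the model and would normally come from its metadata; none are hard-coded here for that reason.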

Import

#include "clip-impl.h"

I/O Contract

Inputs

Name | Type | Required | Description
GGUF keys | string constants | Yes | Metadata keys for reading model hyperparameters from GGUF files
Tensor name patterns | format strings | Yes | Printf-style patterns for locating weight tensors in GGUF
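
To illustrate how a printf-style tensor name pattern expands into a concrete tensor name, here is a self-contained sketch. The pattern string EXAMPLE_TN_ATTN_QKV and the helper format_name are assumptions made for illustration; the actual constants and the header's own string formatting utility are not reproduced on this page. The helper uses the common two-pass vsnprintf idiom (measure, then write).

```cpp
#include <cassert>
#include <cstdarg>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical pattern for illustration: prefix, block index, suffix.
static const char * const EXAMPLE_TN_ATTN_QKV = "%s.blk.%d.attn_qkv.%s";

// Hypothetical helper (not from the header): printf-style formatting into
// a std::string.
inline std::string format_name(const char * fmt, ...) {
    va_list ap;
    va_list ap2;
    va_start(ap, fmt);
    va_copy(ap2, ap); // the first pass consumes ap, so keep a copy
    const int size = std::vsnprintf(nullptr, 0, fmt, ap); // pass 1: measure
    assert(size >= 0);
    std::vector<char> buf(static_cast<size_t>(size) + 1);
    std::vsnprintf(buf.data(), buf.size(), fmt, ap2);     // pass 2: write
    va_end(ap2);
    va_end(ap);
    return std::string(buf.data(), static_cast<size_t>(size));
}
```

Under these assumptions, format_name(EXAMPLE_TN_ATTN_QKV, "v", 0, "weight") yields "v.blk.0.attn_qkv.weight", which can then be used to look up the tensor by name.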

Outputs

Name | Type | Description
projector_type | enum | Identifies the projector architecture for a loaded model
clip_image_u8 | struct | Raw RGB image container (nx * ny * 3 bytes)
clip_image_f32 | struct | Preprocessed float image container (nx * ny * channels)
clip_image_f32_batch | struct | Batch of preprocessed images with optional grid layout
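
Because projector_type identifies the architecture of a loaded model, code typically has to translate between enum values and the names stored in model metadata. The sketch below shows one way to do that with a name table and a reverse lookup that falls back to PROJECTOR_TYPE_UNKNOWN. It uses a reduced local copy of the enum; the table EXAMPLE_PROJECTOR_NAMES and both helper functions are hypothetical, and only the "gemma3" string is confirmed by this page (the other names are assumed).

```cpp
#include <cassert>
#include <map>
#include <string>

// Reduced local copy of the enum in the signature, for illustration only.
enum projector_type {
    PROJECTOR_TYPE_MLP,
    PROJECTOR_TYPE_GEMMA3,
    PROJECTOR_TYPE_PIXTRAL,
    PROJECTOR_TYPE_UNKNOWN,
};

// Hypothetical name table. Only "gemma3" appears on this page; "mlp" and
// "pixtral" are assumed spellings.
static const std::map<projector_type, std::string> EXAMPLE_PROJECTOR_NAMES = {
    { PROJECTOR_TYPE_MLP,     "mlp"     },
    { PROJECTOR_TYPE_GEMMA3,  "gemma3"  },
    { PROJECTOR_TYPE_PIXTRAL, "pixtral" },
};

// Hypothetical forward lookup: enum value to name.
inline const std::string & projector_name(projector_type type) {
    const auto it = EXAMPLE_PROJECTOR_NAMES.find(type);
    assert(it != EXAMPLE_PROJECTOR_NAMES.end()); // UNKNOWN has no name here
    return it->second;
}

// Hypothetical reverse lookup: name to enum value, UNKNOWN when no entry
// matches.
inline projector_type projector_type_from_name(const std::string & name) {
    for (const auto & entry : EXAMPLE_PROJECTOR_NAMES) {
        if (entry.second == name) {
            return entry.first;
        }
    }
    return PROJECTOR_TYPE_UNKNOWN;
}
```

Returning PROJECTOR_TYPE_UNKNOWN for unrecognized names lets a loader reject unsupported models explicitly instead of indexing past the table.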

Usage Examples

// Access a projector type name
std::string name = PROJECTOR_TYPE_NAMES[PROJECTOR_TYPE_GEMMA3]; // "gemma3"

// Create a raw RGB image container (3 bytes per pixel)
clip_image_u8 img;
img.nx = 224;
img.ny = 224;
img.buf.resize(img.nx * img.ny * 3);

// Use tensor name format
std::string tensor_name = string_format(TN_ATTN_QKV, "v", 0, "weight");
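
The batch type owns its entries through smart pointers, which may be easier to follow with a self-contained sketch. The snippet below uses local stand-ins for clip_image_f32 and clip_image_f32_batch. The deleter example_image_f32_deleter and the helper add_blank_image are hypothetical; the header's real deleters may call a library free function rather than delete. It also assumes 3 float values per pixel, and it leaves grid_x and grid_y at their defaults because their exact relationship to the entries is not stated on this page.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Local stand-in mirroring the struct in the signature.
struct clip_image_f32 {
    int nx;
    int ny;
    std::vector<float> buf;
};

// Hypothetical deleter; the real header may free images differently.
struct example_image_f32_deleter {
    void operator()(clip_image_f32 * img) const { delete img; }
};

using clip_image_f32_ptr = std::unique_ptr<clip_image_f32, example_image_f32_deleter>;

// Local stand-in mirroring the batch struct in the signature.
struct clip_image_f32_batch {
    std::vector<clip_image_f32_ptr> entries;
    bool is_audio = false;
    int grid_x = 0;
    int grid_y = 0;
};

// Hypothetical helper: append a zero-filled nx * ny image (3 floats per
// pixel, an assumption) and transfer its ownership to the batch.
inline void add_blank_image(clip_image_f32_batch & batch, int nx, int ny) {
    assert(nx > 0 && ny > 0);
    clip_image_f32_ptr img(new clip_image_f32{nx, ny, {}});
    img->buf.resize(static_cast<size_t>(nx) * ny * 3, 0.0f);
    batch.entries.push_back(std::move(img));
}
```

Since the entries are unique_ptr values, the batch is move-only, and destroying it releases every image through the deleter.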
