Implementation:Ollama Ollama Clip Impl
| Knowledge Sources | |
|---|---|
| Domains | Multimodal, CLIP |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
Internal implementation header for the CLIP multimodal system, defining constants, data structures, enumerations, and the clip context used throughout the vision/audio encoder pipeline.
Description
This header defines:
- GGUF key constants for reading model metadata (embedding sizes, attention parameters, image/audio configuration)
- Tensor name format strings for locating weights in GGUF files
- The projector_type enumeration covering all supported projector architectures (MLP, LDP, MiniCPM-V, Qwen2VL, Gemma3, Pixtral, Ultravox, CogVLM, GLM4V, and more)
- Image data structures: clip_image_u8 for raw RGB images and clip_image_f32 for preprocessed float images
- Batch types for processing multiple images
- Smart pointer deleters and logging infrastructure
- Utility functions for string formatting and GGUF data parsing
- The main clip_ctx context struct
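The tensor name format strings are printf-style patterns that get expanded into concrete tensor names before lookup. The following is a minimal sketch of that idea, not the header's own code: `EXAMPLE_TN_PATTERN` and `format_name` are hypothetical stand-ins introduced here for illustration, and the real patterns and formatter are those declared in clip-impl.h.

```cpp
#include <cstdarg>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical pattern for illustration only: prefix, block index, suffix.
static const char * const EXAMPLE_TN_PATTERN = "%s.blk.%d.example.%s";

// Stand-in printf-style formatter that returns a std::string.
static std::string format_name(const char * fmt, ...) {
    va_list ap;
    va_list ap2;
    va_start(ap, fmt);
    va_copy(ap2, ap);  // the argument list is consumed once per vsnprintf call

    // First pass: measure the formatted length without writing anything.
    const int size = std::vsnprintf(nullptr, 0, fmt, ap);

    std::string out;
    if (size > 0) {
        // Second pass: write into a buffer with room for the terminator.
        std::vector<char> buf(static_cast<size_t>(size) + 1);
        std::vsnprintf(buf.data(), buf.size(), fmt, ap2);
        out.assign(buf.data(), static_cast<size_t>(size));
    }

    va_end(ap2);
    va_end(ap);
    return out;
}
```

Calling `format_name(EXAMPLE_TN_PATTERN, "v", 3, "weight")` expands the pattern for block 3 of a tensor group prefixed `v`. Keeping the patterns as constants means a layer loop only has to vary the block index.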
Usage
Included internally by clip.cpp, clip-model.h, clip-graph.h, and multimodal model graph builders. Not intended for use outside the mtmd subsystem.
Code Reference
Source Location
- Repository: Ollama
- File: llama/llama.cpp/tools/mtmd/clip-impl.h
- Lines: 1-511
Signature
enum projector_type {
    PROJECTOR_TYPE_MLP,
    PROJECTOR_TYPE_MLP_NORM,
    PROJECTOR_TYPE_LDP,
    PROJECTOR_TYPE_LDPV2,
    PROJECTOR_TYPE_MINICPMV,
    PROJECTOR_TYPE_GLM_EDGE,
    PROJECTOR_TYPE_QWEN2VL,
    PROJECTOR_TYPE_QWEN3VL,
    PROJECTOR_TYPE_GEMMA3,
    PROJECTOR_TYPE_IDEFICS3,
    PROJECTOR_TYPE_PIXTRAL,
    PROJECTOR_TYPE_ULTRAVOX,
    PROJECTOR_TYPE_INTERNVL,
    PROJECTOR_TYPE_LLAMA4,
    PROJECTOR_TYPE_COGVLM,
    PROJECTOR_TYPE_GLM4V,
    PROJECTOR_TYPE_UNKNOWN,
};

struct clip_image_u8 {
    int nx;
    int ny;
    std::vector<uint8_t> buf;
};

struct clip_image_f32 {
    int nx;
    int ny;
    std::vector<float> buf;
};

struct clip_image_f32_batch {
    std::vector<clip_image_f32_ptr> entries;
    bool is_audio = false;
    int grid_x = 0;
    int grid_y = 0;
};
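The batch struct stores its entries through a smart pointer type, which ties in with the smart pointer deleters the header provides. The sketch below shows the general pattern with stand-in types (`image_f32`, `image_f32_deleter`, `image_f32_ptr`, `image_f32_batch`, and `make_tile_batch` are all hypothetical names). It assumes that grid_x and grid_y count tiles per row and per column, which is an interpretation made for this example and is not confirmed by the signature above.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Stand-in for a preprocessed float image.
struct image_f32 {
    int nx = 0;
    int ny = 0;
    std::vector<float> buf;
};

// Custom deleter: unique_ptr calls this instead of a plain delete expression,
// so ownership can be routed through a dedicated free function if needed.
struct image_f32_deleter {
    void operator()(image_f32 * img) const { delete img; }
};
using image_f32_ptr = std::unique_ptr<image_f32, image_f32_deleter>;

// Stand-in for a batch with an optional grid layout.
struct image_f32_batch {
    std::vector<image_f32_ptr> entries;
    bool is_audio = false;
    int grid_x = 0;
    int grid_y = 0;
};

// Build a batch of grid_x * grid_y equally sized, zero-filled 3-channel tiles.
// Assumption: the grid fields describe how the tiles are arranged.
static image_f32_batch make_tile_batch(int grid_x, int grid_y, int tile_nx, int tile_ny) {
    image_f32_batch batch;
    batch.grid_x = grid_x;
    batch.grid_y = grid_y;
    for (int i = 0; i < grid_x * grid_y; ++i) {
        image_f32_ptr tile(new image_f32);
        tile->nx = tile_nx;
        tile->ny = tile_ny;
        tile->buf.assign(static_cast<std::size_t>(tile_nx) * tile_ny * 3, 0.0f);
        batch.entries.push_back(std::move(tile));  // the batch now owns the tile
    }
    return batch;
}
```

Because the entries are move-only, the batch cannot be copied by accident, and every tile is released when the batch goes out of scope.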
Import
#include "clip-impl.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| GGUF keys | string constants | Yes | Metadata keys for reading model hyperparameters from GGUF files |
| Tensor name patterns | format strings | Yes | Printf-style patterns for locating weight tensors in GGUF |
Outputs
| Name | Type | Description |
|---|---|---|
| projector_type | enum | Identifies the projector architecture for a loaded model |
| clip_image_u8 | struct | Raw RGB image container (nx * ny * 3 bytes) |
| clip_image_f32 | struct | Preprocessed float image container (nx * ny * channels) |
| clip_image_f32_batch | struct | Batch of preprocessed images with optional grid layout |
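The two image containers differ only in element type, so a preprocessing step can be pictured as a per-channel mapping from bytes to floats. The sketch below is one common way to do that (scale to [0, 1], then subtract a mean and divide by a standard deviation). It is an assumption for illustration and not necessarily the normalization the CLIP pipeline applies; `image_u8`, `image_f32`, and `normalize_u8_to_f32` are hypothetical stand-ins.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-ins for the raw and preprocessed image containers.
struct image_u8 {
    int nx = 0;
    int ny = 0;
    std::vector<std::uint8_t> buf;  // interleaved RGB, nx * ny * 3 bytes
};

struct image_f32 {
    int nx = 0;
    int ny = 0;
    std::vector<float> buf;         // interleaved RGB, nx * ny * 3 floats
};

// Convert interleaved RGB bytes to normalized floats.
// Returns false if the source dimensions and buffer size disagree.
static bool normalize_u8_to_f32(const image_u8 & src,
                                image_f32 & dst,
                                const std::array<float, 3> & mean,
                                const std::array<float, 3> & stddev) {
    if (src.nx <= 0 || src.ny <= 0) {
        return false;
    }
    const std::size_t n = static_cast<std::size_t>(src.nx) * src.ny * 3;
    if (src.buf.size() != n) {
        return false;
    }

    dst.nx = src.nx;
    dst.ny = src.ny;
    dst.buf.resize(n);
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t c = i % 3;  // channel index within the RGB triple
        dst.buf[i] = (src.buf[i] / 255.0f - mean[c]) / stddev[c];
    }
    return true;
}
```

Checking the buffer size against nx * ny * 3 up front turns the size contract from the table into an explicit precondition instead of an out-of-bounds read.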
Usage Examples
// Access a projector type name
std::string name = PROJECTOR_TYPE_NAMES[PROJECTOR_TYPE_GEMMA3]; // "gemma3"

// Create a raw image container
clip_image_u8 img;
img.nx = 224;
img.ny = 224;
img.buf.resize(224 * 224 * 3);

// Use tensor name format
std::string tensor_name = string_format(TN_ATTN_QKV, "v", 0, "weight");
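The name lookup in the first example can also be run in reverse, for instance to turn a projector name read from model metadata back into an enum value. The sketch below illustrates both directions with a reduced stand-in enum and map; `example_projector`, `EXAMPLE_NAMES`, `projector_from_string`, and `projector_to_string` are hypothetical names, not the header's API.

```cpp
#include <map>
#include <string>

// Reduced stand-in enum; the real one has many more members.
enum example_projector {
    EXAMPLE_MLP,
    EXAMPLE_GEMMA3,
    EXAMPLE_UNKNOWN,
};

// Enum-to-name table, in the spirit of a projector name map.
static const std::map<example_projector, std::string> EXAMPLE_NAMES = {
    { EXAMPLE_MLP,    "mlp"    },
    { EXAMPLE_GEMMA3, "gemma3" },
};

// Reverse lookup: scan the table for a matching name.
// Unrecognized names map to the UNKNOWN sentinel instead of failing.
static example_projector projector_from_string(const std::string & name) {
    for (const auto & kv : EXAMPLE_NAMES) {
        if (kv.second == name) {
            return kv.first;
        }
    }
    return EXAMPLE_UNKNOWN;
}

// Forward lookup using find(), which never modifies the map.
static std::string projector_to_string(example_projector type) {
    const auto it = EXAMPLE_NAMES.find(type);
    return it != EXAMPLE_NAMES.end() ? it->second : "unknown";
}
```

Using `find()` instead of `operator[]` matters for a `std::map`: indexing with a missing key silently inserts an empty string, whereas `find()` leaves the table untouched and works on a const map.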