Implementation:Ollama Ollama Clip API
| Knowledge Sources | |
|---|---|
| Domains | Multimodal, API |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
Internal C API header for the CLIP vision/audio encoder, defining the interface boundary between the mtmd multimodal library and the lower-level CLIP encoder implementation.
Description
Declares the opaque types clip_ctx, clip_image_size, clip_image_u8, clip_image_f32, and clip_image_f32_batch; enumerations for modality (CLIP_MODALITY_VISION, CLIP_MODALITY_AUDIO) and flash attention modes; and the clip_context_params and clip_init_result structs. The C function API covers:
- Initialization: clip_init
- Memory management: clip_free and the image allocation and free functions
- Image data manipulation: clip_build_img_from_pixels
- Preprocessing: clip_image_preprocess
- Encoding: clip_image_encode, clip_image_batch_encode, clip_encode_float_image
- Model property queries: image size, patch size, hidden size, output token counts, M-RoPE detection
Usage
Included by mtmd.cpp and mtmd-helper.cpp to access the CLIP encoder. The header is marked as internal and is not intended for direct use outside the mtmd subsystem.
Code Reference
Source Location
- Repository: Ollama
- File: llama/llama.cpp/tools/mtmd/clip.h
- Lines: 1-118
Signature
enum clip_modality {
    CLIP_MODALITY_VISION,
    CLIP_MODALITY_AUDIO,
};
enum clip_flash_attn_type {
    CLIP_FLASH_ATTN_TYPE_AUTO     = -1,
    CLIP_FLASH_ATTN_TYPE_DISABLED = 0,
    CLIP_FLASH_ATTN_TYPE_ENABLED  = 1,
};
struct clip_context_params {
    bool use_gpu;
    enum clip_flash_attn_type flash_attn_type;
    int image_min_tokens;
    int image_max_tokens;
    bool warmup;
};
struct clip_init_result clip_init(const char * fname, struct clip_context_params ctx_params);
void clip_free(struct clip_ctx * ctx);
bool clip_image_preprocess(struct clip_ctx * ctx, const struct clip_image_u8 * img,
                           struct clip_image_f32_batch * res_imgs);
bool clip_image_encode(struct clip_ctx * ctx, int n_threads,
                       struct clip_image_f32 * img, float * vec);
int clip_n_output_tokens(const struct clip_ctx * ctx, struct clip_image_f32 * img);
int clip_n_mmproj_embd(const struct clip_ctx * ctx);
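The clip_context_params struct above can be filled field by field with designated initializers, which keeps call sites readable if the field order ever changes. The following is a minimal sketch, not code from the repository: the enum and struct are local copies of the declarations above so the block compiles on its own, example_clip_params is a hypothetical helper, and the reading of -1 as "no override" is an assumption based on the values used under Usage Examples.

```c
#include <assert.h>
#include <stdbool.h>

/* Local copies of the declarations listed under Signature, so this sketch
 * compiles on its own. Real code would include "clip.h" instead. */
enum clip_flash_attn_type {
    CLIP_FLASH_ATTN_TYPE_AUTO     = -1,
    CLIP_FLASH_ATTN_TYPE_DISABLED = 0,
    CLIP_FLASH_ATTN_TYPE_ENABLED  = 1,
};

struct clip_context_params {
    bool use_gpu;
    enum clip_flash_attn_type flash_attn_type;
    int image_min_tokens;
    int image_max_tokens;
    bool warmup;
};

/* Hypothetical helper, not part of clip.h. Passing -1 for a token limit
 * mirrors the Usage Examples (assumed here to mean "no override"). */
static struct clip_context_params example_clip_params(bool use_gpu,
                                                      int min_tokens,
                                                      int max_tokens) {
    /* Sanity check chosen for this sketch only: explicit limits must be ordered. */
    assert(min_tokens < 0 || max_tokens < 0 || min_tokens <= max_tokens);

    struct clip_context_params p = {
        .use_gpu          = use_gpu,
        .flash_attn_type  = CLIP_FLASH_ATTN_TYPE_AUTO,
        .image_min_tokens = min_tokens,
        .image_max_tokens = max_tokens,
        .warmup           = true,
    };
    return p;
}
```

A caller would pass the result straight to clip_init, for example clip_init("mmproj.gguf", example_clip_params(true, -1, -1)).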
Import
#include "clip.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
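| fname | const char * | Yes | Path to the GGUF model file that contains the encoder (for example, mmproj.gguf) |
| ctx_params | struct clip_context_params | Yes | Configuration for GPU use, flash attention mode, image token limits, and warmup |
| img | const struct clip_image_u8 * | Yes | Raw 8-bit RGB image passed to clip_image_preprocess |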
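Image loaders often produce RGBA data, while the img input is described as raw RGB. The sketch below shows one way to repack pixels before handing them to clip_build_img_from_pixels. It assumes that function expects tightly packed, interleaved 8-bit RGB in row-major order (3 bytes per pixel); both helpers are hypothetical and not part of clip.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: byte count of a packed RGB image of nx * ny pixels.
 * Returns 0 for non-positive dimensions. Overflow is not checked, which is
 * acceptable for int dimensions on 64-bit targets. */
static size_t example_rgb_nbytes(int nx, int ny) {
    if (nx <= 0 || ny <= 0) {
        return 0;
    }
    return (size_t) nx * (size_t) ny * 3u;
}

/* Hypothetical helper: copy interleaved RGBA pixels into a packed RGB buffer,
 * dropping the alpha channel. dst must hold example_rgb_nbytes(nx, ny) bytes.
 * Returns the number of bytes written. */
static size_t example_rgba_to_rgb(const uint8_t * src, uint8_t * dst, int nx, int ny) {
    assert(src != NULL && dst != NULL);

    size_t n_pixels = example_rgb_nbytes(nx, ny) / 3u;
    for (size_t i = 0; i < n_pixels; i++) {
        dst[3 * i + 0] = src[4 * i + 0]; /* R */
        dst[3 * i + 1] = src[4 * i + 1]; /* G */
        dst[3 * i + 2] = src[4 * i + 2]; /* B */
    }
    return n_pixels * 3u;
}
```

The resulting buffer, together with nx and ny, is what a caller would pass on when building a clip_image_u8.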
Outputs
| Name | Type | Description |
|---|---|---|
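| clip_init_result | struct | Returned by clip_init; contains the vision (ctx_v) and audio (ctx_a) clip_ctx pointers |
| vec | float * | Caller-allocated buffer that receives the encoded embeddings from clip_image_encode |
| n_tokens | int | Number of output tokens for the given image, as returned by clip_n_output_tokens |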
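Because vec is supplied by the caller, its size has to be worked out before encoding. A minimal sketch, assuming the encoder writes one embedding of clip_n_mmproj_embd(ctx) floats per output token, so the buffer holds n_tokens * n_embd floats. The helper name is hypothetical and not part of clip.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: bytes needed for `vec`, given the values returned by
 * clip_n_output_tokens() and clip_n_mmproj_embd().
 * Returns 0 on non-positive input or if the size would overflow size_t. */
static size_t example_embd_nbytes(int n_tokens, int n_embd) {
    if (n_tokens <= 0 || n_embd <= 0) {
        return 0;
    }

    size_t count = (size_t) n_tokens;
    if (count > SIZE_MAX / (size_t) n_embd) {
        return 0; /* n_tokens * n_embd would overflow */
    }
    count *= (size_t) n_embd;

    if (count > SIZE_MAX / sizeof(float)) {
        return 0; /* byte count would overflow */
    }
    size_t bytes = count * sizeof(float);
    assert(bytes / sizeof(float) == count); /* postcondition of the checks above */
    return bytes;
}
```

A caller would typically malloc this many bytes, pass the pointer as vec to clip_image_encode, and free it once the embeddings have been consumed.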
Usage Examples
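struct clip_context_params params = {true, CLIP_FLASH_ATTN_TYPE_AUTO, -1, -1, true};
struct clip_init_result result = clip_init("mmproj.gguf", params);
struct clip_ctx * ctx = result.ctx_v; // vision context; NULL if loading failed
if (ctx != NULL) {
    int n_embd = clip_n_mmproj_embd(ctx);
    int patch_size = clip_get_patch_size(ctx);
    bool has_vision = clip_has_vision_encoder(ctx);
    bool has_audio = clip_has_audio_encoder(ctx);
}
// clip_init can return both a vision and an audio context; free each one that was created.
if (result.ctx_v != NULL) clip_free(result.ctx_v);
if (result.ctx_a != NULL) clip_free(result.ctx_a);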