Implementation:Ggml org Llama cpp CLIP Header
| Knowledge Sources | |
|---|---|
| Domains | Multimodal, Vision |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Internal C API header for the CLIP module, defining the interface between the mtmd library and the CLIP vision/audio encoder implementation.
Description
This header forward-declares the core types (`clip_ctx`, `clip_image_size`, `clip_image_f32`, etc.) and defines two enumerations: `clip_modality`, which distinguishes vision from audio, and `clip_flash_attn_type`, which configures flash attention. It defines `clip_context_params` for initialization settings (GPU usage, flash attention, token limits, warmup) and `clip_init_result`, the return type of `clip_init`, which holds separate vision and audio contexts. The exposed functions cover model initialization and freeing, model property queries (image size, patch size, hidden size, output token counts, mmproj embedding dimension), image memory management, image preprocessing, single-image and batch encoding, and projector type detection. Audio input is supported by adding mel spectrograms to a batch.
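Because initialization yields one context per modality, a caller generally has to pick the context that matches its input. The following is a minimal sketch of that selection, written with stand-in types so it compiles on its own. All `sketch_*` names are hypothetical and not part of `clip.h`; the `ctx_v`/`ctx_a` field names follow the Outputs table on this page, and treating a missing encoder as a null pointer is an assumption.

```c
#include <stddef.h>

/* Hypothetical stand-ins mirroring the shapes described above; real code
 * would use clip_modality, clip_ctx, and clip_init_result from clip.h. */
enum sketch_modality { SKETCH_MODALITY_VISION, SKETCH_MODALITY_AUDIO };

struct sketch_ctx { int id; };

struct sketch_init_result {
    struct sketch_ctx * ctx_v; /* vision context, NULL if absent (assumption) */
    struct sketch_ctx * ctx_a; /* audio context, NULL if absent (assumption) */
};

/* Return the context matching the requested modality, or NULL if none. */
struct sketch_ctx * sketch_select_ctx(struct sketch_init_result res,
                                      enum sketch_modality modality) {
    switch (modality) {
        case SKETCH_MODALITY_VISION: return res.ctx_v;
        case SKETCH_MODALITY_AUDIO:  return res.ctx_a;
    }
    return NULL;
}
```

Returning a null pointer for a missing modality lets the caller handle vision-only and audio-only models with a single check.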
Usage
Use this header when working on the internal CLIP implementation or the mtmd library that wraps it. This is an internal header not intended for direct use by external consumers; applications should use the higher-level mtmd API instead.
Code Reference
Source Location
- Repository: Ggml_org_Llama_cpp
- File: tools/mtmd/clip.h
- Lines: 1-121
Signature
struct clip_init_result clip_init(const char * fname, struct clip_context_params ctx_params);
void clip_free(struct clip_ctx * ctx);
size_t clip_embd_nbytes(const struct clip_ctx * ctx);
size_t clip_embd_nbytes_by_img(const struct clip_ctx * ctx, int img_w, int img_h);
int32_t clip_get_image_size(const struct clip_ctx * ctx);
int32_t clip_get_patch_size(const struct clip_ctx * ctx);
int32_t clip_get_hidden_size(const struct clip_ctx * ctx);
int clip_n_output_tokens(const struct clip_ctx * ctx, struct clip_image_f32 * img);
int clip_n_output_tokens_x(const struct clip_ctx * ctx, struct clip_image_f32 * img);
int clip_n_output_tokens_y(const struct clip_ctx * ctx, struct clip_image_f32 * img);
int clip_n_mmproj_embd(const struct clip_ctx * ctx);
bool clip_image_preprocess(struct clip_ctx * ctx, const struct clip_image_u8 * img, struct clip_image_f32_batch * res_imgs);
bool clip_image_encode(struct clip_ctx * ctx, int n_threads, struct clip_image_f32 * img, float * vec);
bool clip_image_batch_encode(struct clip_ctx * ctx, int n_threads, const struct clip_image_f32_batch * imgs, float * vec);
bool clip_has_vision_encoder(const struct clip_ctx * ctx);
bool clip_has_audio_encoder(const struct clip_ctx * ctx);
Import
#include "clip.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| fname | const char * | Yes | Path to the CLIP model file |
| ctx_params | struct clip_context_params | Yes | Initialization parameters (GPU, flash attention, token limits) |
| ctx | struct clip_ctx * | Yes | Initialized CLIP context for queries and encoding |
| img | struct clip_image_u8 * / struct clip_image_f32 * | Yes | Input image in uint8 or float32 format |
| n_threads | int | Yes | Number of threads for encoding |
| vec | float * | Yes | Caller-provided output buffer that receives the image embeddings |
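The table lists images in both uint8 and float32 form, and preprocessing is what bridges the two. As a rough illustration of one step such preprocessing commonly includes, the sketch below converts interleaved RGB bytes to floats and applies per-channel mean/std normalization. This is an assumption about the general technique, not a description of what `clip_image_preprocess` does internally: resizing, padding, and any model-specific slicing are omitted, `sketch_normalize_rgb` is a hypothetical name, and real mean/std values would come from the model.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of clip.h: convert interleaved RGB uint8
 * pixels to float32 and normalize each channel as (x / 255 - mean) / stddev. */
void sketch_normalize_rgb(const uint8_t * src, float * dst, size_t n_pixels,
                          const float mean[3], const float stddev[3]) {
    for (size_t i = 0; i < n_pixels; i++) {
        for (int c = 0; c < 3; c++) {
            /* scale the byte value to [0, 1] before normalizing */
            float x = (float) src[3 * i + c] / 255.0f;
            dst[3 * i + c] = (x - mean[c]) / stddev[c];
        }
    }
}
```

With a mean and standard deviation of 0.5 per channel, this maps the byte range [0, 255] onto [-1, 1].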
Outputs
| Name | Type | Description |
|---|---|---|
| clip_init | struct clip_init_result | Contains separate vision (ctx_v) and audio (ctx_a) contexts |
| clip_image_encode | bool | True on successful encoding of a single image |
| clip_image_batch_encode | bool | True on successful batch encoding |
| clip_n_output_tokens | int | Number of output tokens for the given image |
| clip_n_mmproj_embd | int | Dimension of the multimodal projector embedding (expected to match the text model's embedding dimension) |
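The token count and embedding dimension together determine how large the `vec` output buffer has to be. A minimal sketch of that arithmetic, assuming the output is a dense array holding one vector of `clip_n_mmproj_embd` floats per output token. The helper name `sketch_embd_nbytes` is hypothetical; in real code, `clip_embd_nbytes` and `clip_embd_nbytes_by_img` from the header are the authoritative way to obtain the size.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of clip.h: bytes needed for a buffer holding
 * n_tokens embeddings of n_embd floats each.
 * Returns 0 on non-positive input or if the size would overflow size_t. */
size_t sketch_embd_nbytes(int n_tokens, int n_embd) {
    if (n_tokens <= 0 || n_embd <= 0) {
        return 0;
    }
    size_t tokens = (size_t) n_tokens;
    size_t embd   = (size_t) n_embd;
    /* guard the multiplication against size_t overflow */
    if (tokens > SIZE_MAX / sizeof(float) / embd) {
        return 0;
    }
    return tokens * embd * sizeof(float);
}
```

For example, 576 output tokens with an embedding dimension of 4096 need 576 × 4096 × `sizeof(float)` bytes, which is 9,437,184 bytes when `float` is 4 bytes wide.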
Usage Examples
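// Initialize the CLIP model; fields not named here are zero-initialized
struct clip_context_params params = {
    .use_gpu = true,
    .flash_attn_type = CLIP_FLASH_ATTN_TYPE_AUTO,
    .warmup = true
};
struct clip_init_result result = clip_init("mmproj-model.gguf", params);
struct clip_ctx * ctx = result.ctx_v; // vision context; check for NULL before use
if (ctx != NULL) {
    // Query model properties
    int img_size = clip_get_image_size(ctx);
    int n_embd = clip_n_mmproj_embd(ctx);
    // Preprocess and encode an image.
    // Assumed to be prepared by the caller: img_u8 (the loaded image), batch
    // (the preprocessed-image batch), and embeddings (a float buffer large
    // enough for the output, see clip_embd_nbytes).
    bool ok = clip_image_preprocess(ctx, img_u8, &batch)
           && clip_image_batch_encode(ctx, 4, &batch, embeddings); // 4 = n_threads
    if (!ok) {
        // preprocessing or encoding failed; embeddings must not be used
    }
}
// Clean up both contexts returned by clip_init
if (result.ctx_v != NULL) {
    clip_free(result.ctx_v);
}
if (result.ctx_a != NULL) {
    clip_free(result.ctx_a);
}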