Implementation:Ggml org Llama cpp CLIP Header

From Leeroopedia
Revision as of 12:38, 16 February 2026 by Admin (Auto-imported from implementations/Ggml_org_Llama_cpp_CLIP_Header.md)
Knowledge Sources
Domains Multimodal, Vision
Last Updated 2026-02-15 00:00 GMT

Overview

Internal C API header for the CLIP module, defining the interface between the mtmd library and the CLIP vision/audio encoder implementation.

Description

This header forward-declares the core types (`clip_ctx`, `clip_image_size`, `clip_image_f32`, etc.) and declares two enumerations: `clip_modality`, which distinguishes vision from audio, and `clip_flash_attn_type`, which selects the flash attention configuration. It defines `clip_context_params`, which carries the initialization settings (GPU usage, flash attention, token limits, warmup), and `clip_init_result`, which returns separate vision and audio contexts. The functions it exposes cover model initialization and freeing, model property queries (image size, patch size, hidden size, output token counts, mmproj embedding dimension), image memory management, image preprocessing, batch encoding, and projector type detection. Audio input is supported by adding mel spectrograms to a batch.
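The split into separate vision and audio contexts can be illustrated with a small stand-alone sketch. Every `sketch_` name below is hypothetical and only mirrors the shape described above; the real definitions live in `clip.h` and may differ. The sketch also assumes that a missing encoder is reported as a `NULL` context, which the page does not confirm.

```c
#include <stddef.h>

/* Hypothetical mirrors of the declarations described above; the real
 * definitions live in clip.h and may differ in fields and values. */
struct sketch_clip_ctx; /* opaque type: only ever used through a pointer */

enum sketch_modality { SKETCH_MODALITY_VISION, SKETCH_MODALITY_AUDIO };

struct sketch_init_result {
    struct sketch_clip_ctx * ctx_v; /* vision context (assumed NULL if absent) */
    struct sketch_clip_ctx * ctx_a; /* audio context (assumed NULL if absent) */
};

/* Select the context for a modality. Returns whatever pointer the result
 * holds for that modality, so callers must check for NULL before use. */
static struct sketch_clip_ctx * sketch_pick_ctx(struct sketch_init_result res,
                                                enum sketch_modality m) {
    return m == SKETCH_MODALITY_VISION ? res.ctx_v : res.ctx_a;
}
```

Keeping the context type opaque means that code including the header depends only on the function interface, not on the layout of the encoder state.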

Usage

Use this header when working on the internal CLIP implementation or the mtmd library that wraps it. This is an internal header not intended for direct use by external consumers; applications should use the higher-level mtmd API instead.

Code Reference

Source Location

Signature

struct clip_init_result clip_init(const char * fname, struct clip_context_params ctx_params);
void clip_free(struct clip_ctx * ctx);

size_t clip_embd_nbytes(const struct clip_ctx * ctx);
size_t clip_embd_nbytes_by_img(const struct clip_ctx * ctx, int img_w, int img_h);

int32_t clip_get_image_size(const struct clip_ctx * ctx);
int32_t clip_get_patch_size(const struct clip_ctx * ctx);
int32_t clip_get_hidden_size(const struct clip_ctx * ctx);

int clip_n_output_tokens(const struct clip_ctx * ctx, struct clip_image_f32 * img);
int clip_n_output_tokens_x(const struct clip_ctx * ctx, struct clip_image_f32 * img);
int clip_n_output_tokens_y(const struct clip_ctx * ctx, struct clip_image_f32 * img);
int clip_n_mmproj_embd(const struct clip_ctx * ctx);

bool clip_image_preprocess(struct clip_ctx * ctx, const struct clip_image_u8 * img, struct clip_image_f32_batch * res_imgs);
bool clip_image_encode(struct clip_ctx * ctx, int n_threads, struct clip_image_f32 * img, float * vec);
bool clip_image_batch_encode(struct clip_ctx * ctx, int n_threads, const struct clip_image_f32_batch * imgs, float * vec);

bool clip_has_vision_encoder(const struct clip_ctx * ctx);
bool clip_has_audio_encoder(const struct clip_ctx * ctx);

Import

#include "clip.h"

I/O Contract

Inputs

Name | Type | Required | Description
fname | const char * | Yes | Path to the CLIP model file
ctx_params | struct clip_context_params | Yes | Initialization parameters (GPU, flash attention, token limits)
ctx | struct clip_ctx * | Yes | Initialized CLIP context for queries and encoding
img | struct clip_image_u8 * / struct clip_image_f32 * | Yes | Input image in uint8 or float32 format
n_threads | int | Yes | Number of threads for encoding
vec | float * | Yes | Output buffer for image embeddings

Outputs

Name | Type | Description
clip_init | struct clip_init_result | Contains separate vision (ctx_v) and audio (ctx_a) contexts
clip_image_encode | bool | True on successful encoding of a single image
clip_image_batch_encode | bool | True on successful batch encoding
clip_n_output_tokens | int | Number of output tokens for the given image
clip_n_mmproj_embd | int | Dimension of the multimodal projector embedding (matches text model)
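Because `vec` is a caller-provided `float *` buffer, it has to be sized before encoding. A minimal sketch of that arithmetic follows, assuming the output is one row of `clip_n_mmproj_embd` floats per output token; the helper name `sketch_embd_nbytes` is hypothetical and not part of `clip.h`.

```c
#include <stddef.h>

/* Bytes needed for an embedding buffer of n_tokens rows, each holding
 * n_embd float values. Returns 0 for non-positive inputs.
 * Hypothetical helper for illustration; not part of clip.h. */
static size_t sketch_embd_nbytes(int n_tokens, int n_embd) {
    if (n_tokens <= 0 || n_embd <= 0) {
        return 0;
    }
    /* Widen before multiplying so the product is computed in size_t. */
    return (size_t) n_tokens * (size_t) n_embd * sizeof(float);
}
```

With the real API, the two arguments would come from `clip_n_output_tokens(ctx, img)` and `clip_n_mmproj_embd(ctx)`. The header's own `clip_embd_nbytes` and `clip_embd_nbytes_by_img` appear to serve the same purpose and are likely the better choice where they apply.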

Usage Examples

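// Initialize CLIP model; fields not listed here are zero-initialized
struct clip_context_params params = {
    .use_gpu = true,
    .flash_attn_type = CLIP_FLASH_ATTN_TYPE_AUTO,
    .warmup = true
};
struct clip_init_result result = clip_init("mmproj-model.gguf", params);
struct clip_ctx * ctx = result.ctx_v; // vision context; assumed NULL if not loaded

if (ctx != NULL) {
    // Query model properties
    int img_size = clip_get_image_size(ctx);
    int n_embd = clip_n_mmproj_embd(ctx);

    // Preprocess and encode an image.
    // img_u8, batch, and embeddings are assumed to be prepared by the caller;
    // embeddings must be large enough to hold the output for the whole batch.
    bool ok = clip_image_preprocess(ctx, img_u8, &batch)
           && clip_image_batch_encode(ctx, 4, &batch, embeddings); // 4 = n_threads
    if (!ok) {
        // Handle the preprocessing or encoding failure here
    }
}

// Clean up every context returned by clip_init
if (result.ctx_v != NULL) clip_free(result.ctx_v);
if (result.ctx_a != NULL) clip_free(result.ctx_a);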
