Implementation:Ggml org Llama cpp CLIP Header

Knowledge Sources	Ggml_org_Llama_cpp
Domains	Multimodal, Vision
Last Updated	2026-02-15 00:00 GMT

Overview

Internal C API header for the CLIP module, defining the interface between the mtmd library and the CLIP vision/audio encoder implementation.

Description

This header forward-declares types such as `clip_ctx`, `clip_image_size`, and `clip_image_f32`, and defines two enumerations: `clip_modality`, which distinguishes vision from audio, and `clip_flash_attn_type`, which configures flash attention. It defines `clip_context_params` for initialization settings (GPU usage, flash attention, token limits, warmup) and `clip_init_result`, the return type of `clip_init`, which holds separate vision and audio contexts. The header exposes functions for initializing and freeing a model, querying model properties (image size, patch size, hidden size, output token counts, mmproj embedding dimension), managing image memory, preprocessing images, encoding single images or batches, and detecting the projector type. Audio input is supported by adding mel spectrograms to a batch.
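Because `clip_init_result` carries one context per modality, calling code typically has to pick the right one before querying or encoding. The sketch below illustrates that selection with stand-in declarations so it compiles without `clip.h`. The enumerator names `CLIP_MODALITY_VISION` and `CLIP_MODALITY_AUDIO`, the body of `struct clip_ctx`, and the helper `pick_ctx` are assumptions made for illustration and are not taken from this page.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in declarations so the sketch compiles on its own.
   The real definitions live in tools/mtmd/clip.h. */
struct clip_ctx { int id; };            /* hypothetical body; the real type is opaque */

enum clip_modality {                    /* enumerator names are assumed */
    CLIP_MODALITY_VISION,
    CLIP_MODALITY_AUDIO
};

struct clip_init_result {
    struct clip_ctx * ctx_v;            /* vision context, may be NULL */
    struct clip_ctx * ctx_a;            /* audio context, may be NULL */
};

/* Hypothetical helper (not part of clip.h): return the context that
   matches the requested modality, or NULL if none was loaded. */
struct clip_ctx * pick_ctx(struct clip_init_result res, enum clip_modality m) {
    assert(m == CLIP_MODALITY_VISION || m == CLIP_MODALITY_AUDIO);
    return m == CLIP_MODALITY_AUDIO ? res.ctx_a : res.ctx_v;
}
```

A caller would check the returned pointer for NULL before passing it to any query or encode function.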

Usage

Use this header when working on the internal CLIP implementation or the mtmd library that wraps it. This is an internal header not intended for direct use by external consumers; applications should use the higher-level mtmd API instead.

Code Reference

Source Location

Repository: Ggml_org_Llama_cpp
File: tools/mtmd/clip.h
Lines: 1-121

Signature

struct clip_init_result clip_init(const char * fname, struct clip_context_params ctx_params);
void clip_free(struct clip_ctx * ctx);

size_t clip_embd_nbytes(const struct clip_ctx * ctx);
size_t clip_embd_nbytes_by_img(const struct clip_ctx * ctx, int img_w, int img_h);

int32_t clip_get_image_size(const struct clip_ctx * ctx);
int32_t clip_get_patch_size(const struct clip_ctx * ctx);
int32_t clip_get_hidden_size(const struct clip_ctx * ctx);

int clip_n_output_tokens(const struct clip_ctx * ctx, struct clip_image_f32 * img);
int clip_n_output_tokens_x(const struct clip_ctx * ctx, struct clip_image_f32 * img);
int clip_n_output_tokens_y(const struct clip_ctx * ctx, struct clip_image_f32 * img);
int clip_n_mmproj_embd(const struct clip_ctx * ctx);

bool clip_image_preprocess(struct clip_ctx * ctx, const struct clip_image_u8 * img, struct clip_image_f32_batch * res_imgs);
bool clip_image_encode(struct clip_ctx * ctx, int n_threads, struct clip_image_f32 * img, float * vec);
bool clip_image_batch_encode(struct clip_ctx * ctx, int n_threads, const struct clip_image_f32_batch * imgs, float * vec);

bool clip_has_vision_encoder(const struct clip_ctx * ctx);
bool clip_has_audio_encoder(const struct clip_ctx * ctx);
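To give a feel for how the image size and patch size queries relate to token counts, the sketch below computes the number of patches in a plain Vision Transformer tiling. It assumes a square input that divides evenly into square patches and no token merging or resampling; the helper `vit_patch_count` is hypothetical and not part of `clip.h`. Real projectors may produce a different count, which is why the header provides `clip_n_output_tokens` for the authoritative value.

```c
#include <assert.h>

/* Hypothetical helper (not part of clip.h): number of patches for a square
   image of image_size pixels cut into patch_size x patch_size tiles. */
int vit_patch_count(int image_size, int patch_size) {
    assert(image_size > 0 && patch_size > 0);
    assert(image_size % patch_size == 0);   /* sketch assumes an exact tiling */
    int per_side = image_size / patch_size; /* patches along one edge */
    return per_side * per_side;
}
```

For example, a 336-pixel input with 14-pixel patches gives 24 patches per side, so `vit_patch_count(336, 14)` evaluates to 576.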

Import

#include "clip.h"

I/O Contract

Inputs

Name	Type	Required	Description
fname	const char *	Yes	Path to the CLIP model file
ctx_params	struct clip_context_params	Yes	Initialization parameters (GPU, flash attention, token limits)
ctx	struct clip_ctx *	Yes	Initialized CLIP context for queries and encoding
img	struct clip_image_u8 * / struct clip_image_f32 *	Yes	Input image in uint8 or float32 format
n_threads	int	Yes	Number of threads for encoding
vec	float *	Yes	Output buffer for image embeddings

Outputs

Name	Type	Description
clip_init	struct clip_init_result	Contains separate vision (ctx_v) and audio (ctx_a) contexts
clip_image_encode	bool	True on successful encoding of a single image
clip_image_batch_encode	bool	True on successful batch encoding
clip_n_output_tokens	int	Number of output tokens for the given image
clip_n_mmproj_embd	int	Embedding dimension of the multimodal projector output (matches the text model's embedding dimension)
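The two integer outputs above can be combined to estimate how large the `vec` output buffer must be. The sketch assumes, without confirmation from this page, that the encoder writes one float vector of `clip_n_mmproj_embd` elements per output token. The helper `embd_buffer_bytes` is hypothetical; in real code `clip_embd_nbytes` or `clip_embd_nbytes_by_img` should be treated as the authoritative source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of clip.h): bytes needed to hold
   n_tokens embedding vectors of n_embd floats each. */
size_t embd_buffer_bytes(int n_tokens, int n_embd) {
    assert(n_tokens >= 0 && n_embd >= 0);
    /* Widen to size_t before multiplying to avoid int overflow. */
    return (size_t) n_tokens * (size_t) n_embd * sizeof(float);
}
```

A caller would pass the results of `clip_n_output_tokens` and `clip_n_mmproj_embd` to this helper and allocate at least that many bytes before calling an encode function.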

Usage Examples

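// Initialize CLIP model
struct clip_context_params params = {
    .use_gpu = true,
    .flash_attn_type = CLIP_FLASH_ATTN_TYPE_AUTO,
    .warmup = true
};
struct clip_init_result result = clip_init("mmproj-model.gguf", params);
struct clip_ctx * ctx = result.ctx_v; // vision context; result.ctx_a is the audio context
if (!ctx) {
    // handle the missing vision context before going further
}

// Query model properties
int img_size = clip_get_image_size(ctx);
int n_embd = clip_n_mmproj_embd(ctx);

// Preprocess and encode an image.
// Assumed to be prepared by the caller (not shown):
//   img_u8     : const struct clip_image_u8 * holding the source pixels
//   batch      : struct clip_image_f32_batch receiving the preprocessed images
//   embeddings : float * buffer large enough for the encoder output
if (!clip_image_preprocess(ctx, img_u8, &batch) ||
    !clip_image_batch_encode(ctx, 4, &batch, embeddings)) { // 4 = n_threads
    // handle preprocessing or encoding failure
}

// Clean up both contexts returned by clip_init
if (result.ctx_v) clip_free(result.ctx_v);
if (result.ctx_a) clip_free(result.ctx_a);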

Related Pages

Principle:Ggml_org_Llama_cpp_Multimodal
