Implementation:Ollama Ollama Clip API

Knowledge Sources	Ollama
Domains	Multimodal, API
Last Updated	2025-02-15 00:00 GMT

Overview

Internal C API header for the CLIP vision/audio encoder, defining the interface boundary between the mtmd multimodal library and the lower-level CLIP encoder implementation.

Description

Declares opaque types (clip_ctx, clip_image_size, clip_image_u8, clip_image_f32, clip_image_f32_batch), enumerations for modality (CLIP_MODALITY_VISION, CLIP_MODALITY_AUDIO) and flash attention modes, and the clip_context_params and clip_init_result structs. Exposes the C functions for initialization (clip_init), memory management (clip_free and the image alloc/free helpers), image construction from raw pixels (clip_build_img_from_pixels), preprocessing (clip_image_preprocess), encoding (clip_image_encode, clip_image_batch_encode, clip_encode_float_image), and model property queries (image size, patch size, hidden size, output token counts, M-RoPE detection).
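Because clip_ctx is opaque, callers only ever hold pointers to it and choose the one that matches the modality they want to encode. The sketch below illustrates that pattern with local stand-ins for the declarations so that it compiles without the Ollama tree. The ctx_v field appears in the usage example on this page; the ctx_a field name and the select_ctx helper are assumptions made for illustration and should be checked against clip.h.

```c
#include <stddef.h>

/* Local stand-ins for declarations that real code gets from clip.h. */
enum clip_modality {
    CLIP_MODALITY_VISION,
    CLIP_MODALITY_AUDIO,
};

struct clip_ctx; /* opaque: only ever handled through pointers */

struct clip_init_result {
    struct clip_ctx * ctx_v; /* vision context (used in the usage example) */
    struct clip_ctx * ctx_a; /* audio context (field name is an assumption) */
};

/* Hypothetical helper: return the context for the requested modality,
 * or NULL when the model did not provide one. */
static struct clip_ctx * select_ctx(struct clip_init_result res,
                                    enum clip_modality modality) {
    switch (modality) {
        case CLIP_MODALITY_VISION: return res.ctx_v;
        case CLIP_MODALITY_AUDIO:  return res.ctx_a;
    }
    return NULL; /* value outside the enum */
}
```

A caller would check the returned pointer for NULL before passing it to any query or encode function.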

Usage

Included by mtmd.cpp and mtmd-helper.cpp to access the CLIP encoder. Marked as an internal header -- not intended for direct use outside the mtmd subsystem.

Code Reference

Source Location

Repository: Ollama
File: llama/llama.cpp/tools/mtmd/clip.h
Lines: 1-118

Signature

enum clip_modality {
    CLIP_MODALITY_VISION,
    CLIP_MODALITY_AUDIO,
};

enum clip_flash_attn_type {
    CLIP_FLASH_ATTN_TYPE_AUTO     = -1,
    CLIP_FLASH_ATTN_TYPE_DISABLED = 0,
    CLIP_FLASH_ATTN_TYPE_ENABLED  = 1,
};

struct clip_context_params {
    bool use_gpu;
    enum clip_flash_attn_type flash_attn_type;
    int image_min_tokens;
    int image_max_tokens;
    bool warmup;
};

struct clip_init_result clip_init(const char * fname, struct clip_context_params ctx_params);
void clip_free(struct clip_ctx * ctx);
bool clip_image_preprocess(struct clip_ctx * ctx, const struct clip_image_u8 * img,
                           struct clip_image_f32_batch * res_imgs);
bool clip_image_encode(struct clip_ctx * ctx, int n_threads,
                       struct clip_image_f32 * img, float * vec);
int clip_n_output_tokens(const struct clip_ctx * ctx, struct clip_image_f32 * img);
int clip_n_mmproj_embd(const struct clip_ctx * ctx);
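Since clip_context_params is a plain struct, C99 designated initializers can name each field at the call site, which is easier to audit than a positional list of booleans and integers. The sketch below redeclares the enum and struct from the signature above so that it compiles on its own; real code would include clip.h instead. The helper name example_default_params is hypothetical, and the values mirror the usage example on this page rather than any documented defaults.

```c
#include <stdbool.h>

/* Local copies of the declarations shown above, so the sketch compiles
 * on its own. Real code would #include "clip.h" instead. */
enum clip_flash_attn_type {
    CLIP_FLASH_ATTN_TYPE_AUTO     = -1,
    CLIP_FLASH_ATTN_TYPE_DISABLED = 0,
    CLIP_FLASH_ATTN_TYPE_ENABLED  = 1,
};

struct clip_context_params {
    bool use_gpu;
    enum clip_flash_attn_type flash_attn_type;
    int image_min_tokens;
    int image_max_tokens;
    bool warmup;
};

/* Hypothetical helper: build a parameter set with every field named.
 * The values are illustrative and copied from this page's usage example. */
static struct clip_context_params example_default_params(void) {
    struct clip_context_params p = {
        .use_gpu          = true,
        .flash_attn_type  = CLIP_FLASH_ATTN_TYPE_AUTO,
        .image_min_tokens = -1,
        .image_max_tokens = -1,
        .warmup           = true,
    };
    return p;
}
```

Designated initializers require C99 or C++20, so a C++ translation unit built against an older standard would need to assign the fields one by one.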

Import

#include "clip.h"

I/O Contract

Inputs

Name	Type	Required	Description
fname	const char *	Yes	Path to GGUF model file
ctx_params	clip_context_params	Yes	Configuration for GPU, flash attention, token limits
img	const clip_image_u8 *	Yes	Raw RGB image passed to clip_image_preprocess

Outputs

Name	Type	Description
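clip_init_result	struct	Returned by clip_init; holds the vision and audio clip_ctx pointers
vec	float *	Caller-allocated buffer that clip_image_encode fills with the embedding
n_tokens	int	Number of output tokens for the given image, as returned by clip_n_output_tokens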
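The caller owns vec, so the buffer has to be sized before clip_image_encode is called. The following is a minimal sketch, assuming the encoder writes one vector of clip_n_mmproj_embd(ctx) floats per output token, for a total of clip_n_output_tokens(ctx, img) * clip_n_mmproj_embd(ctx) floats; that layout is an assumption to verify against clip.h. The helper names embd_buffer_bytes and alloc_embd_buffer are hypothetical, and they take the two counts as plain integers so that the block compiles without the library.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: bytes needed for n_tokens vectors of n_embd floats.
 * Returns 0 for non-positive inputs or if the size would overflow size_t. */
static size_t embd_buffer_bytes(int n_tokens, int n_embd) {
    if (n_tokens <= 0 || n_embd <= 0) {
        return 0;
    }
    size_t tokens = (size_t) n_tokens;
    size_t embd   = (size_t) n_embd;
    if (tokens > SIZE_MAX / sizeof(float) / embd) {
        return 0; /* tokens * embd * sizeof(float) would not fit */
    }
    return tokens * embd * sizeof(float);
}

/* Hypothetical helper: allocate the output buffer. Returns NULL when the
 * size is invalid or malloc fails; the caller releases it with free(). */
static float * alloc_embd_buffer(int n_tokens, int n_embd) {
    size_t n_bytes = embd_buffer_bytes(n_tokens, n_embd);
    if (n_bytes == 0) {
        return NULL;
    }
    return (float *) malloc(n_bytes);
}
```

In real code the two counts would come from clip_n_output_tokens(ctx, img) and clip_n_mmproj_embd(ctx), and the returned pointer would be passed as the vec argument.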

Usage Examples

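// Positional order: use_gpu, flash_attn_type, image_min_tokens, image_max_tokens, warmup
struct clip_context_params params = {true, CLIP_FLASH_ATTN_TYPE_AUTO, -1, -1, true};
struct clip_init_result result = clip_init("mmproj.gguf", params);
struct clip_ctx * ctx = result.ctx_v;  // vision context; may be NULL if loading failed

if (ctx != NULL) {
    // Query model properties only on a valid context
    int n_embd = clip_n_mmproj_embd(ctx);
    int patch_size = clip_get_patch_size(ctx);
    bool has_vision = clip_has_vision_encoder(ctx);
    bool has_audio = clip_has_audio_encoder(ctx);

    clip_free(ctx);
}
if (result.ctx_a != NULL) {
    clip_free(result.ctx_a);  // release the audio context too, if the model provided one
}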

Related Pages

Principle:Ollama_Ollama_MultimodalPipeline
